The US government has asked for our thoughts on how the NSF and other federal agencies should disseminate research results.
I come to the question with a passion for the effective use of research results, a commitment to practical solutions, and naivete about how the NSF actually works. From this perspective I offer my current rough thoughts on what the NSF — and all public science funders — should do with regard to dissemination of digital research results (= datasets, code, and publications, as we know them and in their future incarnations). I wrote them down to help me think through my responses to the RFIs.
So far the number of online responses to the RFI for data has been very low. Answering open-ended questions is difficult. Critiquing a straw man is easier, a lot more fun, and often just as revealing. Where does the vision below differ from yours? It is interesting, for example, to compare it to the vision articulated in this response — also from proponents of open research data.
Do you have strong opinions? You can respond to the White House till Jan 12th and the National Science Board till Jan 18th. Feel free to reuse text and ideas directly from here for agreement or critique.
A vision for NSF infrastructure for digital research results
(considering digital research results to include datasets, code, and publications — as we know them and in their future incarnations)
- A public science funder has both a right and a responsibility to communicate its findings in the most generative form it can. Projects funded with public money must be conducted under this premise.
- Effective communication of research results requires strong statements of principles, enforceable policies, and useful infrastructure.
- Individual disciplines and communities can opt out of funder-wide approaches if they make a strong public case that the principles and goals do not apply to their area, or that they plan to achieve the same goals in a different but equally effective way.
- Costs for disseminating research results in accordance with a funder’s requirements should be included and funded as part of the cost of doing a research project. This money should be used by investigators to pay for publication services — including editing, registering, reviewing, certifying, hosting, publicising, and preserving — in a competitive marketplace.
- Anticipated benefits from disseminating a project’s research results should be included in proposal evaluation. Disseminating research results is the responsibility of the PI. Research results that have not been disseminated in accordance with policy will not be acknowledged as output of the grant for the purposes of evaluation. Measured impact of a project’s research results, interpreted broadly, will be used in evaluation of the project and its investigators.
- The intellectual property rights of researchers and institutions must be respected, but cannot infringe on the rights of the public and the scientific community to replicate and build on funded research findings in a timely manner.
- A science funder and its infrastructure for reporting research results must be nimble to stay abreast of changing norms, needs, and technologies.
- The effects of adopted policies and infrastructure tools must be systematically monitored and adjusted accordingly. Decisions should be informed by evidence whenever possible. A funder should fund collection of the relevant, actionable evidence it needs to make decisions on policy and tools: research for more effective research.
- Infrastructure should use existing commodity software when it meets the needs at a competitive price, with a preference for open source software. When funders develop their own infrastructure software it should be open source to allow outside contributions, customizations, and tailored solutions.
- Science communication infrastructure should be open to findings from all funders whenever incremental costs can be recovered.
Policies (to be enacted after 2 years’ notice)
- Immediate online open access to the published article-of-record and the data and code that support its findings. Embargoes or exceptions may be granted at the discretion of the program officer, especially for sensitive information such as human subject data or the location of endangered species.
- The openly available article-of-record, data, and code must be registered with the funder to be considered research findings of the grant.
- Research results must be made openly available to anyone, for any purpose, with the sole condition of appropriate attribution.
- Articles, data, and code must be available for 50 (?) years after completion of the research project.
- Publicly funded websites and software must report use of research results through funder-compatible impact-tracking infrastructure.
This is infrastructure that funders ought to obtain and fund directly and continuously.
- Unique identifiers. Unique IDs for each investigator, grant, institution, and license variety. Also unique IDs for each research result (see below). Some initiatives are already underway. There should be inter-funder and international coordination.
- An open registry of research results. Investigators (or their publishers) must register research results here immediately upon article publication. Data and code that support the article findings must be registered at the same time (or data and code may also be registered without an associated article).
At a minimum, a record includes fields for: research object type (article, data, code), research object ID, license, grant ID, investigator IDs, institution IDs, publisher ID, an abstract, keywords about the topic and methods, metadata about embargoes or exemptions, and links to other directly associated research object IDs. The research object ID must resolve:
- for articles, to the full text of the article-of-record (in XML format or similar)
- for data, to the full dataset and associated metadata
- for code, to the snapshot version of the code
- The registry must have a read/write API, and would ideally be extensible through third-party client-based add-ons to support discipline-based customization.
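To make the minimum record concrete, here is a sketch of one registry entry and a validation check. The field names, ID formats, and the `validate` helper are all illustrative assumptions, not a proposed standard:

```python
# Sketch of a minimal registry record; field names and ID
# formats are hypothetical, chosen only to mirror the fields
# listed above.
record = {
    "object_type": "article",               # article | data | code
    "object_id": "obj:example-article-1",   # must resolve to the object itself
    "license": "CC-BY",
    "grant_id": "grant:example-0000000",
    "investigator_ids": ["investigator:example-1"],
    "institution_ids": ["institution:example-university"],
    "publisher_id": "publisher:example-press",
    "abstract": "One-paragraph summary of the work.",
    "keywords": ["topic:ecology", "method:regression"],
    "embargo": None,                        # or metadata about an exemption
    "linked_object_ids": ["obj:example-dataset-1"],
}

def validate(rec):
    """Check that the minimum fields described above are present."""
    required = {"object_type", "object_id", "license", "grant_id",
                "investigator_ids", "institution_ids", "publisher_id",
                "abstract", "keywords", "embargo", "linked_object_ids"}
    missing = required - rec.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if rec["object_type"] not in {"article", "data", "code"}:
        raise ValueError("object_type must be article, data, or code")
    return True
```

A read/write API over records like this would let publishers deposit entries at publication time and let third parties build discipline-specific views on top.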
- A mechanism to gather and report raw data on the impact of research findings. This could be done by adopting and hosting an open-source web analytics solution like Piwik (http://piwik.org) and requiring that all funded initiatives report usage statistics for activity (reading, commenting, remixing, etc) involving research results.
Impact metrics would also be populated by machine extraction of citations from the article text linked to through the open registry of research results, thereby providing open access to citation data that the commercial sphere has not facilitated. Funder customizations to the source should remain open source to facilitate additional contributions and reuse. The raw impact data should be made widely available to fuel innovative products in discovery and filtering of research reports, next-generation bibliometric assessment, and policy evaluation.
- Dissemination and preservation of research results. Although successful discipline-wide models already exist for papers (journals and conferences) and some types of data, many disciplines lack appropriate data repositories and few disciplines have determined appropriate hosting and archiving solutions for code. Research funders should provide seed funding for such infrastructure until it can survive as a sustainable service like those discussed below.
It is assumed that scholarly societies, non-profits, and the commercial sector will continue to offer value-added services such as editing, mark-up, layout, organizing pre-publication peer review, certification, dissemination, and archiving. These are envisioned as services that would be paid for with line-item funds from the project research budget for publications, data, and code. It is unlikely that funders will need to fund infrastructure in this services sector, other than occasional seed funding for innovative approaches.
The NSF should compute the subscription fees institutional libraries pay today to gain access to NSF-funded research (for example, total subscription fees × (number of articles reporting on NSF-funded research projects / total number of articles)).
Starting two years from now (synchronized with the requirement that all final research products be openly accessible, and the expectation that all research project budgets include line items for publication services), it should decrease indirect cost payments by this amount.
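The offset arithmetic above can be sketched in a few lines. All the numbers here are made up purely for illustration; only the formula comes from the text:

```python
# Back-of-the-envelope sketch of the indirect-cost offset.
# Every input value below is hypothetical.
total_subscription_fees = 10_000_000  # annual library spend (illustrative)
total_articles = 50_000               # articles those fees buy access to
nsf_funded_articles = 5_000           # the NSF-funded subset

# Fees attributable to accessing NSF-funded research:
nsf_share = total_subscription_fees * (nsf_funded_articles / total_articles)
# With these hypothetical numbers, nsf_share == 1_000_000.0,
# i.e. the amount by which indirect cost payments would drop.
```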
Not included in the articulation of this vision, mostly due to time:
- Is it worth it to archive Very Large Data? Dryad can archive up to 10 GB per paper… this covers the data that support findings for most investigator-driven research, but not all. I don’t know what to do with the really big stuff. The GigaScience model?
- How this ought to sync up with availability of research proposals.
- Advocating for challenges and prizes.
- The goal isn’t availability; it is use.
- focus on both replication *and* reuse. Neither is sufficient alone.
- The focus here is on data associated with final research findings. There is much to suggest that other datasets should be made available for reuse as well!
- Lots more needs to be done about annotating the kind of contribution a citation makes to a research result.
- Is 2 years the right amount of time before the policy goes into effect? Is 50 years right for preservation? I don’t know, but they are the right order of magnitude.
- Embargoes should be used to permit exclusive use by investigators when appropriate during this transition period (particularly for longitudinal datasets that take years to collect), active IP explorations, etc. The length of the embargo should be as short as is reasonable and must be specified at the time of research reporting.