I’ve had this draft kicking around for three months, waiting for me to give it a bit of love to make it, well, less preachy. In the meantime, though, it is getting old and the message isn’t out. I’d like to send it to PLoS Biology as a Perspective. What do you think, should I just hit send, or do you have some quick recommendations to help me improve it and get it out the door? Suggestions in comments, email, or at this Friendfeed thread are welcome!
A new task for NSF reviewers: Recognizing the value of data reuse
Heather A Piwowar
As of January 2011, the NSF requires that all grant applications include a data management plan. NSF grant reviewers have a new responsibility: evaluating whether a given data management plan adds to the merit of a proposal.
This responsibility ought to be taken seriously. Policies without attention often become merely an exercise in paper-pushing. Large NIH grants also require a data sharing plan, but the NIH explicitly disallows consideration of this plan as part of the merit criteria. As a result, many investigators dismiss the NIH guidance as “toothless”, and evidence suggests it has little effect on rates of data sharing.
When we don’t value data sharing, we essentially disregard potential scientific contributions. Data sharing has clearly benefited scientific progress in several arenas (genetics, for example, thanks to open archiving of DNA sequences), but many consider the circumstances of these runaway successes to be unique.
To provide a quantitative estimate of the magnitude of data reuse in a different domain, my colleagues and I recently attempted to identify reuse of a relatively messy and complex datatype: gene expression microarray data [3, 4]. In brief, we searched the full text of articles in PubMed Central for mentions of accession numbers assigned to datasets submitted to the Gene Expression Omnibus (GEO) data archive in 2007. We considered any PubMed Central article that mentioned a GEO accession number and had an author list overlapping the original dataset submission group to be a use by the original data collection team. The remaining PubMed Central articles that mentioned GEO accession numbers were considered third-party reuse. Extrapolating these findings to all of PubMed, we estimate that third-party reuse accounts for at least an additional 35% of data use beyond the contributions made by the original investigator team. Importantly, the number of dataset reuses is still accumulating rapidly, unlike uses by the original investigators.
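The self-use versus third-party classification step described above can be sketched in a few lines of Python. This is a minimal illustration, not the study’s actual code: the accession numbers, author names, and the exact-string author matching below are all hypothetical assumptions (the real analysis worked from GEO submission records and full article metadata).

```python
import re

# GEO series accession numbers take the form "GSE" followed by digits.
GEO_ACCESSION = re.compile(r"\bGSE\d+\b")

def classify_mentions(article_text, article_authors, submissions):
    """Classify each tracked GEO accession mentioned in an article.

    An accession counts as "self-use" if the article's author list
    overlaps the submitting group, and "third-party reuse" otherwise.
    `submissions` maps accession -> set of submitter names.
    """
    results = {}
    for accession in set(GEO_ACCESSION.findall(article_text)):
        submitters = submissions.get(accession)
        if submitters is None:
            continue  # not one of the 2007 submissions being tracked
        overlap = submitters & set(article_authors)
        results[accession] = "self-use" if overlap else "third-party reuse"
    return results

# Toy example with made-up names and accession numbers:
submissions = {"GSE10001": {"Smith J", "Lee K"}}
article = "We reanalyzed the expression data deposited as GSE10001."
print(classify_mentions(article, ["Park H", "Cho M"], submissions))
# -> {'GSE10001': 'third-party reuse'}
```

In practice, author-name matching is noisier than this exact-string comparison suggests (initials, name changes, common surnames), which is one reason the resulting reuse counts are best read as lower bounds.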
Perhaps the majority of these reuses come from just a small number of the dataset submissions, rendering large-scale archiving unnecessary? This is not the case. Of the 2711 datasets, 364 (13%) were referred to by at least one PubMed Central article whose authors did not include the dataset-submitters. (Note the 13% only includes reuses observed in PMC and thus represents a small subset of reuses in all of PubMed or the scientific literature more broadly.)
By valuing data sharing plans that promise to unleash these additional reuses, NSF reviewers can ensure we get maximum scientific progress for our funding dollar. Rewarding investigators who take this step will encourage others to make similar plans. Such rewards matter because, otherwise, data sharing is often seen as a competitive disadvantage for individual investigators. Encouraging investigators to include data sharing costs within project budgets, respecting data publication embargoes and exceptions where needed to protect sensitive data, and rewarding dataset reuse will also foster behaviour that supports the common good.
According to the general NSF guidelines, “The Data Management Plan will be reviewed as an integral part of the proposal, coming under Intellectual Merit or Broader Impacts or both, as appropriate for the scientific community of relevance.” Arguments can certainly be made that enabling broad and multidisciplinary contributions to a field through access to data helps achieve broader impacts. Crucially, the case for intellectual merit is also strongly supported by the NSF Grant Proposal Guide, which includes the criteria “Is there sufficient access to resources?” and “How important is the proposed activity to advancing knowledge and understanding within its own field or across different fields?” These issues deserve the attention of NSF review panels.
Providing widespread access to the dataset that underlies statistical findings is often key to building upon robust evidence. Some plans take this a step further. The journal Biostatistics now highlights the publication of truly reproducible results. Authors are encouraged to submit supporting data and code, and a “reproducibility editor” attempts to replicate the findings: articles whose results prove reproducible receive a special indication on the first page. Publishing in such a venue surely increases the understanding of research findings, and allows them to be built upon more efficiently and effectively. Plans to publish with this rigor ought to contribute to the recognized worth of a project.
Of course rewarding plans only works if they are indeed carried out. Several of the directorates make it clear that compliance will indeed be monitored. For example, the Engineering directorate states that compliance to plans will be considered in subsequent proposals by the PI and Co-PIs under “Results of prior NSF support.” Other directorates should clarify the implications of non-compliance with proposed data management plans.
In addition to valuing data sharing as part of the scoring of a proposal, NSF reviewers have the opportunity to help communicate and guide community norms through their review comments to investigators. Reviewers should acquaint themselves with best practices in data management so that suggestions can be made when appropriate. For example, evidence abounds that sharing upon request is often unreliable and discriminatory, and that data hosted on lab websites is not sufficient for long-term preservation: data archives are a much better solution (and easy, too!).
To increase transparency, this implicit articulation of community expectations ought to be publicly stated whenever possible, particularly as the expectations evolve. Many NSF directorates have jump-started this process by issuing specific guidance, and that guidance does indeed reflect differing community norms. For example, expectations around the timeliness of data availability range from “immediately upon study publication” [NSF FAQ, ENG Directorate], to “data should be submitted as soon as possible, but no later than two years after collection” [OCE Directorate], to “within one year of the expiration of the grant award” [SES Directorate]. NSF directorates that have not yet issued specific guidance (e.g., Biological Sciences, Computer & Information Science & Engineering) ought to do so immediately, even if only to emphasize their commitment to serious evaluation of data sharing plans.
I hope the NIH and other medical funders quickly follow through on their recent Joint Statement of Purpose to expand efforts in data dissemination, and begin to include dissemination plans in funding proposal evaluations; the sooner we give our reviewers the ability to recognize scientific data sharing as part of the intrinsic value of a research proposal, the sooner we will all benefit.
1. Tucker J. Motivating Subjects: Data Sharing in Cancer Research. PhD Dissertation, Science and Technology Studies, Virginia Tech (2009). http://scholar.lib.vt.edu/theses/available/etd-09182009-161937/
2. Piwowar HA. Who shares? Who doesn’t? Factors associated with openly archiving raw research data. PLoS ONE [accepted].
3. Piwowar HA, Vision TJ, Whitlock MC (2011). Data archiving is a good investment. Nature, 473 (7347), 285-285 DOI: 10.1038/473285a. (preprint text)
4. Piwowar HA, Vision TJ, Whitlock MC (2011) Data from: Data archiving is a good investment. Dryad Digital Repository. doi:10.5061/dryad.j1fd7
Edited a few hours after posting to reword and reorganize.