I’m happy to say that we’ve just had a Letter to the Editor published in Nature:
Piwowar, HA, Vision, TJ, & Whitlock, MC (2011). Data archiving is a good investment. Nature, 473 (7347), 285. DOI: 10.1038/473285a
We hope publishing the argument in this high-visibility venue will inspire hallway conversations amongst scientists and influence how they view long-term funding for data archives, particularly those scientists who also wear hats in funding agencies!
The letter is currently behind a paywall. As is permitted by Nature’s preprint policies, I include the text we initially submitted below. It is very similar to what appears in the final article (linked above).
The published letter is also very short. The original article-length draft is at the bottom of this post. Needless to say, it includes nuances lost in the shorter versions. The README associated with the data has additional information about methods.
While doing this research I wrote a few blog posts about my methods and early results. Here are the links:
- Studying Reuse Of GEO Datasets In The Published Literature
- Early Results: Public Data Archiving Increases Scientific Contribution By More Than A Third
- Rough Estimate Of Papers Per Dollar
The data behind the analysis is openly available in the Dryad data archive. Please reuse it, and feel free to contact me if you have questions! Data citation:
Piwowar HA, Vision TJ, & Whitlock MC (2011). Data from: Data archiving is a good investment. Dryad Digital Repository. doi:10.5061/dryad.j1fd7
Full text of letter upon submission:
Data archiving gives a high return on investment
Piwowar, Vision, Whitlock
As recognized by the recent NSF, NIH, and other research council requirements for data management and dissemination plans, data archiving is valuable to scientific progress. Unfortunately, funding agencies have been reluctant to make long-term commitments to data archives. Here, we compare the research productivity per dollar of data archives with existing benchmarks. We argue that ongoing investment in data archiving infrastructure provides a high scientific return on financial investment.
First, how much do archives cost? As an example, we use Dryad (datadryad.org), a data repository for the biosciences with which we are all associated. For Dryad, a relatively cost-effective archive, we estimate that data from over 10,000 publications can be curated and preserved each year for approximately $400,000.
Next, how much research is typically published per grant dollar? NSF core grants in Population and Community Ecology averaged about 3-4 papers per $100,000 from 2000-2005, according to research conducted by Savanna Reyes with Alan Tessier and Susan Mazer. Thus, at the upper end of that range, $400,000 in original research funding results in about 16 papers.
Finally, how productive are data archives in facilitating original research publications? It is too early to say for Dryad, but we can look to NCBI’s Gene Expression Omnibus (GEO) database for insight. To derive an estimate of data reuse, we searched the full text of articles in PubMed Central (PMC) for mention of any of the 2,711 datasets deposited in GEO in 2007. We excluded articles whose author names overlapped those who had deposited the dataset. Extrapolating the 338 hits in PMC to all of PubMed, we estimate the GEO datasets from 2007 have made substantive third-party contributions to more than 1,150 published articles in 2007-2010 alone, and reuses continue to accumulate rapidly. Details of this estimate are available at doi:10.5061/dryad.j1fd7.
Assuming that Dryad has a comparable rate of reuse and collects even 2,500 datasets per year, a $400,000 investment would contribute to more than 1,000 papers within 4 years, far more than the roughly 16 papers the same sum would typically yield as research grants. Some papers based on data reuse may be partially funded by additional grant support; nonetheless, the modest amount of funding needed to maintain a repository like Dryad is almost certain to generate a large scientific return on investment. To maximize the impact of the support they provide to individual investigators, research funders should include the maintenance of data archives as an integral component of their investment portfolios.
Competing financial interests: Heather Piwowar and Todd Vision receive research support from the Dryad data repository project.
Original article-length draft:
Data archiving gives a high return on investment
Piwowar, Vision, Whitlock
In many fields of science, data are rarely available for scrutiny or reuse by the broader community. While data produced within large-scale research initiatives are increasingly freely and openly available, important data captured in the “long tail” of investigator-driven research are almost inevitably lost through a combination of poor data management, hardware failure, and retirement or death of their collectors.
In recent years, many funding agencies and journals have increased their expectations for the sharing of research data, and a variety of public data archives have been established to support these policies. Unfortunately, many of these archives face uncertain futures because the recurrent funds necessary for long-term preservation are difficult to obtain. Funding agencies must weigh investment in data infrastructure, such as data archives, against investment in research itself. Do data archives merit long-term investment?
We provide evidence that data archives promise an outstanding return on investment by facilitating a productive afterlife for data that would otherwise see very limited reuse. We use as our metric the number of papers written based on archived data, relative to the maintenance cost of the archive. While not captured by our numbers, it is important to also appreciate that data archiving has benefits beyond new publications, including transparency and broadened participation.
First, consider how much archives cost. This obviously varies depending on the archive. As a benchmark, we use Dryad (datadryad.org), a data repository with which we are all associated. Dryad was launched two years ago to house datasets associated with published articles in the biosciences. Dryad has been designed to operate efficiently: budget estimates for Dryad suggest that it can curate and preserve the data from over 10,000 publications on an annual budget of $400,000.
Second, how productive is research funding, in terms of journal publications? NSF core grants in the Population and Community Ecology cluster averaged about 3-4 papers per $100,000 from 2000-2005, according to research conducted by Savanna Reyes with Alan Tessier and Susan Mazer (1). Estimates from other studies in the literature are similar (2-6). If we use the upper estimate of productivity from the Population and Community Ecology cluster, $400,000 in original research funding would result in about 16 papers.
Finally, how often do data archives facilitate novel research? It is too early to say how many research papers Dryad will enable, but we can look to comparable data repositories for insight. Within the biosciences, GenBank and the Protein Data Bank are well-known success stories, but it is sometimes suggested that these datatypes are particularly conducive to reuse. NCBI’s Gene Expression Omnibus (GEO) database contains data more typical of individual investigator-driven research: gene expression microarray data are collected under a wide range of experimental conditions, on a variety of incompatible platforms, and undergo variable processing steps.
To derive an estimate of the reuse of data in GEO, we took advantage of the conventions for citing GEO datasets through accession numbers and GEO’s integration with PubMed and PubMed Central (PMC). Using PMC, we searched the full text of papers published between 2007 and 2010 for mention of one or more of the 2,711 accession numbers assigned to data series submitted to GEO in 2007. After excluding those papers that a) had author names in common with those who deposited the data (since the original authors would presumably have access to the data even in the absence of the archive) and b) mentioned an accession number without building upon the dataset, we identified 338 papers that appear to reuse the 2007 GEO datasets in a significant way.
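The screening steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the analysis: it assumes papers are available as dictionaries with hypothetical keys "id", "text" and "authors", it matches GEO series accessions by their "GSE" plus digits form, and it compares author names as plain lowercase strings. Step (b), judging whether a paper actually builds on the dataset, needs a human reader and is not automated here.

```python
import re

# GEO series accession numbers have the form "GSE" followed by digits.
GSE_PATTERN = re.compile(r"\bGSE\d+\b")


def mentioned_accessions(full_text, target_accessions):
    """Return the target GEO accessions that appear in a paper's full text."""
    return set(GSE_PATTERN.findall(full_text)) & set(target_accessions)


def shares_author(paper_authors, depositor_names):
    """Crude overlap check: case-insensitive comparison of name strings."""
    def normalise(names):
        return {name.strip().lower() for name in names}
    return bool(normalise(paper_authors) & normalise(depositor_names))


def candidate_reuses(papers, depositors_by_accession):
    """Yield (paper_id, accession) pairs that pass the author-overlap filter.

    `papers` is an iterable of dicts with the hypothetical keys
    "id", "text" and "authors"; `depositors_by_accession` maps each
    accession number to the names of the people who deposited it.
    """
    for paper in papers:
        hits = mentioned_accessions(paper["text"], depositors_by_accession)
        for accession in sorted(hits):
            depositors = depositors_by_accession[accession]
            if not shares_author(paper["authors"], depositors):
                yield paper["id"], accession
```

Each pair that comes out of `candidate_reuses` is only a candidate: it still has to be read to confirm that the paper reuses the dataset rather than merely mentioning it.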
Because PMC contains only a subset of papers recorded in PubMed, we extrapolated to the expected number of articles in PubMed based on the ratios of papers in PMC to PubMed in this domain (measured as the number of articles indexed with the MeSH term “gene expression profiling” in PMC relative to the number of articles with the same MeSH term in all of PubMed; 2007:23%, 2008:32%, 2009:36%, 2010:25%). We estimate that, as of the end of 2010, the whole of PubMed contains 1159 papers that mention GEO accession numbers in the context of novel reuse for datasets submitted in 2007 alone. Thus, for every ten datasets that it collects, we estimate that GEO contributes to at least four papers in the following three years.
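As a rough illustration of this extrapolation, the sketch below scales per-year PMC counts by the coverage ratios quoted above and then checks the reported totals against each other. The per-year split of the 338 PMC papers is not given in this post, so the function takes it as an input rather than assuming values; the function name and variable names are ours.

```python
# Share of "gene expression profiling" articles in PubMed that are also
# in PMC, by publication year (ratios quoted in the text).
PMC_COVERAGE = {2007: 0.23, 2008: 0.32, 2009: 0.36, 2010: 0.25}


def extrapolate_to_pubmed(pmc_hits_by_year, coverage=PMC_COVERAGE):
    """Scale per-year PMC reuse counts up to an expected PubMed-wide count."""
    return sum(hits / coverage[year] for year, hits in pmc_hits_by_year.items())


# Consistency checks that use only totals reported in the text.
DATASETS_2007 = 2711     # GEO data series submitted in 2007
PMC_REUSE_PAPERS = 338   # reuse papers identified in PMC
PUBMED_ESTIMATE = 1159   # extrapolated PubMed-wide estimate

papers_per_ten_datasets = 10 * PUBMED_ESTIMATE / DATASETS_2007
implied_overall_coverage = PMC_REUSE_PAPERS / PUBMED_ESTIMATE

print(f"papers per ten datasets: {papers_per_ten_datasets:.1f}")
print(f"implied overall PMC coverage: {implied_overall_coverage:.0%}")
```

The first figure works out to a little over four papers per ten datasets, and the implied overall coverage of about 29% sits within the range of the yearly ratios, as it should.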
This is an underestimate of reuse for several reasons. Our screen only captures papers that attribute reuse through mention of a GEO accession number, which is common practice but not universal. Furthermore, this analysis only includes the first few years in the productive afterlife of the data. As illustrated in Figure 1, reuses of data from 2007 continue to accumulate rapidly.
Assuming conservatively that Dryad collects 2,500 datasets per year, and that it has a rate of publishable reuse equivalent to that for GEO, a $400,000 investment in this data archive would contribute to more than 1,000 papers within 4 years. While papers based on data reuse may be partially funded by grant support for analysis effort or additional data collection, the modest amount of funding needed to maintain a repository like Dryad is almost certain to generate a large scientific return on investment.
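The comparison can be reproduced as back-of-the-envelope arithmetic. The sketch below uses only the figures given in this article and carries the same assumptions: Dryad matches GEO's reuse rate, and the upper estimate of four papers per $100,000 applies to grant funding. The variable names are ours.

```python
ARCHIVE_BUDGET = 400_000       # annual Dryad budget in USD
PAPERS_PER_100K_GRANT = 4      # upper estimate for NSF core grants
DATASETS_PER_YEAR = 2_500      # deliberately low Dryad intake
REUSE_RATE = 1159 / 2711       # GEO: estimated reuse papers per dataset

# Papers expected if the same money went to original research grants.
grant_papers = ARCHIVE_BUDGET / 100_000 * PAPERS_PER_100K_GRANT

# Papers expected to draw on one year's worth of archived datasets.
archive_papers = DATASETS_PER_YEAR * REUSE_RATE

print(f"grant-funded papers:    {grant_papers:.0f}")
print(f"archive-enabled papers: {archive_papers:.0f}")
print(f"ratio:                  {archive_papers / grant_papers:.0f}x")
```

The two counts are not strictly like for like, since a reuse paper may also draw on other grant support, but the gap is wide enough that the conclusion does not hinge on that caveat.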
Public data archiving can generate important new results for a small fraction of the currently accepted cost of doing science. To maximize the impact of the support they provide to individual investigators, research funders should include the maintenance of data archives as an integral component of their investment portfolios.
Supporting data and detailed methods are available as supplementary information.
This analysis was conducted under the auspices of DataONE, funded by a Cooperative Agreement through the NSF DataNET program (OCI-0830944).
1. Personal communication
2. K. W. Boyack, K. Börner, Indicator-assisted evaluation and funding of research: Visualizing the influence of grants on the number and citation counts of research papers. Journal of the American Society for Information Science and Technology 54, 447 (2003).
3. B. G. Druss, S. C. Marcus, Tracking publication outcomes of National Institutes of Health grants. The American Journal of Medicine 118, 658 (2005).
4. M. Gaughan, B. Bozeman, Using curriculum vitae to compare some impacts of NSF research grants with research center funding. Research Evaluation 11, 17 (2002).
5. D. Hendrix, An analysis of bibliometric indicators, National Institutes of Health funding, and faculty size at Association of American Medical Colleges medical schools, 1997–2007. Journal of the Medical Library Association 96, 324 (2008).
6. V. Larivière, B. Macaluso, É. Archambault, Y. Gingras, Which scientific elites? On the concentration of research funds, publications and citations. Research Evaluation 19, 45 (2010).