To date, most arguments about the value of data reuse have been based upon assumption or promise rather than evidence. The public archiving of research datasets has much potential to unleash additional scientific contribution, but there has been little evidence (other than a few clear success stories, such as DNA sequence comparisons in GenBank) of the impact that wide sharing of publication-related datasets has on scientific progress.
Below are early results of a large-scale study into data reuse. This analysis focusses on a single database, NCBI’s Gene Expression Omnibus, and quantifies data reuse by identifying attribution to database record accession numbers within scientific article full-text. In contrast to the broad Tracking 1000 datasets project, this GEO reuse analysis is a deep look into a single datatype, leveraging many of the PubMed and NCBI resources to allow automated and scalable data collection.
The method, in brief:
- get a list of all submissions to GEO within a given year, in this case 2007
- record the names of all authors listed in the dataset submission and of any dataset submission-related papers
- for each accession number (of the form GSE1234), query all of PubMed Central for any occurrence of the accession number within full text
- classify the PubMed Central “hits” into those that share author names with the submitters and those that don’t
- extrapolate the results from PubMed Central to all of PubMed, using the yearly ratio of papers in PubMed to those in PMC with MeSH indexing term “gene expression profiling”
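The matching and classification steps above can be sketched in a few lines of Python. This is a minimal illustration, not the actual pypub collection code: the function names are hypothetical, the inputs are assumed to be plain lists of last names, and the query URL only assumes the public NCBI E-utilities `esearch` endpoint with `db=pmc`. Whether a bare accession term searches PMC full text in the same way as the study's query is also an assumption.

```python
import re
from urllib.parse import urlencode

# GEO series accessions have the form "GSE" followed by digits, e.g. GSE1234.
ACCESSION_RE = re.compile(r"\bGSE\d+\b")


def find_accessions(text):
    """Return the distinct GSE accession numbers mentioned in a block of text."""
    unique = set(ACCESSION_RE.findall(text))
    return sorted(unique, key=lambda acc: int(acc[3:]))


def pmc_search_url(accession, retmax=1000):
    """Build (but do not send) an E-utilities search URL for one accession."""
    params = {"db": "pmc", "term": accession, "retmax": retmax}
    return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + urlencode(params)


def classify_hit(submitter_last_names, paper_last_names):
    """Label a PMC hit by whether it shares any last name with the submitters.

    Matching on last name alone is rough: unrelated authors with the same
    surname are counted as original investigators.
    """
    submitters = {name.strip().lower() for name in submitter_last_names}
    authors = {name.strip().lower() for name in paper_last_names}
    return "original" if submitters & authors else "secondary"


print(find_accessions("Data are in GEO (GSE1234, GSE99); see also GSE1234."))
print(classify_hit(["Smith", "Lee"], ["lee", "Garcia"]))
print(classify_hit(["Smith"], ["Garcia"]))
```

Lower-casing the names before comparing them keeps the match insensitive to capitalization, but it does nothing about shared surnames, which is the main weakness of this classification.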
That’s it! I did do some painstaking manual curation of the full-text citation sentences, attempting to identify accession number “hits” that were mentioned merely in passing rather than as an attribution of reuse. Reuse instances overwhelmingly dominated, so the manual curation isn’t really necessary.
Here’s what I’ve found so far. THESE ARE PRELIMINARY RESULTS and may change as the method is refined and/or mistakes are discovered.
There were 2711 submissions to GEO in 2007.
In the three years since these datasets were deposited, the original investigators (or people with the same last names as the original investigators) have published 851 papers in PubMed Central in which they refer to their dataset accession numbers. Extrapolating that based on the ratios of papers in PMC to PubMed in this domain (2007:23%, 2008:32%, 2009:36%, 2010:25%), I estimate there are at least 3249 papers in PubMed, by the original investigators, that use or reuse 2007 GEO data.
In the same three years, author groups that did not include anyone from the original dataset submission group published 323 papers in PubMed Central referring to GEO data accession numbers from 2007. This extrapolates to 1109 secondary-use papers in all of PubMed that pay attribution to the 2007 GEO datasets through accession numbers.
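As a rough illustration of the extrapolation arithmetic, here is a small Python sketch. It assumes each PMC paper is scaled by the PMC-to-PubMed ratio for its publication year, which is one reading of "yearly ratio". The per-year counts are hypothetical placeholders (the actual per-year breakdown is in the raw spreadsheet, not reproduced here); only the ratios and the four totals come from the text.

```python
# Share of domain papers in PubMed that are also in PMC, by year (from the text).
PMC_SHARE = {2007: 0.23, 2008: 0.32, 2009: 0.36, 2010: 0.25}


def extrapolate_to_pubmed(pmc_counts_by_year, pmc_share=PMC_SHARE):
    """Scale each year's PMC count by that year's PMC share and sum the results."""
    return sum(count / pmc_share[year] for year, count in pmc_counts_by_year.items())


# HYPOTHETICAL per-year PMC counts, chosen only to make the arithmetic easy to follow.
hypothetical_counts = {2007: 23, 2008: 32, 2009: 36, 2010: 25}
estimate = extrapolate_to_pubmed(hypothetical_counts)
print(round(estimate))  # each year scales to 100 papers, so 400 in total

# Sanity checks on the totals reported in the text.
implied_share_original = 851 / 3249    # overall PMC share implied for original investigators
implied_share_secondary = 323 / 1109   # overall PMC share implied for secondary use
secondary_per_original = 1109 / 3249   # secondary-use papers per original-investigator paper
print(round(implied_share_original, 2), round(implied_share_secondary, 2))
print(round(secondary_per_original, 2))
```

Both implied overall shares fall between the lowest (23%) and highest (36%) yearly ratios, as they should if the totals are weighted sums of per-year estimates.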
(Attribution via accession numbers in the body of a paper is common in this field, but not universal. Sometimes authors attribute through referencing the data-producing paper alone, or by listing the accession number in supplementary information. These reuse attributions are not captured by this method.)
Given the many reports that requesting data from authors is often unsuccessful, I suggest that most of these 1109 secondary-reuse papers could not have leveraged the dataset-related science had the datasets not been posted in a publicly available archive. This implies that within three years, GEO has enabled the science behind its dataset submissions to contribute to roughly one third more scientific publications than would have been possible had the data not been publicly archived (1109 secondary-use papers against 3249 by the original investigators). Furthermore, the number of these reuses is still increasing over time, unlike those of the original investigators.
We might ask whether the majority of these reuses come from just a small number of the dataset submissions, rendering large-scale archiving unnecessary. This is not the case. 369 of 2711 datasets (about 14%) were referred to by at least one PMC article whose authors did not include the dataset submitters. There is a very long tail. NOTE: the 369 represents reuses in PMC. This number has not been extrapolated to all of PubMed (it is not yet obvious to me how to do that extrapolation properly). The numbers and figures should be considered to represent just a small subset of all reuses that have occurred.
Admittedly, the analysis is crude. It does not (yet) make any attempt to account for the relative scientific impact of the original and reuse papers, nor the role of GEO data in their findings. It does not measure indirect benefits that come from data archiving, like transparency or broadened participation. The method only considers attributions that mention accession numbers, and identifying common authors based solely on their last name is very rough. It isn’t clear how generalizable these results may be to other domains and datatypes.
Nonetheless, these analyses and others like them will help us understand and communicate the benefits of data sharing, and point to further ways we can unleash the value of our scientific efforts. Stay tuned.
This analysis is in its early stages, and I’m sharing it now to get your feedback. Raw data and calculations are available on Google Spreadsheets. Data collection code is going in the pypub project on GitHub. Conversation, suggestions, and critique are encouraged!
ETA: recharacterized the evidence level of unsuccessful data requests from anecdotal to reported.