Research Remix

February 18, 2011

Early results: public data archiving increases scientific contribution by more than a third

Filed under: data reuse — Heather Piwowar @ 11:53 am

To date, most arguments about the value of data reuse have been based upon assumption or promise rather than evidence. The public archiving of research datasets has much potential to unleash additional scientific contribution, but there has been little evidence (other than a few clear success stories, such as DNA sequence comparisons in Genbank) of the impact that wide sharing of publication-related datasets has on scientific progress.

Below are early results of a large-scale study into data reuse. This analysis focusses on a single database, NCBI’s Gene Expression Omnibus, and quantifies data reuse by identifying attribution to database record accession numbers within scientific article full-text. In contrast to the broad Tracking 1000 datasets project, this GEO reuse analysis is a deep look into a single datatype, leveraging many of the PubMed and NCBI resources to allow automated and scalable data collection.

As previously introduced on this blog and as a poster at ASIS&T (flowchart), the methods are quite simple:

  • get a list of all submissions to GEO within a given year, in this case 2007
  • record the names of all authors listed in the dataset submission and of any dataset submission-related papers
  • for each accession number (of the form GSE1234), query all of PubMed Central for any occurrence of the accession number within full text
  • classify the PubMed Central “hits” into those that share author names with the submitters and those that don’t
  • extrapolate the results from PubMed Central to all of PubMed, using the yearly ratio of papers in PubMed to those in PMC with MeSH indexing term “gene expression profiling”

That’s it! I did do some painstaking manual curation of the full-text citation sentences, attempting to identify accession number “hits” that were mentioned merely in passing rather than as an attribution of reuse. Reuse instances overwhelmingly dominated, so the manual curation isn’t really necessary.
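
The matching and classification steps listed above can be sketched roughly as follows. This is a minimal illustration, not the pypub code: the `Article` record, the function names, and the sample articles are assumptions made for the example, and the step that actually queries PubMed Central is left out. As in the method described, shared authorship is decided on last names alone.

```python
import re
from dataclasses import dataclass

# GEO series accession numbers have the form GSE1234, as described above.
ACCESSION_RE = re.compile(r"\bGSE\d+\b")


@dataclass(frozen=True)
class Article:
    """A full-text article; the field names are illustrative assumptions."""
    pmcid: str
    author_last_names: frozenset
    full_text: str


def find_accessions(text):
    """Return the set of GSE accession numbers mentioned in the text."""
    return set(ACCESSION_RE.findall(text))


def classify_hit(article, submitter_last_names):
    """Label a hit 'original' if any author last name matches a submitter."""
    authors = {name.lower() for name in article.author_last_names}
    submitters = {name.lower() for name in submitter_last_names}
    return "original" if authors & submitters else "third_party"


def classify_articles(articles, submitters_by_accession):
    """Return {label: set of PMC IDs} for articles citing a tracked accession.

    An article that cites several datasets can end up under both labels.
    """
    result = {"original": set(), "third_party": set()}
    for article in articles:
        for accession in find_accessions(article.full_text):
            submitters = submitters_by_accession.get(accession)
            if submitters is None:
                continue  # not one of the submissions being tracked
            result[classify_hit(article, submitters)].add(article.pmcid)
    return result


# Made-up sample data, only to show the shape of the inputs.
articles = [
    Article("PMC1", frozenset({"Smith", "Lee"}), "Data are available as GSE1234."),
    Article("PMC2", frozenset({"Garcia"}), "We reanalyzed GSE1234 and GSE9999."),
    Article("PMC3", frozenset({"Chen"}), "No accession number here."),
]
submitters_by_accession = {"GSE1234": {"smith", "Jones"}}
hits = classify_articles(articles, submitters_by_accession)
print(hits)
```

Matching on last names alone will mislabel unrelated authors who share a common surname, which is the same roughness acknowledged later in this post.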

Here’s what I’ve found so far. THESE ARE PRELIMINARY RESULTS and may change as the method is refined and/or mistakes are discovered.

There were 2711 submissions to GEO in 2007.

In the three years since these datasets were deposited, the original investigators (or people with the same last names as the original investigators) have published 851 papers in PubMed Central in which they refer to their dataset accession numbers. Extrapolating that based on the ratios of papers in PMC to PubMed in this domain (2007:23%, 2008:32%, 2009:36%, 2010:25%), I estimate there are at least 3249 papers in PubMed, by the original investigators, that use or reuse 2007 GEO data.

In the same three years, author groups that did not include anyone from the original dataset submission group published 323 papers in PubMed Central referring to GEO data accession numbers from 2007. This extrapolates to 1109 secondary-use papers in all of PubMed that pay attribution to the 2007 GEO datasets through accession numbers.
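
The extrapolation behind both estimates can be sketched in a few lines. This is a hedged sketch that assumes the correction is applied per publication year, dividing each year's PMC count by that year's PMC share. The function name is hypothetical and the example counts are invented, because the per-year breakdown of the 851 and 323 papers is not given in this post.

```python
# Share of domain papers in PubMed that are also in PMC, as quoted in this post.
PMC_SHARE_BY_YEAR = {2007: 0.23, 2008: 0.32, 2009: 0.36, 2010: 0.25}


def extrapolate_to_pubmed(pmc_counts_by_year, pmc_share_by_year=PMC_SHARE_BY_YEAR):
    """Estimate the PubMed-wide paper count from per-year PMC counts."""
    total = 0.0
    for year, count in pmc_counts_by_year.items():
        # If PMC holds a fraction s of a year's papers, n PMC hits suggest n / s overall.
        total += count / pmc_share_by_year[year]
    return round(total)


# Invented counts, chosen so each year scales to 100 papers.
example_counts = {2007: 23, 2008: 32, 2009: 36, 2010: 25}
print(extrapolate_to_pubmed(example_counts))  # -> 400
```

The estimate is only as good as the assumption that papers citing GEO accession numbers are as likely to be in PMC as other papers indexed under "gene expression profiling".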

(Attribution via accession numbers in the body of a paper is common in this field, but not universal. Sometimes authors attribute by referencing the data-producing paper alone, or by listing the accession number in supplementary information. These reuse attributions are not captured by this method.)

Given the many reports that requesting data from authors is often unsuccessful, I suggest that most of these 1109 secondary-reuse papers could not have leveraged the dataset-related science had the datasets not been posted in a publicly available archive. This implies that within three years, GEO has enabled the science behind its dataset submissions to contribute to roughly one third more scientific publications (an estimated 1109 secondary-use papers on top of an estimated 3249 by the original investigators, about 34%) than would have been possible had the data not been publicly archived. Furthermore, the number of these reuses is still increasing over time, unlike the number of papers by the original investigators.

We might ask whether the majority of these reuses come from just a small number of the dataset submissions, rendering large-scale archiving unnecessary. This is not the case. 369 of 2711 datasets (about 14%) were referred to by at least one PMC article whose authors did not include the dataset submitters. There is a very long tail. NOTE: the 369 represents reuses in PMC. This number has not been extrapolated to all of PubMed (it is not yet obvious to me how to do that extrapolation properly). The numbers and figures should be considered to represent just a small subset of all reuses that have occurred.

Admittedly, the analysis is crude. It does not (yet) make any attempt to account for the relative scientific impact of the original and reuse papers, nor the role of GEO data in their findings. It does not measure indirect benefits that come from data archiving, like transparency or broadened participation. The method only considers attributions that mention accession numbers, and identifying common authors based solely on their last name is very rough. It isn’t clear how generalizable these results may be to other domains and datatypes.

Nonetheless, these analyses and others like them will help us understand and communicate the benefits of data sharing, and point to further ways we can unleash the value of our scientific efforts. Stay tuned.

This analysis is in early stages. Raw data and calculations are available on Google Spreadsheets. Data collection code is going in the pypub project on GitHub. Conversation, suggestions, and critique are encouraged! I’m sharing this in its early stages to get your feedback.

ETA: Recharacterized the evidence level of unsuccessful data requests from anecdotal to reported.

March 24, 2008

Envisioning a Biomedical Data Reuse Registry

Filed under: data reuse, Data Reuse Registry, MyResearch — Heather Piwowar @ 9:48 am

An idea I’ve been thinking about recently:

Envisioning a Biomedical Data Reuse Registry

Heather A. Piwowar and Wendy W. Chapman

Abstract

Repurposing research data holds many benefits for the advancement of biomedicine, yet is very difficult to measure and evaluate. We propose a data reuse registry to maintain links between primary research datasets and studies that reuse this data. Such a resource could help recognize investigators whose work is reused, illuminate aspects of reusability, and evaluate policies designed to encourage data sharing and reuse.

Motivation

The full benefits of data sharing will only be realized when we can incent investigators to share their data[1] and quantify the value created by data reuse.[2] Current practices for recognizing the provenance of reused data include an acknowledgment, a listing of accession numbers, a database search strategy, and sometimes a citation within the article. These mechanisms make it very difficult to identify and tabulate reuse, and thus to reward and encourage data sharing. We propose a solution: a Data Reuse Registry.

What is a data reuse registry?

We define a Data Reuse Registry (DRR) as a database with links between biomedical research studies and the datasets used within the studies. The reuse articles may be represented as PubMed IDs, and the datasets as accession numbers within established databases or the PubMed IDs of the studies that originated the data.
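
To make this definition concrete, here is a minimal in-memory sketch of such a link table. It is an illustration under stated assumptions, not a proposed schema: the class names, field names, and the PubMed IDs in the usage lines are all hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReuseLink:
    """One link from a reuse study to a dataset it used (illustrative schema)."""
    reuse_pmid: str                          # PubMed ID of the reusing article
    dataset_accession: Optional[str] = None  # e.g. a GEO accession such as GSE1234
    origin_pmid: Optional[str] = None        # PubMed ID of the data-producing study

    def __post_init__(self):
        # The definition allows either identifier, so require at least one.
        if self.dataset_accession is None and self.origin_pmid is None:
            raise ValueError("a link needs an accession number or an origin PMID")


class DataReuseRegistry:
    """In-memory stand-in for the DRR database."""

    def __init__(self):
        # Maps a dataset identifier (accession or origin PMID) to reusing PMIDs.
        self._by_dataset = defaultdict(set)

    def add(self, link):
        """Index the link under every dataset identifier it carries."""
        for key in (link.dataset_accession, link.origin_pmid):
            if key is not None:
                self._by_dataset[key].add(link.reuse_pmid)

    def reuses_of(self, dataset_key):
        """Return PubMed IDs of studies that reused the given dataset."""
        return set(self._by_dataset.get(dataset_key, set()))


# Placeholder identifiers, not real records.
drr = DataReuseRegistry()
drr.add(ReuseLink(reuse_pmid="11111111", dataset_accession="GSE1234"))
drr.add(ReuseLink(reuse_pmid="22222222", dataset_accession="GSE1234",
                  origin_pmid="33333333"))
print(sorted(drr.reuses_of("GSE1234")))  # ['11111111', '22222222']
```

A real registry would also need to record which database an accession number belongs to, since accession formats differ between archives.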

How would the DRR be populated?

We anticipate several mechanisms for populating the DRR:

  • Voluntary submissions
  • Automatic detection from the literature[3]
  • Prospective submission of reuse plans, followed by automatic tracking

We envision collecting prospective citations in two steps. First, prior to publication, investigators visit a web page and list datasets and accession numbers reused in their research, thereby creating a DRR entry record in the DRR database. In return, the reusing investigators will be given some best-practices free-text language that they can insert into their acknowledgments section, a list of references to the papers that originated the data, some value-add information such as links to other studies that previously reused this data, and a reference to a new DRR entry record. When authors cite this DRR within their reuse study as part of their data use acknowledgement, the second step of DRR data input can be done automatically: citations in the published literature will be mined periodically to discover citations to DRR entries. These citations will be combined with the information provided when the entry was created to explicitly link published papers with the datasets they reused. The result will be searchable by anyone wishing to understand the reuse impact made by an investigator, institution, or database.
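
The two-step prospective workflow could look something like the sketch below. Everything specific here is an assumption for illustration: the `DRR-000001` identifier format, the class and method names, and the placeholder PubMed ID. The literature-mining step is reduced to a regular-expression scan over text that is passed in.

```python
import re
from itertools import count

# Hypothetical identifier format for DRR entry records.
ENTRY_ID_RE = re.compile(r"\bDRR-\d{6}\b")


class ProspectiveRegistry:
    """Toy model of prospective registration followed by automatic tracking."""

    def __init__(self):
        self._next_number = count(1)
        self.entries = {}  # entry ID -> list of dataset accession numbers
        self.links = {}    # entry ID -> set of PubMed IDs that cite the entry

    def register_plan(self, accessions):
        """Step 1: an investigator lists datasets and receives an entry ID to cite."""
        entry_id = f"DRR-{next(self._next_number):06d}"
        self.entries[entry_id] = list(accessions)
        self.links[entry_id] = set()
        return entry_id

    def mine_paper(self, pmid, text):
        """Step 2: scan a published paper for citations to known DRR entries."""
        found = []
        for entry_id in set(ENTRY_ID_RE.findall(text)):
            if entry_id in self.entries:
                self.links[entry_id].add(pmid)
                found.append(entry_id)
        return sorted(found)

    def reuse_papers_for(self, accession):
        """Return PubMed IDs of papers linked to a dataset through any entry."""
        return {
            pmid
            for entry_id, accessions in self.entries.items()
            if accession in accessions
            for pmid in self.links[entry_id]
        }


registry = ProspectiveRegistry()
entry_id = registry.register_plan(["GSE1234"])
registry.mine_paper("12345678", f"Data reuse is recorded in {entry_id}.")
print(entry_id, registry.reuse_papers_for("GSE1234"))
```

Because the link between a paper and its datasets is stored when the entry is created, the mining step only has to recognize one identifier per paper rather than every accession number format.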

How would the DRR be used?

Information from the DRR could be used to recognize investigators whose work is reused, illuminate aspects of reusability, examine the variety of purposes for which a given dataset is reused, and evaluate policies designed to encourage data sharing and reuse.

Conclusion

While the DRR may not be a comprehensive solution, we believe it represents a starting place for finding solutions to the important problem of evaluating, encouraging, and rewarding data sharing and reuse.

Acknowledgments

HP is supported by NLM training grant 5T15-LM007059-19 and WC is funded through NLM grant 1 R01LM009427-01.

References

1. Compete, collaborate, compel. Nat Genet. 2007;39(8).
2. Ball CA, Sherlock G, Brazma A. Funding high-throughput data sharing. Nat Biotechnol. 2004 Sep;22(9):1179-83.
3. Piwowar HA, Chapman WW. Identifying data sharing in the biomedical literature. Submitted to the AMIA Annual Symposium 2008.

[This DRR summary has been submitted as a poster description to AMIA 2008]
