July 16, 2012

Many datasets are reused, not just an elite few

I’ve recently collected new data on data reuse.  Using the same methods as our Nature letter-to-the-editor analysis, I’ve looked for reuse of gene expression microarray data in PubMed Central by searching for dataset ID numbers in the full text of studies.  Studies that mention a dataset accession number but share author last names with those who deposited the dataset are excluded.
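The matching step described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the accession pattern (GSE/GDS followed by digits), the function name `find_third_party_reuse`, and the input layout (papers as dicts, depositors as a mapping from accession to last names) are all assumptions made for the example.

```python
import re

# Assumed pattern for GEO accessions (e.g. GSE1234, GDS567); illustrative only.
ACCESSION_RE = re.compile(r"\b(?:GSE|GDS)\d+\b")


def find_third_party_reuse(papers, depositors):
    """Return a set of (accession, paper_id) pairs counted as third-party reuse.

    papers: list of dicts with keys 'id', 'authors' (last names), 'text'.
    depositors: dict mapping accession -> set of depositor last names.
    Both structures are hypothetical stand-ins for the real data sources.
    """
    reuses = set()
    for paper in papers:
        paper_authors = {name.lower() for name in paper["authors"]}
        # Deduplicate mentions so each paper counts once per dataset.
        for accession in set(ACCESSION_RE.findall(paper["text"])):
            if accession not in depositors:
                continue  # not a dataset we are tracking
            depositor_names = {name.lower() for name in depositors[accession]}
            if paper_authors & depositor_names:
                continue  # shared last name: excluded as a likely self-mention
            reuses.add((accession, paper["id"]))
    return reuses
```

Matching on last names alone is deliberately blunt: it will also exclude some genuine third-party reuses by authors who happen to share a surname with a depositor, which is one more reason the counts err on the low side.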

The new results look at datasets deposited into the Gene Expression Omnibus (GEO) repository between 2001 and 2009.

Results for the middle years are particularly important, since by then GEO had lots of datasets, and enough time has passed since then for reuse to accumulate.  We observed reuse of more than 20% of the datasets deposited in 2003 and 17% of datasets deposited in 2007.
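For readers who want to compute the same kind of per-year figure on their own data, here is a small sketch. The function name `reuse_fraction_by_year` and both inputs are hypothetical; nothing below reproduces the actual GEO numbers.

```python
from collections import defaultdict


def reuse_fraction_by_year(deposit_year, reused):
    """Fraction of datasets deposited in each year with at least one observed reuse.

    deposit_year: dict mapping accession -> year of deposit (assumed input).
    reused: set of accessions with at least one third-party reuse (assumed input).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for accession, year in deposit_year.items():
        totals[year] += 1
        if accession in reused:
            hits[year] += 1
    # Every year in totals has at least one dataset, so no division by zero.
    return {year: hits[year] / totals[year] for year in totals}
```

Because a dataset counts once no matter how many papers reuse it, this measures how broad the base of reused datasets is, not how heavily any single dataset is reused.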

Note: the method used to detect reuse here is VERY CONSERVATIVE, so these are minimum estimates.  It finds only reuses by papers that are in PubMed Central, and only those attributed by mentioning the accession number (it misses those attributed by citing the original article, for example).  Nonetheless, it does serve as a lower bound.

Analysis of the accession number mentions revealed that data reuse was driven by a broad base of datasets: about 20% of the datasets deposited between 2003 and 2007 have been reused by third parties. We note these proportions are gross underestimates since they only include reuses we observed as accession number mentions in PubMed Central; no attempt has been made to extrapolate these distribution statistics to all of PubMed, or to reflect attributions through citations. Further, many important instances of data reuse do not leave a trace in the published literature, such as those in education and training. Nonetheless, even these conservative estimates suggest that reuse finds value in a wide range of datasets, not simply a “very reusable” elite.

(manuscript-in-progress with co-author Todd Vision)

