Research Remix

July 5, 2010

Studying reuse of GEO datasets in the published literature

Filed under: research — Heather Piwowar @ 12:31 pm

Studying data sharing is great, but what we really want to know about is data reuse.


There are benefits to data sharing and archiving even if the data are never reused: for example, sharing detailed datasets likely discourages fraud. Other benefits may accrue but leave no evidence in the published literature: datasets may be used for education and training, or by interested readers to more deeply understand and validate the data-producing study.

The real value-add of data archiving, though, is in the potential for more efficient and effective scientific progress through data reuse. There have been many calls to quantify the extent and impact of reuse: in effect, to do a cost/benefit analysis. An estimate of the value of reuse would help make a business case for repository funding, make an ethical case for compelling investigators to bear the personal cost of sharing data, and clarify that sharing even esoteric data is useful. As the Nature Neuroscience editorial Got Data? puts it, “After all, no one ever knocked on your door asking to buy those figurines collecting dust in your cabinet before you listed them on eBay.”

Little research has been done on the patterns and prevalence of reuse. A few superstar success stories need no analysis: GenBank and the Protein Data Bank are reused, heavily, successfully. They have generated important science that would not have been possible otherwise.

They are so successful, though, that people discount them as special cases.

So what does the reuse behaviour look like for other datasets?

We don’t know. It is difficult to track reuse. There have been a few surveys, but they suffer from limited scope and self-reporting biases. I gather that download stats are poorly correlated with perceived value (I need to learn more about this). So let’s track reuse in the published literature.


Yeah. Well. Let’s. Unfortunately it won’t be simple, due to the lack of standards for data citations and the ambiguity of citation contexts:

  • There is no standard identifier for a given dataset. The accession number? DOI? The citation of the data-producing paper? The author’s name? A search strategy or set of inclusion criteria for a set of datasets?
  • Even when there is a standard identifier, or a small set of identifiers, there is no standard practice for referencing the identifier. As a mention in the methods? In the acknowledgements? In a supplementary table? Cited as an official reference?
  • Our current tools for finding and extracting these identifiers are poor. If the mentions are in the full text, we need to run full-text queries across a wide range of the published literature. Google Scholar, ISI Web of Science, PubMed Central, Scirus, HighWire Press… they all have serious drawbacks. Citations are easier to find, but extracting data from ISI Web of Science and Scopus is still cumbersome and not machine-friendly.
  • Finally, intelligence is required in disambiguating the dataset mentions. Is the paper discussing a dataset deposit, a dataset reuse, or something else?
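To make the last two points concrete, here is a minimal Python sketch of what mention-finding and disambiguation might look like for GEO. It assumes the full text of an article is already in hand as a string, and that mentions use GEO's accession prefixes (GSE, GDS, GSM, GPL). The cue-phrase lists, the function names, and the 80-character context window are illustrative assumptions, not a validated method and not necessarily the approach used in this project.

```python
import re

# GEO accessions: GSE (series), GDS (curated dataset), GSM (sample),
# GPL (platform), each followed by digits.
ACCESSION_RE = re.compile(r"\b(?:GSE|GDS|GSM|GPL)\d+\b")

# Hypothetical cue phrases, chosen only for illustration.
DEPOSIT_CUES = ("deposited", "submitted to", "have been made available")
REUSE_CUES = ("downloaded", "obtained from", "retrieved from", "reanalyz")


def classify_context(context):
    """Label a text snippet as 'deposit', 'reuse', or 'unclear'."""
    lowered = context.lower()
    has_deposit = any(cue in lowered for cue in DEPOSIT_CUES)
    has_reuse = any(cue in lowered for cue in REUSE_CUES)
    if has_deposit and not has_reuse:
        return "deposit"
    if has_reuse and not has_deposit:
        return "reuse"
    # No cues, or conflicting cues: leave it for a human to decide.
    return "unclear"


def find_mentions(text, window=80):
    """Return (accession, label) pairs for each accession found in text.

    The label is based on the characters within `window` of the match.
    """
    mentions = []
    for match in ACCESSION_RE.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        label = classify_context(text[start:end])
        mentions.append((match.group(0), label))
    return mentions
```

Calling find_mentions on a methods paragraph returns one pair per accession found. Anything labelled "unclear" still needs a human reader, which is rather the point of the last bullet: keyword matching can shrink the pile, but it does not replace judgment.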


It isn’t simple to gather reuse patterns, but it is possible, so I’ve started. I’m keeping open notes on my progress at OpenWetWare. Early days. Stay tuned.


  1. Hi, Heather. Excellent posts. Beautifully written (and quite funny here, “Yeah. Well. Let’s.”). Important points on an important topic.

    Comment by Hope Leman — July 5, 2010 @ 3:09 pm

  2. Hi Heather,
    Great post. I’ve taken a look at the OWW content and it looks great too. I’m not too much of a fan of GEO’s interface so I made a very simplified version of my own and ended up also writing a script to download data via command line (will share script soon). I’ll be keeping an eye on your work, quite interesting.

    ~ Ricardo

    Comment by Ricardo — July 7, 2010 @ 7:09 pm

  3. Hi Heather. This is an important research project. Can I persuade you to list it on the Research in progress page of the Open Access Directory? The entry there could point back to this blog post and your notes at OpenWetWare. It should help spread the word and could stimulate some useful collaboration.

    Comment by Peter Suber — July 8, 2010 @ 7:01 am

  4. Wow–this is even more important than I realized if Peter Suber is interested. I wonder if Victoria Stodden has done anything on the matter of reuse. She certainly does a lot of valuable work on the matter of preparing data for optimal use from the point of creation. Here are some of her papers:

    Are you going to the Open Science Summit, Peter?

    Heather will monitor it remotely.

    Comment by Hope Leman — July 8, 2010 @ 7:33 am

  5. This sounds like a great initiative, Heather. Finding evidence of re-use of data was one of the key challenges to be addressed in developing the criteria for BioMed Central’s Open Data award. Good luck.

    Comment by Iain Hrynaszkiewicz — July 9, 2010 @ 2:14 am

  6. Thanks for the comments, everyone. Good to know the project resonates. I’ll keep you posted.

    Peter, I’ve added the project and links to the Research In Progress page of the OAD, thank you for the invitation!

    Comment by Heather Piwowar — August 3, 2010 @ 2:58 pm

  7. Thanks, Heather, and good luck.

    Comment by Peter Suber — August 3, 2010 @ 5:32 pm

  8. […] data reuse is important.  Tracking data reuse is currently […]

    Pingback by Tracking dataset citations using common citation tracking tools doesn’t work « Research Remix — November 9, 2010 @ 2:39 pm

  9. […] previously introduced on this blog and as a poster at ASIS&T (flowchart), the methods are quite […]

    Pingback by Early results: public data archiving increases scientific contribution by more than a third « Research Remix — February 18, 2011 @ 12:14 pm

  10. […] Studying Reuse Of GEO Datasets In The Published Literature […]

    Pingback by Full text and details for Nature letter “Data archiving is a good investment” « Research Remix — May 19, 2011 @ 3:54 pm

  11. I love this. I want more on this subject.
    Any references, please?

    Comment by Joe Smith — June 27, 2011 @ 6:55 am
