Research Remix

March 25, 2008

Identifying Data Sharing in Biomedical Literature

Filed under: MyResearch, by Heather Piwowar @ 12:04 pm

I emailed AMIA again to ask for clarification on their preprint policy, and quickly received this encouraging response: “Preposting is fine so long as the other sites don’t formally publish the work.” Great news, thanks AMIA.

Note: this brings my blog up-to-date on the research I’ve been doing, with the exception of one paper under review at PLoS Medicine. That one is a complex collaboration. Despite some attempts, there isn’t consensus about making it open at this point.

Here is the paper we submitted to the AMIA (American Medical Informatics Association) 2008 Annual Symposium. Nature Precedings link to appear once it has been posted.

Identifying Data Sharing in Biomedical Literature
Heather A. Piwowar and Wendy W. Chapman

Many policies and projects now encourage investigators to share their raw research data with other scientists. Unfortunately, it is difficult to measure the effectiveness of these initiatives because data can be shared in such a variety of mechanisms and locations. We propose a novel approach to finding shared datasets: using natural language processing (NLP) techniques to identify declarations of dataset sharing within the full text of primary research articles. Using regular expression patterns and machine learning algorithms on open access biomedical literature, our system was able to identify 61% of articles with shared datasets with 80% precision. A simpler version of our classifier achieved higher recall (86%), though lower precision (49%). We believe our results demonstrate the feasibility of this approach and hope to inspire further study of dataset retrieval techniques and policy evaluation.

Full text


My inspiration for this work was the idea of a Data Reuse Registry and associated research. As discussed, a DRR would benefit from automatic identification of data reuse in the biomedical literature. Unfortunately, automatic identification of data reuse is a tough place to start my NLP (natural language processing) journey because I haven’t found any large, pre-existing gold standards of data reuse to use for evaluating such a system (this list of GEO “third party” data reuse papers is a start).

Identifying data sharing is easier: there are available gold standards via database links, and authors tend to use more uniform language in describing sharing than reuse. Automatically detecting data sharing could be useful to my research in other ways as well, down the road, as I look towards further sharing policy evaluation.

This data sharing identification system used very simple NLP techniques. I hope to (and will probably need to) dig into some more complex approaches as I tackle data reuse identification.
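To give a flavor of what "very simple" can mean here, below is a minimal sketch of regex-based detection of data sharing statements, scored against a gold standard with precision and recall. The patterns, example sentences, and labels are all made up for illustration (including the ArrayExpress identifier); they are not the patterns or data from the paper, and the paper's system also used machine learning, which this sketch leaves out.

```python
import re

# Hypothetical patterns for illustration only; NOT the ones used in the paper.
SHARING_PATTERNS = [
    re.compile(r"\b(?:deposited|submitted)\s+(?:in|to|at|with)\b", re.IGNORECASE),
    re.compile(r"\baccession\s+(?:number|no\.?|code)s?\b", re.IGNORECASE),
    re.compile(r"\bGSE\d+\b"),  # GEO series accessions look like GSE + digits
]


def mentions_sharing(text):
    """Return True if any sharing pattern matches the article text."""
    return any(p.search(text) for p in SHARING_PATTERNS)


def precision_recall(predicted, gold):
    """Compute (precision, recall) from parallel lists of booleans."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy (text, gold label) pairs, invented for this sketch. In practice the
# gold labels could come from database links back to the articles.
examples = [
    ("Microarray data were deposited in GEO under accession number GSE1234.", True),
    ("Expression data can be found in ArrayExpress (E-MEXP-100).", True),
    ("We downloaded previously published expression profiles for validation.", False),
    ("Samples were submitted to the core facility for sequencing.", False),
]

predicted = [mentions_sharing(text) for text, _ in examples]
gold = [label for _, label in examples]
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On these toy sentences the patterns miss one shared dataset (the phrasing doesn't match) and wrongly flag one non-sharing sentence ("submitted to the core facility"), which is the kind of precision/recall tradeoff that shows up when patterns are loosened or tightened.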

If anyone knows of other resources that list specific instances of data reuse, I’d love to hear about them!
