Research Remix

July 12, 2010

Recap of iEvoBio BoF on open science, data sharing & reuse, credit.

Filed under: conferences - Heather Piwowar @ 10:52 am

The organizers of the recent iEvoBio meeting have asked for a summary of the Birds-of-a-Feather session.  I didn’t take notes, but here is a start:

About 10 people participated in the BoF that merged the three sign-up topics “open notebook science”, “data sharing and reuse”, and “data citations and a culture of credit.”

We had an energetic and wide-ranging discussion that included participation from people with diverse backgrounds, perspectives, and opinions.  A few of the topics included:

  • the variants of open notebook science and how they are supported (or undersupported, in some cases) by Open Wet Ware
  • the need to publish minimal data slices to prevent scooping, particularly for some datatypes, and how it can lead to misinterpretation of the data by others
  • whether data-producing authors should be contacted as collaborators for reuse
  • the fact that credit is essential, yet so is remembering that our jobs are fundamentally to contribute to scientific progress
  • support for a dynamic CV that includes up-to-date reuse metrics for articles, data, and nontraditional outputs.

If you were there, do you have things to add?  Respond in the comments or on Twitter with #ievobioBof.

I learned a lot from the perspectives of others in the discussion, and I am looking forward to more conversations at future meetings.

July 5, 2010

Studying reuse of GEO datasets in the published literature

Filed under: research - Heather Piwowar @ 12:31 pm

Studying data sharing is great, but what we really want to know about is data reuse.


There are benefits to data sharing and archiving even if the data are never reused: for example, sharing detailed datasets likely discourages fraud. Other benefits may accrue but leave no evidence in the published literature: datasets may be used for education and training, or by interested readers to more deeply understand and validate the data-producing study.

The real value-add of data archiving, though, is in the potential for more efficient and effective scientific progress through data reuse. There have been many calls to quantify the extent and impact of this reuse, that is, to do a cost/benefit analysis. An estimate of the value of reuse would help to make a business case for repository funding, make an ethical case for compelling investigators to bear the personal cost of sharing data, and clarify that sharing even esoteric data is useful. As the Nature Neuroscience editorial Got Data? puts it, “After all, no one ever knocked on your door asking to buy those figurines collecting dust in your cabinet before you listed them on eBay.”

Little research has been done on the patterns and prevalence of reuse. A few superstar success stories need no analysis: GenBank and the Protein Data Bank are reused, heavily, successfully. They have generated important science that would not have been possible otherwise.

They are so successful, though, that people discount them as special cases.

So what does the reuse behaviour look like for other datasets?

We don’t know. It is difficult to track reuse. There have been a few surveys, but they suffer from limited scope and self-reporting biases. I gather that download stats are poorly correlated with perceived value (I need to learn more about this). So let’s track reuse in the published literature.


Yeah. Well. Let’s. Unfortunately it won’t be simple, due to the lack of standards for data citations and the ambiguity of citation contexts:

  • There is no standard identifier for a given dataset. The accession number? DOI? The citation of the data-producing paper? The author’s name? A search strategy or set of inclusion criteria for a set of datasets?
  • Even when there is a standard identifier, or a small set of identifiers, there is no standard practice for referencing the identifier. As a mention in the methods? In the acknowledgements? In a supplementary table? Cited as an official reference?
  • Our current tools for finding and extracting these identifiers are poor. If the mentions are in full-text, we need to do full text queries across a wide range of the published literature. Google Scholar, ISI Web of Science, PubMed Central, Scirus, HighWire Press… they all have serious drawbacks. Citations are easier, but data extraction from ISI Web of Science and Scopus is still very suboptimal and not machine friendly.
  • Finally, intelligence is required in disambiguating the dataset mentions. Is the paper discussing a dataset deposit, a dataset reuse, or something else?
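
As a rough illustration of the extraction step, and only that step, here is a minimal Python sketch that scans pre-split article sections for GEO accession numbers. It assumes mentions use the standard GEO prefixes (GSE, GDS, GSM, GPL) followed by digits. The function name, the section dictionary, and the example sentences and accession numbers are all made up for illustration.

```python
import re
from collections import defaultdict

# GEO accessions: series (GSE), curated datasets (GDS), samples (GSM),
# and platforms (GPL), each followed by a run of digits.
GEO_ACCESSION = re.compile(r"\b(?:GSE|GDS|GSM|GPL)\d+\b")


def find_accession_mentions(sections):
    """Map each GEO accession to the article sections that mention it.

    `sections` is a dict of section name -> text (a hypothetical,
    already-parsed article; real full text is much messier).
    """
    mentions = defaultdict(list)
    for name, text in sections.items():
        # De-duplicate within a section so each section is listed once.
        for accession in sorted(set(GEO_ACCESSION.findall(text))):
            mentions[accession].append(name)
    return dict(mentions)


# Invented example text, not taken from any real paper.
article = {
    "methods": "Expression data were downloaded from GEO (GSE1234 and GSE5678).",
    "acknowledgements": "We thank the authors of GSE1234 for sharing their data.",
    "references": "",
}

print(find_accession_mentions(article))
# {'GSE1234': ['methods', 'acknowledgements'], 'GSE5678': ['methods']}
```

Knowing which section a mention appears in is only a starting point. It does not say whether the paper deposited the dataset or reused it, which is the disambiguation problem in the last bullet, and it does nothing for datasets that are referenced only by citing the data-producing paper.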


It isn’t simple to gather reuse patterns, but it is possible, so I’ve started. I’m keeping open notes on Open Wet Ware on my progress. Early days. Stay tuned.

March 25, 2008

Identifying Data Sharing in Biomedical Literature

Filed under: MyResearch - Heather Piwowar @ 12:04 pm

I emailed AMIA again to ask for clarification on their preprint policy, and quickly received this encouraging response: “Preposting is fine so long as the other sites don’t formally publish the work.” Great news, thanks AMIA.

Note: this brings my blog up-to-date on the research I’ve been doing, with the exception of one paper under review at PLoS Medicine. That one is a complex collaboration. Despite some attempts there isn’t consensus about making it open at this point.

Here is the paper we submitted to the AMIA 2008 Annual Symposium. (AMIA is the American Medical Informatics Association.) Nature Precedings link to appear once it has been posted.

Identifying Data Sharing in Biomedical Literature
Heather A. Piwowar and Wendy W. Chapman

Many policies and projects now encourage investigators to share their raw research data with other scientists. Unfortunately, it is difficult to measure the effectiveness of these initiatives because data can be shared in such a variety of mechanisms and locations. We propose a novel approach to finding shared datasets: using natural language processing (NLP) techniques to identify declarations of dataset sharing within the full text of primary research articles. Using regular expression patterns and machine learning algorithms on open access biomedical literature, our system was able to identify 61% of articles with shared datasets with 80% precision. A simpler version of our classifier achieved higher recall (86%), though lower precision (49%). We believe our results demonstrate the feasibility of this approach and hope to inspire further study of dataset retrieval techniques and policy evaluation.

Full text


My inspiration for this work was the idea of a Data Reuse Registry and associated research. As discussed, a DRR would benefit from automatic identification of data reuse in the biomedical literature. Unfortunately, automatic identification of data reuse is a tough place to start my NLP (natural language processing) journey because I haven’t found any large, pre-existing gold standards of data reuse to use for evaluating such a system (this list of GEO “third party” data reuse papers is a start).

Identifying data sharing is easier: there are available gold standards via database links, and authors tend to use more uniform language in describing sharing than reuse. Automatically detecting data sharing could be useful to my research in other ways as well, down the road, as I look towards further sharing policy evaluation.

This data sharing identification system used very simple NLP techniques. I hope to (and will probably need to) dig into some more complex approaches as I tackle data reuse identification.
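
To give a flavour of what "very simple" can mean here, the following Python sketch flags sentences that look like data sharing declarations using a few regular expressions, then scores the result against hand-labelled examples. The patterns, the example sentences, and the labels are all assumptions invented for illustration. They are not the patterns or data from the paper, and there is no machine learning step.

```python
import re

# Hypothetical patterns, for illustration only (not the paper's patterns).
SHARING_PATTERNS = [
    re.compile(
        r"\b(?:deposited|submitted)\b[^.]*"
        r"\b(?:GEO|Gene Expression Omnibus|ArrayExpress|GenBank)\b",
        re.IGNORECASE,
    ),
    re.compile(r"\baccession (?:number|no\.?|code)s?\b", re.IGNORECASE),
    re.compile(
        r"\bdata (?:are|is) (?:publicly )?available (?:at|from|in|under)\b",
        re.IGNORECASE,
    ),
]


def declares_sharing(text):
    """Return True if any pattern matches the text."""
    return any(pattern.search(text) for pattern in SHARING_PATTERNS)


def precision_recall(predicted, actual):
    """Compute (precision, recall) from parallel lists of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented sentences with hand-assigned labels (True = dataset was shared).
examples = [
    ("The microarray data have been deposited in GEO under accession number GSE1234.", True),
    ("Raw data are available from the authors upon request.", False),
    ("We downloaded the dataset with accession number GSE5678 from GEO.", False),
    ("All sequences were submitted to GenBank.", True),
    ("Samples were hybridized to arrays following the manufacturer's protocol.", False),
]

predicted = [declares_sharing(text) for text, _ in examples]
actual = [label for _, label in examples]
print(precision_recall(predicted, actual))  # (0.5, 1.0)
```

On this toy set the broad patterns catch both sharing statements but also fire on an "available upon request" sentence and on a reuse sentence, so recall is high and precision is low. That is the same kind of trade-off as the simpler classifier in the abstract, though the numbers here mean nothing beyond the five invented sentences.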

If anyone knows of other resources that list specific instances of data reuse, I’d love to hear about them!

March 24, 2008

Envisioning a Biomedical Data Reuse Registry

Filed under: data reuse, Data Reuse Registry, MyResearch - Heather Piwowar @ 9:48 am

An idea I’ve been thinking about recently:

Envisioning a Biomedical Data Reuse Registry

Heather A. Piwowar and Wendy W. Chapman

Repurposing research data holds many benefits for the advancement of biomedicine, yet is very difficult to measure and evaluate. We propose a data reuse registry to maintain links between primary research datasets and studies that reuse this data. Such a resource could help recognize investigators whose work is reused, illuminate aspects of reusability, and evaluate policies designed to encourage data sharing and reuse.

The full benefits of data sharing will only be realized when we can incent investigators to share their data[1] and quantify the value created by data reuse.[2] Current practices for recognizing the provenance of reused data include an acknowledgment, a listing of accession numbers, a database search strategy, and sometimes a citation within the article. These mechanisms make it very difficult to identify and tabulate reuse, and thus to reward and encourage data sharing. We propose a solution: a Data Reuse Registry.

What is a data reuse registry?

We define a Data Reuse Registry (DRR) as a database with links between biomedical research studies and the datasets used within the studies. The reuse articles may be represented as PubMed IDs, and the datasets as accession numbers within established databases or the PubMed IDs of the studies that originated the data.

How would the DRR be populated?

We anticipate several mechanisms for populating the DRR:

  • Voluntary submissions
  • Automatic detection from the literature[3]
  • Prospective submission of reuse plans, followed by automatic tracking

We envision collecting prospective citations in two steps. First, prior to publication, investigators visit a web page and list datasets and accession numbers reused in their research, thereby creating a DRR entry record in the DRR database. In return, the reusing investigators will be given some best-practices free-text language that they can insert into their acknowledgments section, a list of references to the papers that originated the data, some value-add information such as links to other studies that previously reused this data, and a reference to a new DRR entry record. When authors cite this DRR within their reuse study as part of their data use acknowledgement, the second step of DRR data input can be done automatically: citations in the published literature will be mined periodically to discover citations to DRR entries. These citations will be combined with the information provided when the entry was created to explicitly link published papers with the datasets they reused. The result will be searchable by anyone wishing to understand the reuse impact made by an investigator, institution, or database.
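
The two-step process just described can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions, not a design for the registry: the `DRR-000001` identifier format, the class and method names, and the example PubMed ID and accession numbers are all hypothetical.

```python
import re
from dataclasses import dataclass, field

# Hypothetical entry identifier format: "DRR-" followed by six digits.
ENTRY_ID = re.compile(r"\bDRR-\d{6}\b")


@dataclass
class DRREntry:
    entry_id: str
    datasets: list  # accession numbers or PubMed IDs of originating studies
    reuse_pmids: set = field(default_factory=set)  # filled in by step two


class DataReuseRegistry:
    def __init__(self):
        self.entries = {}

    def register_plan(self, datasets):
        """Step one: an investigator lists the datasets they plan to reuse."""
        entry_id = f"DRR-{len(self.entries) + 1:06d}"
        self.entries[entry_id] = DRREntry(entry_id, list(datasets))
        return entry_id

    def mine_article(self, pmid, full_text):
        """Step two: link a published article to any DRR entries it cites."""
        linked = []
        for entry_id in ENTRY_ID.findall(full_text):
            entry = self.entries.get(entry_id)
            if entry is not None:
                entry.reuse_pmids.add(pmid)
                linked.append(entry_id)
        return linked

    def reuse_of(self, dataset):
        """Return the PubMed IDs of articles recorded as reusing a dataset."""
        pmids = set()
        for entry in self.entries.values():
            if dataset in entry.datasets:
                pmids |= entry.reuse_pmids
        return sorted(pmids)


registry = DataReuseRegistry()
entry_id = registry.register_plan(["GSE1234", "GSE5678"])  # invented accessions

# Invented acknowledgement text and PubMed ID.
text = f"Datasets reused in this study are listed in Data Reuse Registry entry {entry_id}."
registry.mine_article("12345678", text)

print(registry.reuse_of("GSE1234"))  # ['12345678']
```

The point of the sketch is that a single citable entry identifier carries the whole list of reused datasets, so the mining step only has to find one well-formed string rather than every accession number scattered through a paper.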

How would the DRR be used?

Information from the DRR could be used to recognize investigators whose work is reused, illuminate aspects of reusability, examine the variety of purposes for which a given dataset is reused, and evaluate policies designed to encourage data sharing and reuse.

While the DRR may not be a comprehensive solution, we believe it represents a starting place for finding solutions to the important problem of evaluating, encouraging, and rewarding data sharing and reuse.

HP is supported by NLM training grant 5T15-LM007059-19 and WC is funded through NLM grant 1 R01LM009427-01.

1. Compete, collaborate, compel. Nat Genet. 2007;39(8).
2. Ball CA, Sherlock G, Brazma A. Funding high-throughput data sharing. Nat Biotechnol. 2004 Sep;22(9):1179-83.
3. Piwowar HA, Chapman WW. Identifying data sharing in the biomedical literature. Submitted to the AMIA Annual Symposium 2008.

[This DRR summary has been submitted as a poster description to AMIA 2008]
