Research Remix

March 24, 2008

Envisioning a Biomedical Data Reuse Registry

Filed under: data reuse, Data Reuse Registry, MyResearch — Heather Piwowar @ 9:48 am
An idea I’ve been thinking about recently:

Envisioning a Biomedical Data Reuse Registry

Heather A. Piwowar and Wendy W. Chapman

Abstract
Repurposing research data holds many benefits for the advancement of biomedicine, yet is very difficult to measure and evaluate. We propose a data reuse registry to maintain links between primary research datasets and studies that reuse this data. Such a resource could help recognize investigators whose work is reused, illuminate aspects of reusability, and evaluate policies designed to encourage data sharing and reuse.
Motivation
The full benefits of data sharing will only be realized when we can motivate investigators to share their data[1] and quantify the value created by data reuse.[2] Current practices for recognizing the provenance of reused data include an acknowledgment, a listing of accession numbers, a database search strategy, and sometimes a citation within the article. These mechanisms make it very difficult to identify and tabulate reuse, and thus to reward and encourage data sharing. We propose a solution: a Data Reuse Registry.
What is a data reuse registry?
We define a Data Reuse Registry (DRR) as a database with links between biomedical research studies and the datasets used within the studies. The reuse articles may be represented as PubMed IDs, and the datasets as accession numbers within established databases or the PubMed IDs of the studies that originated the data.
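As a rough illustration of this definition, one DRR link could be modeled as a reuse article plus the datasets it draws on. This is only a sketch: the class names, field names, and identifier values below are hypothetical and are not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DatasetRef:
    """A reused dataset, identified either by an accession number in an
    established database or by the PubMed ID of the originating study."""
    kind: str            # "accession" or "pmid" (hypothetical labels)
    identifier: str
    database: str = ""   # repository name for accessions; empty for PubMed IDs


@dataclass
class ReuseLink:
    """One registry link: a reuse article and the datasets used within it."""
    reuse_pmid: str
    datasets: List[DatasetRef] = field(default_factory=list)


# Placeholder identifiers, invented for illustration only.
link = ReuseLink(
    reuse_pmid="00000001",
    datasets=[
        DatasetRef(kind="accession", identifier="ACC0001", database="ExampleDB"),
        DatasetRef(kind="pmid", identifier="00000002"),
    ],
)
```

Allowing either kind of dataset reference in the same list mirrors the two representations described above.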
How would the DRR be populated?
We anticipate several mechanisms for populating the DRR:
* Voluntary submissions
* Automatic detection from the literature[3]
* Prospective submission of reuse plans, followed by automatic tracking
We envision collecting prospective citations in two steps. First, prior to publication, investigators visit a web page and list the datasets and accession numbers reused in their research, thereby creating an entry in the DRR. In return, the reusing investigators receive best-practice free-text language to insert into their acknowledgments section, a list of references to the papers that originated the data, value-added information such as links to other studies that previously reused the same data, and a reference to the new DRR entry. Second, once authors cite this DRR entry within their reuse study as part of their data use acknowledgment, the remaining data input can be done automatically: the published literature will be mined periodically to discover citations to DRR entries. These citations will be combined with the information provided when the entry was created to explicitly link published papers with the datasets they reused. The result will be searchable by anyone wishing to understand the reuse impact made by an investigator, institution, or database.
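The two-step flow might look something like the sketch below. It assumes a made-up entry identifier format ("DRR:" followed by six digits) and an in-memory store; the `Registry` class, its method names, and the acknowledgment wording are all hypothetical, since the proposal does not specify any of them.

```python
import re
from typing import Dict, List

# Hypothetical citation format for a DRR entry, e.g. "DRR:000001".
DRR_ID_PATTERN = re.compile(r"\bDRR:\d{6}\b")


class Registry:
    def __init__(self) -> None:
        self._entries: Dict[str, List[str]] = {}  # entry id -> dataset ids
        self.links: Dict[str, List[str]] = {}     # reuse paper id -> dataset ids

    def register(self, datasets: List[str]) -> str:
        """Step 1: an investigator lists reused datasets before publication."""
        entry_id = f"DRR:{len(self._entries) + 1:06d}"
        self._entries[entry_id] = list(datasets)
        return entry_id

    def acknowledgment(self, entry_id: str) -> str:
        """Suggested text for the acknowledgments section (invented wording)."""
        datasets = ", ".join(self._entries[entry_id])
        return f"This study reused datasets {datasets} ({entry_id})."

    def mine(self, paper_id: str, full_text: str) -> List[str]:
        """Step 2: find cited DRR entries in a published paper and link
        the paper to the datasets recorded when each entry was created."""
        cited = sorted(set(DRR_ID_PATTERN.findall(full_text)))
        known = [entry_id for entry_id in cited if entry_id in self._entries]
        for entry_id in known:
            self.links.setdefault(paper_id, []).extend(self._entries[entry_id])
        return known


registry = Registry()
entry = registry.register(["ACC0001", "ACC0002"])
published_text = "Methods ... " + registry.acknowledgment(entry)
registry.mine("paper-1", published_text)
```

After the last call, `registry.links` maps the published paper to the datasets named at registration time, which is the explicit paper-to-dataset link the registry is meant to expose.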
How would the DRR be used?
Information from the DRR could be used to recognize investigators whose work is reused, illuminate aspects of reusability, examine the variety of purposes for which a given dataset is reused, and evaluate policies designed to encourage data sharing and reuse.
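To make the recognition use case concrete, here is a small sketch of how reuse might be tallied from registry links. The link tuples, the dataset-to-investigator mapping, and the function names are all invented for illustration; a real registry would need to decide what counts as one instance of reuse.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical registry contents: (reuse paper, reused dataset) pairs.
links = [
    ("paper-A", "dataset-1"),
    ("paper-B", "dataset-1"),
    ("paper-B", "dataset-2"),
]

# Hypothetical mapping from each dataset to the investigator who produced it.
producers = {"dataset-1": "Investigator X", "dataset-2": "Investigator Y"}


def reuse_counts_by_dataset(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count the distinct papers that reused each dataset."""
    return Counter(dataset for _, dataset in set(pairs))


def reuse_counts_by_producer(
    pairs: Iterable[Tuple[str, str]], owner: Dict[str, str]
) -> Counter:
    """Count reuse instances (distinct paper/dataset pairs) per investigator."""
    counts: Counter = Counter()
    for _, dataset in set(pairs):
        counts[owner.get(dataset, "unknown")] += 1
    return counts


by_dataset = reuse_counts_by_dataset(links)
by_producer = reuse_counts_by_producer(links, producers)
```

The same grouping could be applied per institution or per database by swapping in a different mapping.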
Conclusion
While the DRR may not be a comprehensive solution, we believe it represents a starting place for finding solutions to the important problem of evaluating, encouraging, and rewarding data sharing and reuse.
Acknowledgments
HP is supported by NLM training grant 5T15-LM007059-19 and WC is funded through NLM grant 1 R01LM009427-01.
References
1. Compete, collaborate, compel. Nat Genet. 2007;39(8).
2. Ball CA, Sherlock G, Brazma A. Funding high-throughput data sharing. Nat Biotechnol. 2004 Sep;22(9):1179-83.
3. Piwowar HA, Chapman WW. Identifying data sharing in the biomedical literature. Submitted to the AMIA Annual Symposium 2008.
[This DRR summary has been submitted as a poster description to AMIA 2008]

6 Comments

  1. It’s a great idea – but if the general problem is one of persuading people to acknowledge re-use in a consistent way (or more generally to do anything that falls outside the normal requirements), then you fall back on automatic detection, and if this works then you’ve solved the initial problem. Mind you, if it is a good route to solving that problem then I’m all for it. Well, I’m all for it anyway if it can be got together.

    Comment by Cameron Neylon — March 24, 2008 @ 10:45 am

  2. Cameron, thanks for the comments.

    I think automatic detection can identify some (a great deal? I don’t know yet) reuse, but not as precisely or thoroughly as the reusers themselves could. That is why the hybrid approach seemed to make sense: get reusers to contribute what they will and then take a best guess on the rest.

    Yup, I think there are two problems: 1) coming up with a consistent way to acknowledge reuse, and 2) getting it widely adopted. I don’t think there is a consensus for (1) yet, so that is what this proposal is exploring. After we have a (1), then (2) could be encouraged through journal/database policies or the like?

    It is all a bit utopian, I know :)

    Comment by Heather Piwowar — March 24, 2008 @ 1:06 pm

  3. […] inspiration for this work was the idea of a Data Reuse Registry and associated research. As discussed, a DRR would benefit from automatic identification of data […]

    Pingback by Identifying Data Sharing in Biomedical Literature « Research Remix — March 25, 2008 @ 12:22 pm

  4. […] my early research findings for visibility, feedback, and attribution. I recently submitted a research proposal. Precedings does a spot-check of all submissions to verify appropriateness. It usually takes a day […]

    Pingback by A Centralized Proposal Repository « Research Remix — April 2, 2008 @ 8:17 am

  5. You may be interested to see CKAN.net. It is a registry of knowledge resources (including datasets) that are available for re-use. See, e.g.

    http://ckan.net/tag/read/biology

    It is still in very early stages – but hopes to go some way to addressing some of the issues you mention. Having information on how a given resource has been re-used is a really interesting idea.

    We’re currently planning a workshop for this autumn looking at how we can improve CKAN’s support for scientific data – including the possibility of creating ‘plugins’ for different existing metadata schema.

    It would be great if you might be interested in participating!

    Warm regards,

    Jonathan

    Comment by Jonathan Gray — September 1, 2008 @ 10:41 am

  6. Jonathan, neat project, thank you for the pointer.

    Agreed, a data set registry like CKAN combined with a Reuse Registry could disambiguate both the data sets themselves and the instances of reuse. Ohhh the possibilities!

    I’ll be in touch to hear more about your roadmap.
    Heather

    Comment by Heather Piwowar — September 2, 2008 @ 8:23 am

