Research Remix

July 18, 2007

ISMB Poster: Examining the uses of shared data

Filed under: conferences, ISMB, opendata, sharingdata — Heather Piwowar @ 9:43 am

I’m longing to catch up with reading and posting and commenting, but it will have to wait a bit longer. I’m packing to go to Vienna, for the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) & 6th European Conference on Computational Biology (ECCB).

I’m presenting a poster. It shows some preliminary results of looking at re-use patterns for microarray data in the PubMed Central literature.  It is up on Nature Precedings (yup, prior to the conference — Nature and ISMB both a-ok with it):

Poster G20
Examining the uses of shared data
Heather Piwowar & Douglas Fridsma
University of Pittsburgh

Does your research area re-use shared datasets?

  • Re-using data has many benefits, including research synergy and efficient resource use
  • Some research areas have tools, communities, and practices which facilitate re-use
  • Identifying these areas will allow us to learn from them, and apply the lessons to areas which underutilize the sharing and re-purposing of scientific data between investigators

Which datasets?
This preliminary analysis examines the re-use of microarray gene expression datasets.
Thousands of microarray gene expression datasets have been deposited in publicly available databases.
Many studies re-use these data, but the purposes of that re-use are not well understood. Here, we examined all publications found in PubMed Central on April 1, 2007 whose full text contained the phrases “microarray” and “gene expression” to find studies which re-used microarray data.

How did we identify re-use?
We developed prototype machine-learning classifiers to identify a) studies containing original microarray data (n=900) and b) studies which instead re-used microarray data (n=250). Preprocessing (Python NLTK) extracted manually-selected keyword frequencies from the full-text publications as features for a Support Vector Machine (SVMlight). The classifier was trained and tested on a manually-labeled set of documents (PLoS articles prior to January 2007 containing the word “microarray,” n=200).
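To make the feature-extraction step concrete, here is a minimal sketch of turning keyword frequencies into a classifier input line. It is an illustration under assumptions, not the code behind the poster: the keyword list is hypothetical (the actual manually-selected keywords are not given here), tokenization uses a plain regular expression rather than NLTK, and the function names are made up for this example. The output follows the sparse `label index:value` text format that SVMlight reads.

```python
import re
from collections import Counter

# Hypothetical keywords for illustration only; the real
# manually-selected keyword list is not given in this post.
KEYWORDS = ["hybridized", "downloaded", "reanalysis", "geo"]


def keyword_frequencies(text, keywords=KEYWORDS):
    """Return the relative frequency of each keyword in the text."""
    # Simple regex tokenizer standing in for NLTK preprocessing.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[k] / total for k in keywords]


def to_svmlight_line(label, features):
    """Format one document as a sparse 'label index:value' line.

    Feature indices start at 1 and zero-valued features are omitted.
    """
    parts = [f"{i}:{v:.6g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([f"{label:+d}"] + parts)


if __name__ == "__main__":
    doc = "We downloaded data from GEO and GEO again"
    # Label +1 is assumed here to mean "re-used data", -1 "original data".
    print(to_svmlight_line(1, keyword_frequencies(doc)))
```

Each labeled document would contribute one such line to a training file; which label denotes which class is a convention chosen for this sketch.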

How did we identify patterns of re-use?
We compared the Medical Subject Headings (MeSH) terms of the two classes: for each term, we estimated the odds that it describes a study with re-used data, compared to the odds that it describes a study with original microarray data. Terms were truncated to comparable levels in the MeSH hierarchy.
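For readers who want the arithmetic spelled out, here is a small sketch of an odds ratio for one MeSH term across the two classes. The counts are toy numbers invented for the example, not results from the poster, and the function names are hypothetical. The truncation helper assumes terms are compared by their dot-separated MeSH tree numbers, which is one plausible way to bring terms to a comparable depth and may not be what was actually done.

```python
def odds(with_term, total):
    """Odds that a study in a class carries the term: p / (1 - p)."""
    without = total - with_term
    if without <= 0:
        raise ValueError("odds undefined when every study has the term")
    return with_term / without


def odds_ratio(reuse_with, reuse_total, orig_with, orig_total):
    """Odds of the term among re-use studies over odds among original-data studies."""
    original_odds = odds(orig_with, orig_total)
    if original_odds == 0:
        raise ValueError("odds ratio undefined when no original study has the term")
    return odds(reuse_with, reuse_total) / original_odds


def truncate_tree_number(tree_number, depth):
    """Keep the first `depth` levels of a dot-separated MeSH tree number."""
    return ".".join(tree_number.split(".")[:depth])


if __name__ == "__main__":
    # Toy counts: 20 of 100 re-use studies and 10 of 100 original-data
    # studies carry the term.
    print(odds_ratio(20, 100, 10, 100))
    print(truncate_tree_number("B01.050.150.900", 2))
```

An odds ratio above 1 means the term is relatively more common among studies that re-used data, and a value below 1 means it is relatively more common among studies with original data.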

Publications with original vs. re-used microarray data have different distributions of MeSH terms (Figure 1), and occur in different proportions across various journals (Figure 2).
Microarray data source (original vs. re-used) did not affect the odds of a study focusing on humans, mice, or invertebrates, whereas publications with re-used data did involve a relatively high proportion of studies involving fungi (odds ratio (OR)=2.4), and a relatively low proportion involving rats, bacteria, viruses, plants, or genetically-altered or inbred animals (OR<0.5) compared to publications with original data.
Trends in odds ratios of MeSH terms for other attributes can be seen in Figure 3.

Although not all research topics can be addressed by re-using existing data, many can. Identifying areas with frequent re-use can highlight best practices to be used when developing research agendas, tools, standards, repositories, and communities in areas which have yet to receive major benefits from shared data.

Future Work
We plan to refine our tool for identifying studies which re-use data, and continue studying and measuring re-use and reusability.

NOTE: typo in previous versions of the Nature Precedings abstract (should be OR<0.5 not OR<0.05).

I feel this is a slightly interesting, hypothesis-generating piece of preliminary work.  I think that it contributes most in raising the issue of data re-use.  I do hope to refine my “automatic reuse identifiers” and dig into the details and validation a bit more.

Comments and feedback welcome and encouraged, especially to help me understand if others find this interesting.

Edited to add a bit of content and update the version url.   Question:  does editing my posts do bad things to people getting them via RSS feed?  If so, please let me know.


  1. If you are going to ISMB/ECCB make sure you hunt down the PLoS folk who are going to be there: Evie Brown and Mark Patterson. Also PLoS Computational Biology has put together some sessions at the meeting. Details on the conference website under PLoS Track.

    Comment by Chris Surridge, Managing Editor PLoS ONE — July 18, 2007 @ 10:32 am

  2. Will do. Thanks for the suggestion!

    Comment by Heather Piwowar — July 18, 2007 @ 10:44 am
