Research Remix

September 11, 2008

PSB Open Science workshop talk abstract

Filed under: conferences, MyResearch, opendata, openscience, sharingdata — Heather Piwowar @ 10:39 am

The program for the Open Science workshop at PSB 2009 has been posted.  Great diversity of topics… I’m really looking forward to it.

My talk abstract is below… comments and suggestions are welcome!

Measuring the adoption of Open Science

Why measure the adoption of Open Science?

As we seek to embrace and encourage participation in open science, understanding patterns of adoption will allow us to make informed decisions about tools, policies, and best practices. Measuring adoption over time will allow us to note progress and identify opportunities to learn and improve. It is also just plain interesting to see where we are, where we aren’t, and where we might go!

What can we measure?

Many attributes of open science can be studied, including open access publications, open source code, open protocols, open proposals, open peer-review, open notebook science, open preprints, open licenses, open data, and the publishing of negative results. This presentation will focus on measuring the prevalence with which investigators share their research datasets.

What measurements have been done? How? What have we learned?

Various methods have been used to assess adoption of open science: reviews of policies and mandates, case studies of experiences, surveys of investigators, and analyses of demonstrated data sharing behavior. We’ll briefly summarize key results.

Future research?

The presentation will conclude by highlighting future research areas for enhancing and applying our understanding of open data adoption.

March 21, 2008

Eating my own dogfood

Filed under: sharingdata — Heather Piwowar @ 8:51 am

I guess eating dogfood really refers to companies who use their own software, rather than researchers who apply their research topics to their own research. “Practice what I preach” is more accurate, but less fun. And more, well, preachy.

ANYWAY, the point is, as I’m doing all of this research into data sharing behaviour, I’m making a point of sharing my own data. I’m not sure that anyone will ever want to use it for anything, but who knows? Maybe. From an editorial in Nature Neuroscience [doi:10.1038/nn0807-931]:

Does anyone want your data? That’s hard to predict, but the easier it becomes to request data and to receive credit for sharing it, the more likely people are to ask. After all, no one ever knocked on your door asking to buy those figurines collecting dust in your cabinet before you listed them on eBay. Your data, too, may simply be awaiting an effective matchmaker.

It also lets me experience what it feels like to share data. It isn’t the same, I know, as sharing data from a multi-year, career-making, blood sweat and tears project, but it is something.

Sharing data is indeed hard. Specifically:

  • time consuming
  • decision-intensive (where to put it? what to share? what format to share it in?)
  • scary (what if someone finds a mistake?)
  • embarrassing (the data isn’t nearly as X as I wish I had the time to make it)

I also get to experience some of the first-hand benefits:

  • it forces additional organization
  • it helps me find my own data again later, from any computer!
  • it makes me feel proud to have made my science transparent (albeit after the fact, rather than as open notebook science)

I’m a firm believer in continual improvement. That means that I’ve shared my data now, in the best way that I have time for, rather than waiting until I can share it the way that I’d ideally like to. There are lots of things I’d like to improve:

  • Put it somewhere central and permanent (not clear where, for the esoteric dataset types that I have, but there are some neat possibilities)
  • Put it in a semantic format (!!!)
  • Document it better
  • Tag it so people can find it
  • ….

I’ll keep exploring and implementing these things as I get a chance.
If you want to put your data up but have hesitations about it, I say do it to the best of your ability right now given your current constraints. It isn’t perfect? I know, but perfect is the enemy of good enough.


  • Ditto for statistical scripts, but that’s another post.
  • Blog as data: bbgm used Dapper as a way to Semantify [the bbgm] site. Sounds fun, I’d like to try when I get a minute.
  • Have you heard this joke? “Before you criticize someone, you should walk a mile in their shoes. That way, when you criticize them, you’re a mile away and you have their shoes.” I love that one :)

July 18, 2007

Shared data? Open data?

Filed under: opendata, sharingdata — Heather Piwowar @ 9:49 am

Quick wondering.  My research is on data re-use.  I struggle with what to call the source datasets.  I’d like to call them “open data” but they aren’t, necessarily.  Sometimes not free, and usually not open in a licensing sense.  I’ve been calling them “shared data” which seems ok, but isn’t mainstream and so doesn’t help link the work in to others who are perhaps interested in the same ideas.  Publicly-available data?  Even more unwieldy.

I’m on the lookout for a better phrase. Let me know if you have any suggestions?


ISMB Poster: Examining the uses of shared data

Filed under: conferences, ISMB, opendata, sharingdata — Heather Piwowar @ 9:43 am

I’m longing to catch up with reading and posting and commenting, but it will have to wait a bit longer. I’m packing to go to Vienna, for the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) & 6th European Conference on Computational Biology (ECCB).

I’m presenting a poster. It shows some preliminary results of looking at re-use patterns for microarray data in the PubMed Central literature.  It is up on Nature Precedings (yup, prior to the conference — Nature and ISMB both a-ok with it):

Poster G20
Examining the uses of shared data
Heather Piwowar & Douglas Fridsma
University of Pittsburgh

Does your research area re-use shared datasets?

  • Re-using data has many benefits, including research synergy and efficient resource use
  • Some research areas have tools, communities, and practices which facilitate re-use
  • Identifying these areas will allow us to learn from them, and apply the lessons to areas which underutilize the sharing and re-purposing of scientific data between investigators

Which datasets?
This preliminary analysis examines the re-use of microarray gene expression datasets.
Thousands of microarray gene expression datasets have been deposited in publicly available databases.
Many studies reuse this data, but the purposes of this re-use are not well understood. Here, we examined all publications found in PubMed Central on April 1, 2007 whose full-text contained the phrases “microarray” and “gene expression” to find studies which re-used microarray data.

How did we identify re-use?
We developed prototype machine-learning classifiers to identify a) studies containing original microarray data (n=900) and b) studies which instead re-used microarray data (n=250). Preprocessing (Python NLTK) extracted manually-selected keyword frequencies from the full-text publications as features for a Support Vector Machine (SVMlight). The classifier was trained and tested on a manually-labeled set of documents (PLoS articles prior to January 2007 containing the word “microarray,” n=200).
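To make the feature-extraction step concrete, here is a minimal sketch of turning a publication’s full text into per-keyword frequency features. The keyword list, tokenizer, and example sentence are all illustrative assumptions of mine, not the actual pipeline (which used NLTK for preprocessing and SVMlight for classification):

```python
# Hypothetical sketch of keyword-frequency feature extraction.
# KEYWORDS is an invented list for illustration; the real classifier
# used manually-selected keywords and NLTK preprocessing.
import re
from collections import Counter

KEYWORDS = ["microarray", "deposited", "downloaded", "geo", "reanalyzed"]

def keyword_features(full_text: str) -> list:
    """Return the relative frequency of each keyword in one publication."""
    tokens = re.findall(r"[a-z]+", full_text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty text
    return [counts[k] / total for k in KEYWORDS]

# One feature vector per publication; these would then be fed to the SVM.
doc = "We downloaded microarray data deposited in GEO and reanalyzed it."
features = keyword_features(doc)
```

Each document becomes a fixed-length numeric vector, which is the form a linear SVM expects as input.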

How did we identify patterns of re-use?
We compared the Medical Subject Headings (MeSH) terms of the two classes to estimate the odds that a specific MeSH term would be used given all studies with original microarray data, compared to the odds of the same term describing studies with re-used data. Terms were truncated to comparable levels in the MeSH hierarchy.
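The comparison above boils down to a standard odds ratio per MeSH term. A small sketch, with invented counts (the real figures are in the poster):

```python
# Odds ratio of a MeSH term appearing on re-use studies vs. original-data
# studies. The counts below are made up for illustration only.

def odds_ratio(term_reuse, n_reuse, term_orig, n_orig):
    """Odds of a term among re-use studies divided by its odds among
    original-data studies. OR > 1 means the term is relatively more
    common among studies that re-used data."""
    odds_reuse = term_reuse / (n_reuse - term_reuse)
    odds_orig = term_orig / (n_orig - term_orig)
    return odds_reuse / odds_orig

# e.g. a term tagged on 40 of 250 re-use studies vs. 60 of 900 original
or_example = odds_ratio(40, 250, 60, 900)  # ~2.67, enriched in re-use studies
```

An OR near 1 means the data source (original vs. re-used) makes no difference for that term, which is how the human/mouse/invertebrate result reads.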

Publications with original vs. re-used microarray data have different distributions of MeSH terms (Figure 1), and occur in different proportions across various journals (Figure 2).
Microarray data source (original vs. re-used) did not affect the odds of a study focusing on humans, mice, or invertebrates. Publications with re-used data did, however, involve a relatively high proportion of studies involving fungi (odds ratio (OR)=2.4), and a relatively low proportion involving rats, bacteria, viruses, plants, or genetically-altered or inbred animals (OR<0.5), compared to publications with original data.
Trends in odds ratios of MeSH terms for other attributes can be seen in Figure 3.

Although not all research topics can be addressed by re-using existing data, many can. Identifying areas with frequent re-use can highlight best practices to be used when developing research agendas, tools, standards, repositories, and communities in areas which have yet to receive major benefits from shared data.

Future Work
We plan to refine our tool for identifying studies which re-use data, and continue studying and measuring re-use and reusability.

NOTE: typo in previous versions of the Nature Precedings abstract (should be OR<0.5 not OR<0.05).

I feel this is a slightly interesting, hypothesis-generating piece of preliminary work.  I think that it contributes most in raising the issue of data re-use.  I do hope to refine my “automatic reuse identifiers” and dig into the details and validation a bit more.

Comments and feedback welcome and encouraged, especially to help me understand if others find this interesting.

Edited to add a bit of content and update the version url.   Question:  does editing my posts do bad things to people getting them via RSS feed?  If so, please let me know.
