Research Remix

August 31, 2010

Dear publisher, is the data open?

Filed under: opendata — Heather Piwowar @ 10:13 pm

Publishers make article text available under a variety of copyright terms. Data, however, are not copyrightable. So what are we allowed to do with them, these datums and datasets within and beside article text? It isn’t clear. Few publisher sites say. It matters. So let’s ask.

On behalf of the Open Knowledge Foundation and benefitting from very useful feedback from a number of colleagues, Peter Murray-Rust and I recently sent email to PLoS, BMC, and Nature, asking them to confirm the openness of their data. The email is below. A slightly different email was sent to Mendeley, asking whether their data is open. All email queries and responses can be browsed at the Is It Open Data website. Furthermore, you can feel free to initiate your own enquiry from there. (And we’d love volunteers to help tweak the code to make the enquiry site even more useful.)

Peter Murray-Rust will highlight the responses-to-date in the #solo10 Green Chain Reaction session at the Science Online London conference later this week.

While this effort won’t answer all surrounding questions, hopefully it will clarify a few policies, illuminate outstanding issues, and liberate some text and data mining efforts on the way.

Subject: Enquiry about data openness at [Publisher]

Dear [Publisher],

I’m a postdoc researcher with NESCent, studying scientific data sharing and reuse. I’m writing to you, with Peter Murray-Rust, on behalf of the Open Knowledge Foundation. The Open Knowledge Foundation (OKF) is a non-profit global organization dedicated to the creation, dissemination and labelling of Open Knowledge.

On behalf of the OKF, we are writing to a large number of science publishers to ask for confirmation of their policies with respect to data published within their journals.

There is now great public interest in the Open availability of scientific data for validating scientific findings, detecting fraud and exploring new hypotheses. It is generally accepted by publishers that data per se are not copyrightable: several statements by publisher associations have made this point explicitly. The Association of Learned and Professional Society Publishers (ALPSP) and the International Association of Scientific, Technical, & Medical Publishers (STM) issued a joint statement in 2006 recommending that “research data should be as widely available as possible.” The 2007 Brussels Declaration from the STM states in part:

“Raw research data should be made freely available to all researchers.
Publishers encourage the public posting of the raw data outputs of research.
Sets or sub-sets of data that are submitted with a paper to a journal should
wherever possible be made freely accessible to other scholars.”

Combined with the acceptance and increasingly widespread adoption of the Panton Principles, it is now possible to articulate policies that are consistent with the publication and reuse of Open Data.

We would like to ask you for clarification on several points with respect to your journals. It will help everyone if your answers are clear, so that users of your material can know what they may and may not do without requesting further permission.

1. May users extract raw data and metadata (contextual facts about data collection) from supplementary information published in your journal?

2. May users extract raw data and metadata from figures, tables, and text in the narrative of your published articles?

3. May users extract this information from freely available articles and supplementary information, as well as those that are available by subscription only? For the latter, users would obtain access through an existing subscription.

4. May the extracted data be used as Open Data [1,2] without discrimination against users, groups, or fields of endeavor?

5. May users expose the extracted data as Open Data [1,2], in a manner consistent with the Panton Principles? Specifically, may they expose the extracted data on the internet under a Public Domain dedication such as the PDDL or a CC0 waiver?

6. May users obtain articles and supplementary materials (other than audio and video) from your website via automated means for the purposes of extracting raw data, if it is done in a manner that does not place undue burden on your resources? Users would obtain access through an existing subscription where necessary.

7. Will you consider displaying the OKF’s “Open Data” button as a means of clarifying to readers and users the Open parts of your material?

Our questions are being asked through the OKF’s IsItOpen(Data) service, which has been designed to clarify in what sense published and online datasets are actually open. IsItOpen(Data) saves everyone time by allowing a question to be asked just once and making the reply permanently visible on a high-profile site.

On behalf of the scientific community, thank you in advance for your response. The clear labelling of Openness will save scientists hundreds of years’ work per year in asking permission and speculating. Enabling open access data, both for use and reuse, will help to validate published findings, discourage fraud and misconduct, and open up new research areas. Your clear support for these principles will demonstrate the value you place on these activities and surely benefit science.

We look forward to hearing from you. Could you let us know the timeframe in which we might expect a response?


Heather Piwowar,

Peter Murray-Rust,

on behalf of the Open Knowledge Foundation,



Sent by “Is It Open Data?” A service which helps scholars (and others) to request information about the status and licensing of data and content.

Disclaimer: This message and any reply that you make will be published on the internet for anyone to access and copy. For more information see:

ETA: Removed link to responses for now.

September 11, 2008

PSB Open Science workshop talk abstract

Filed under: conferences, MyResearch, opendata, openscience, sharingdata — Heather Piwowar @ 10:39 am

The program for the Open Science workshop at PSB 2009 has been posted.  Great diversity of topics… I’m really looking forward to it.

My talk abstract is below… comments and suggestions are welcome!

Measuring the adoption of Open Science

Why measure the adoption of Open Science?

As we seek to embrace and encourage participation in open science, understanding patterns of adoption will allow us to make informed decisions about tools, policies, and best practices. Measuring adoption over time will allow us to note progress and identify opportunities to learn and improve. It is also just plain interesting to see where we are, where we aren’t, and where we might go!

What can we measure?

Many attributes of open science can be studied, including open access publications, open source code, open protocols, open proposals, open peer-review, open notebook science, open preprints, open licenses, open data, and the publishing of negative results. This presentation will focus on measuring the prevalence with which investigators share their research datasets.

What measurements have been done? How? What have we learned?

Various methods have been used to assess adoption of open science: reviews of policies and mandates, case studies of experiences, surveys of investigators, and analyses of demonstrated data sharing behavior. We’ll briefly summarize key results.

Future research?

The presentation will conclude by highlighting future research areas for enhancing and applying our understanding of open data adoption.

July 18, 2007

Shared data? Open data?

Filed under: opendata, sharingdata — Heather Piwowar @ 9:49 am

Quick wondering.  My research is on data re-use.  I struggle with what to call the source datasets.  I’d like to call them “open data” but they aren’t, necessarily.  Sometimes not free, and usually not open in a licensing sense.  I’ve been calling them “shared data” which seems ok, but isn’t mainstream and so doesn’t help link the work in to others who are perhaps interested in the same ideas.  Publicly-available data?  Even more unwieldy.

I’m on the lookout for a better phrase. Let me know if you have any suggestions?

Powered by ScribeFire.

ISMB Poster: Examining the uses of shared data

Filed under: conferences, ISMB, opendata, sharingdata — Heather Piwowar @ 9:43 am

I’m longing to catch up with reading and posting and commenting, but it will have to wait a bit longer. I’m packing to go to Vienna, for the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB)
& 6th European Conference on Computational Biology (ECCB).

I’m presenting a poster. It shows some preliminary results of looking at re-use patterns for microarray data in the PubMed Central literature.  It is up on Nature Precedings (yup, prior to the conference — Nature and ISMB both a-ok with it):

Poster G20
Examining the uses of shared data
Heather Piwowar & Douglas Fridsma
University of Pittsburgh

Does your research area re-use shared datasets?

  • Re-using data has many benefits, including research synergy and efficient resource use
  • Some research areas have tools, communities, and practices which facilitate re-use
  • Identifying these areas will allow us to learn from them, and apply the lessons to areas which underutilize the sharing and re-purposing of scientific data between investigators

Which datasets?
This preliminary analysis examines the re-use of microarray gene expression datasets.
Thousands of microarray gene expression datasets have been deposited in publicly available databases.
Many studies reuse this data, but it is not well understood for what purposes. Here, we examined all publications found in PubMed Central on April 1, 2007 whose full-text contained the phrases “microarray” and “gene expression” to find studies which re-used microarray data.

How did we identify re-use?
We developed prototype machine-learning classifiers to identify a) studies containing original microarray data (n=900) and b) studies which instead re-used microarray data (n=250). Preprocessing (Python NLTK) extracted manually-selected keyword frequencies from the full-text publications as features for a Support Vector Machine (SVMlight). The classifier was trained and tested on a manually-labeled set of documents (PLoS articles prior to January 2007 containing the word “microarray,” n=200).
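The feature-extraction step can be sketched roughly as follows. This is a minimal illustration, not the poster's actual pipeline: the keyword list here is hypothetical (the manually-selected terms are not given in the post), and the SVM training itself (done with SVMlight in the poster) is omitted.

```python
from collections import Counter
import re

# Hypothetical keywords standing in for the manually-selected terms
# described in the poster; the real keyword list is not given in the post.
KEYWORDS = ["reanalyzed", "deposited", "downloaded", "geo", "hybridized"]

def keyword_frequencies(full_text, keywords=KEYWORDS):
    """Return per-keyword relative frequencies for one article's full text.

    Each publication becomes a fixed-length numeric feature vector,
    suitable as input to a linear classifier such as an SVM.
    """
    tokens = re.findall(r"[a-z]+", full_text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty input
    return [counts[k] / total for k in keywords]

# Toy example: a sentence suggesting data re-use.
features = keyword_frequencies("We downloaded and reanalyzed data deposited in GEO.")
```

Each article's vector would then be labeled (original-data vs. re-used-data) and passed to the SVM for training.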

How did we identify patterns of re-use?
We compared the Medical Subject Heading (MeSH) of the two classes to estimate the odds that a specific MeSH term would be used given all studies with original microarray data, compared to the odds of the same term describing studies with re-used data. Terms were truncated to comparable levels in the MeSH hierarchy.
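The comparison described above reduces to an odds ratio computed from a 2x2 table of term counts. A minimal sketch, using illustrative counts rather than the poster's actual data:

```python
def odds_ratio(term_reuse, total_reuse, term_orig, total_orig):
    """Odds ratio that a MeSH term describes re-use studies vs. original-data studies.

    term_reuse / term_orig: studies in each class tagged with the term.
    total_reuse / total_orig: total studies in each class.
    """
    odds_reuse = term_reuse / (total_reuse - term_reuse)
    odds_orig = term_orig / (total_orig - term_orig)
    return odds_reuse / odds_orig

# Toy counts (not the poster's data): suppose 50 of 250 re-use studies
# and 90 of 900 original-data studies carry a given MeSH term.
or_example = odds_ratio(50, 250, 90, 900)
```

An odds ratio above 1 means the term is relatively more common among re-use publications (as with fungi in the poster, OR=2.4); below 1, relatively more common among original-data publications.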

Publications with original vs. re-used microarray data have different distributions of MeSH terms (Figure 1), and occur in different proportions across various journals (Figure 2).
Microarray data source (original vs. re-used) did not affect the odds of a study focusing on humans, mice, or invertebrates. Publications with re-used data did, however, involve a relatively high proportion of studies involving fungi (odds ratio (OR)=2.4), and relatively low proportions involving rats, bacteria, viruses, plants, or genetically-altered or inbred animals (OR<0.5), compared to publications with original data.
Trends in odds ratios of MeSH terms for other attributes can be seen in Figure 3.

Although not all research topics can be addressed by re-using existing data, many can. Identifying areas with frequent re-use can highlight best practices to be used when developing research agendas, tools, standards, repositories, and communities in areas which have yet to receive major benefits from shared data.

Future Work
We plan to refine our tool for identifying studies which re-use data, and continue studying and measuring re-use and reusability.

NOTE: typo in previous versions of the Nature Precedings abstract (should be OR<0.5 not OR<0.05).

I feel this is a slightly interesting, hypothesis-generating piece of preliminary work.  I think that it contributes most in raising the issue of data re-use.  I do hope to refine my “automatic reuse identifiers” and dig into the details and validation a bit more.

Comments and feedback welcome and encouraged, especially to help me understand if others find this interesting.

Edited to add a bit of content and update the version url.   Question:  does editing my posts do bad things to people getting them via RSS feed?  If so, please let me know.

July 12, 2007

Presentation on Citation Rate for Shared Data

Filed under: citations, conferences, opendata, publishingdata — Heather Piwowar @ 10:36 am

Whoosh… where has the time gone.

A few weeks ago, I attended and presented at the NLM Biomedical Informatics Trainee conference. My presentation was well-received, despite (or perhaps because of) the fact that it was the last one on the last day. The questions period was more of a “rah-rah” time, as it turned out, with several people jumping in to make more broad statements in support of open access and author addendums. Great stuff.

The talk was a quick 15-minute overview of this paper:

Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3):e308.

I’ve posted my slides at Nature Precedings; the PDF file is of the PowerPoint notes pages which contain the full-text of my talk.

Comments, questions, votes: I’d love to hear from you. Turning comments off on this post, so that any discussion can be undiluted at the Nature Precedings site (and/or at the full-text at PLoS ONE). Please note the CC licence; use this work as you will.

Other talks at the conference which I found relevant to my work included:

  • Sharing Personal Health Information within Social Networks – Meredith Skeels, University of Washington
  • Contextual Analysis of Variation and Quality in Human-Curated Gene Ontology Annotations – W. John MacMullen, University of North Carolina, Chapel Hill
  • A Study of Experimental Information Management in Biomedical Research – Nicholas Anderson, University of Washington
  • Sequential Search Result Refinement of the Medical Literature – Len Tanaka, University of Texas School of Health Information Sciences at Houston

oh, and Stanford in June is right up there with Paris in the springtime. Wow, beautiful!

May 22, 2007

Nonresponse to data sharing requests

Filed under: motivations, opendata, OpenDataProgressReport — Heather Piwowar @ 10:37 am

A few years ago, as I expressed frustration due to lack of a reply from a corresponding author, a professor summarized his experience: one third of authors do not reply when contacted, one third reply but are not able or willing to supply requested data, and one third reply and do supply the information.

I’ve since run across two published reports which quantify the nonresponse to data sharing requests. Does anyone have others?

As reported in a Nature editorial:
[Nature 444, 653-654 (7 December 2006) | doi:10.1038/444653b; Published online 6 December 2006]

The need for more data sharing has just been amply demonstrated by Jelte Wicherts, a psychologist specializing in research methods at the University of Amsterdam, who tried to check out the robustness of statistical analyses in papers published in top psychology journals.

He selected the November and December 2004 issues of four journals published by the American Psychological Association (APA), which requires its authors to agree to share their data with other researchers after publication. In June 2005, Wicherts wrote to each corresponding author requesting data, in full confidence, for simple reanalysis. Six months and several hundred e-mails later, he abandoned the mission, having received only a quarter of the data sets. He reported his failure in an APA journal in October (J. M. Wicherts et al. Am. Psychol. 61, 726–728; 2006).

The abstract of the original article:
[Wicherts JM et al. The poor availability of psychological research data for reanalysis. Am. Psychol. 61, 726–728; 2006]

The origin of the present comment lies in a failed attempt to obtain, through e-mailed requests, data reported in 141 empirical articles recently published by the American Psychological Association (APA). Our original aim was to reanalyze these data sets to assess the robustness of the research findings to outliers. We never got that far. In June 2005, we contacted the corresponding author of every article that appeared in the last two 2004 issues of four major APA journals. Because their articles had been published in APA journals, we were certain that all of the authors had signed the APA Certification of Compliance With APA Ethical Principles, which includes the principle on sharing data for reanalysis. Unfortunately, 6 months later, after writing more than 400 e-mails, and sending some corresponding authors detailed descriptions of our study aims, approvals of our ethical committee, signed assurances not to share data with others, and even our full resumes, we ended up with a meager 38 positive reactions and the actual data sets from 64 studies (25.7% of the total number of 249 data sets). This means that 73% of the authors did not share their data.

The second example (also referenced in a prominent editorial) is slightly more positive, but still disappointing.
[Kyzas PA, Loizou KT, Ioannidis JPA. Selective reporting biases in cancer prognostic factor studies. J Natl Cancer Inst 2005;97:1043–55]
[L. M. McShane, D. G. Altman, and W. Sauerbrei. Identification of Clinically Useful Cancer Prognostic Factors: What Are We Missing? J Natl Cancer Inst, July 20, 2005; 97(14): 1023 – 1025.]

…when a report suggested that mortality data had been collected, but no usable data were available in the publication, we communicated with the primary investigators. When there was no response within 2 months, a second communication attempt was made.
…For 22 of 64 studies, even though we contacted their primary investigators, we could not retrieve any additional data. Seventeen of the primary investigators did not reply at all; and five responded and stated that they were not able to retrieve the raw data.

One third, one quarter, two-thirds.

What a sorry state of affairs.
In some ways it is understandable. Sharing data is hard. People are busy.
But isn’t sharing data part of a scientist’s job description?


Sharing Data Angst

Filed under: motivations, opendata — Heather Piwowar @ 9:50 am

A Nature editorial on data sharing.

A fair share
Nature 444, 653-654 (7 December 2006) | doi:10.1038/444653b; Published online 6 December 2006

Many of the points made for psychology are also relevant in biomedicine. For example, “Their discipline is ‘softer’ than some others: rarely do data on issues such as playground bullying or the usefulness of psychotherapy reveal really clear-cut answers.” A lack of clear-cut answers certainly sounds familiar to those working in cancer genetics.

The article discusses the sorry state of data sharing, theoretical reasons why it might be that way, and a few potential solutions. Good stuff, but the editorial failed to dig into their promising first sentence.

“The concept of sharing primary data is generating unnecessary angst in the psychology community.”

Does the concept of sharing data generate unnecessary angst? Does it actually generate angst, or is it mostly laziness or selfishness or fear? If angst, is the angst indeed unwarranted? To what extent does sharing data in fact lead to additional stresses for authors?

I’d love to see research into the reasons why scientists do not share data, and whether their reasons are upheld by events. This knowledge would allow us to address the underlying issues deterring authors from making their data available, which is bound to be more effective for long-term goals than simply relying on requirements from funding agencies and journals.
