Research Remix

February 15, 2011

Respecting the survey-takers

Filed under: MyResearch — Heather Piwowar @ 1:54 pm

We all receive waaaay too many emails and invitations to poorly-designed and poorly-executed questionnaires.  This isn’t acceptable.  Those who do research studies using questionnaires have a responsibility to those who fund, take, and learn from our surveys to do a good job.

In the interest of open research, here are a few of the techniques I’m using to respect the time and attention of my survey-takers in a large-scale online questionnaire. What do you do? Conversation welcome.

Background: Each month for the next three years I’ll be inviting recent corresponding authors from several dozen journals to take an online questionnaire as part of a research study on the impact of journal data archiving policies.

I started by sending emails to the journal editors, letting them know that I’d be inviting their corresponding authors to take part in this survey. To my happy surprise, all responses were positive: One editor asked me to clarify in my invitation that the study was not endorsed by the journal, another asked for more details, and many thanked me for letting them know and said they looked forward to hearing the results.

To create my contact list, I gather corresponding author email addresses from ISI Web of Science. Turns out it sometimes takes ISI 8 weeks to pull in new journal issues, but the data export is relatively fast and clean.

Using some custom python code, I filter the email addresses to those for just the desired month(s) and eliminate those to which I’ve already sent the survey invitation, to reduce survey fatigue. I send one initial invitation email and one reminder email about a week later. Unfortunately, the reminder email is sent to many people who’ve already taken the survey… a necessary byproduct of anonymous surveys. The initial email includes an option to unsubscribe. Less than 1% of email recipients have unsubscribed. Those who unsubscribe are automatically eliminated from the reminder email.
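The filtering step above can be sketched roughly as follows. This is a minimal illustration, not my actual script: the export format, field names (`EM` for email, `month`), and file layout are all assumptions.

```python
import csv

def build_invite_list(export_path, target_months, already_invited, unsubscribed):
    """Filter a Web of Science export down to fresh invitees for the target month(s).

    Field names here (EM for email, month) are illustrative; adjust to the
    actual export format. Prior invitees and unsubscribers are excluded to
    reduce survey fatigue and honor opt-outs.
    """
    invites = []
    seen = set()
    with open(export_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            email = row.get("EM", "").strip().lower()
            if not email or email in seen:
                continue  # skip blanks and duplicates within the export
            if row.get("month") not in target_months:
                continue  # only the desired month(s)
            if email in already_invited or email in unsubscribed:
                continue
            seen.add(email)
            invites.append(email)
    return invites
```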

I’m keeping track of how long it takes respondents to complete the questionnaire. Pilot tests suggested 7-10 minutes, so that is what my invitation says, but it is now looking more like 7-14 minutes, so I’ll rephrase my invitation and FAQs to be less misleading.

I’ve created a study website that includes information on the study and has an option to subscribe to updates. This has not been heavily viewed, but several people have subscribed.

I plan to compare publicly-available information for the whole corresponding author population with responses to demographic questions in my sample to understand who the sample population is and isn’t.  Because the response rate is pretty low for this sort of online survey (so far about 15%), this step is really important… just knowing the demographics of the respondents without knowing how they compare to the larger sample can make for misleading generalizations.
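The comparison described above can be sketched as a simple proportion gap per demographic category. This is a hypothetical illustration, not my analysis code, and the categories are made up:

```python
from collections import Counter

def demographic_gap(population, respondents):
    """For each category in the full population, report how over- or
    under-represented it is among respondents (respondent share minus
    population share). Categories here are illustrative."""
    pop = Counter(population)
    res = Counter(respondents)
    n_pop, n_res = sum(pop.values()), sum(res.values())
    return {cat: res.get(cat, 0) / n_res - pop[cat] / n_pop for cat in pop}
```

A large positive or negative gap for a category flags a subgroup whose views may be over- or under-weighted in the results.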

Other than carefully designing the questionnaire to be as short, straightforward, and on-topic as possible, those have been my main steps for respecting the time of my research subjects. Anyone have other tips and best practices?

Corresponding authors as research subjects

Filed under: MyResearch — Heather Piwowar @ 1:53 pm

I sent 1621 unsolicited emails earlier this week. I have mixed feelings about this.

On one hand: what right do we have to clutter up the inboxes of people we don’t know? Worse yet, to ask them for 10 minutes of their already-oversubscribed time?

The emails were invitations to an online questionnaire. As such, they don’t fall under most definitions of “spam,” since they do not advertise or have commercial intent. But they are a bother, and we all receive too many emails already, thanks.

On the other hand, the emails were sent to corresponding authors. These authors supplied an email address as part of the publishing process, as per our current academic norms, agreeing to be corresponded with.

A solicitation to an online questionnaire is not a very satisfying correspondence. But Evidence-based Science Policy only works if we have Evidence. We need to know what people — authors — think and feel and do. To the extent that we can figure this out without bothering people, let’s do that. But it is pretty tricky to know what people are worried about just by following their public research artifacts.

So far, the response rate for my study is about 15%, normal for this audience. I’ve received one “this is the 8th survey invitation this year” complaint email response, one “hey, you have a cool last name” email response, and a whole bunch of great data.

We owe it to corresponding authors to only do the surveys we need to do, to do them carefully and to design them such that the results make a difference. Worth saying twice: We owe it to corresponding authors to only do the surveys we need to do, to do them carefully and to design them such that the results make a difference. (I cover some details of my approach to this in a following post, to facilitate exchange of ideas.)   And we need to work toward more sustainable, scalable models for gathering this evidence. We all get too many evaluation surveys these days. That said, until we have a better solution, I think some (carefully designed) unsolicited surveys are better than no surveys.

Here’s hoping that corresponding authors continue to benefit the research community by occasionally serving as research subjects as well as research producers.

September 11, 2008

PSB Open Science workshop talk abstract

Filed under: conferences, MyResearch, opendata, openscience, sharingdata — Heather Piwowar @ 10:39 am

The program for the Open Science workshop at PSB 2009 has been posted.  Great diversity of topics… I’m really looking forward to it.

My talk abstract is below… comments and suggestions are welcome!

Measuring the adoption of Open Science

Why measure the adoption of Open Science?

As we seek to embrace and encourage participation in open science, understanding patterns of adoption will allow us to make informed decisions about tools, policies, and best practices. Measuring adoption over time will allow us to note progress and identify opportunities to learn and improve. It is also just plain interesting to see where we are, where we aren’t, and where we might go!

What can we measure?

Many attributes of open science can be studied, including open access publications, open source code, open protocols, open proposals, open peer-review, open notebook science, open preprints, open licenses, open data, and the publishing of negative results. This presentation will focus on measuring the prevalence with which investigators share their research datasets.

What measurements have been done? How? What have we learned?

Various methods have been used to assess adoption of open science: reviews of policies and mandates, case studies of experiences, surveys of investigators, and analyses of demonstrated data sharing behavior. We’ll briefly summarize key results.

Future research?

The presentation will conclude by highlighting future research areas for enhancing and applying our understanding of open data adoption.

March 25, 2008

Identifying Data Sharing in Biomedical Literature

Filed under: MyResearch — Heather Piwowar @ 12:04 pm

I emailed AMIA again to ask for clarification on their preprint policy, and quickly received this encouraging response: “Preposting is fine so long as the other sites don’t formally publish the work.” Great news, thanks AMIA.

Note: this brings my blog up-to-date on the research I’ve been doing, with the exception of one paper under review at PLoS Medicine. That one is a complex collaboration. Despite some attempts there isn’t consensus about making it open at this point.

Here is the paper we submitted to the AMIA 2008 Annual Symposium. AMIA=American Medical Informatics Association. Nature Precedings link to appear once it has been posted.

Identifying Data Sharing in Biomedical Literature
Heather A. Piwowar and Wendy W. Chapman

Many policies and projects now encourage investigators to share their raw research data with other scientists. Unfortunately, it is difficult to measure the effectiveness of these initiatives because data can be shared in such a variety of mechanisms and locations. We propose a novel approach to finding shared datasets: using natural language processing (NLP) techniques to identify declarations of dataset sharing within the full text of primary research articles. Using regular expression patterns and machine learning algorithms on open access biomedical literature, our system was able to identify 61% of articles with shared datasets with 80% precision. A simpler version of our classifier achieved higher recall (86%), though lower precision (49%). We believe our results demonstrate the feasibility of this approach and hope to inspire further study of dataset retrieval techniques and policy evaluation.

Full text


My inspiration for this work was the idea of a Data Reuse Registry and associated research. As discussed, a DRR would benefit from automatic identification of data reuse in the biomedical literature. Unfortunately, automatic identification of data reuse is a tough place to start my NLP (natural language processing) journey because I haven’t found any large, pre-existing gold standards of data reuse to use for evaluating such a system (this list of GEO “third party” data reuse papers is a start).

Identifying data sharing is easier: there are available gold standards via database links, and authors tend to use more uniform language in describing sharing than reuse. Automatically detecting data sharing could be useful to my research in other ways as well, down the road, as I look towards further sharing policy evaluation.

This data sharing identification system used very simple NLP techniques. Hope to (and will probably need to) dig into some more complex approaches as I tackle data reuse identification.
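To give a flavor of the simple techniques involved, here is a sketch of a regex-based detector for data-sharing declarations. The patterns below are illustrative only; they are not the actual features from the paper.

```python
import re

# Illustrative patterns for sharing declarations; not the paper's feature set.
SHARING_PATTERNS = [
    re.compile(r"\bdeposited in (?:the )?(?:GEO|Gene Expression Omnibus|ArrayExpress)\b", re.I),
    re.compile(r"\baccession (?:number|no\.?)\s*GSE\d+", re.I),
    re.compile(r"\b(?:data|datasets?) (?:are|is|were) (?:publicly )?available\b", re.I),
]

def mentions_data_sharing(full_text):
    """Flag an article whose full text matches any sharing-declaration pattern."""
    return any(p.search(full_text) for p in SHARING_PATTERNS)
```

A rule-based pass like this maps to the high-recall/lower-precision end of the results above; a machine learning layer over such features trades some recall for precision.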

If anyone knows of other resources that list specific instances of data reuse, I’d love to hear about them!

March 24, 2008

Envisioning a Biomedical Data Reuse Registry

Filed under: data reuse, Data Reuse Registry, MyResearch — Heather Piwowar @ 9:48 am
An idea I’ve been thinking about recently:

Envisioning a Biomedical Data Reuse Registry

Heather A. Piwowar and Wendy W. Chapman

Repurposing research data holds many benefits for the advancement of biomedicine, yet is very difficult to measure and evaluate. We propose a data reuse registry to maintain links between primary research datasets and studies that reuse this data. Such a resource could help recognize investigators whose work is reused, illuminate aspects of reusability, and evaluate policies designed to encourage data sharing and reuse.
The full benefits of data sharing will only be realized when we can incent investigators to share their data[1] and quantify the value created by data reuse.[2] Current practices for recognizing the provenance of reused data include an acknowledgment, a listing of accession numbers, a database search strategy, and sometimes a citation within the article. These mechanisms make it very difficult to identify and tabulate reuse, and thus to reward and encourage data sharing. We propose a solution: a Data Reuse Registry.
What is a data reuse registry?
We define a Data Reuse Registry (DRR) as a database with links between biomedical research studies and the datasets used within the studies. The reuse articles may be represented as PubMed IDs, and the datasets as accession numbers within established databases or the PubMed IDs of the studies that originated the data.
How would the DRR be populated?
We anticipate several mechanisms for populating the DRR:
* Voluntary submissions
* Automatic detection from the literature[3]
* Prospective submission of reuse plans, followed by automatic tracking
We envision collecting prospective citations in two steps. First, prior to publication, investigators visit a web page and list datasets and accession numbers reused in their research, thereby creating a DRR entry record in the DRR database. In return, the reusing investigators will be given some best-practices free-text language that they can insert into their acknowledgments section, a list of references to the papers that originated the data, some value-add information such as links to other studies that previously reused this data, and a reference to a new DRR entry record. When authors cite this DRR within their reuse study as part of their data use acknowledgement, the second step of DRR data input can be done automatically: citations in the published literature will be mined periodically to discover citations to DRR entries. These citations will be combined with the information provided when the entry was created to explicitly link published papers with the datasets they reused. The result will be searchable by anyone wishing to understand the reuse impact made by an investigator, institution, or database.
How would the DRR be used?
Information from the DRR could be used to recognize investigators whose work is reused, illuminate aspects of reusability, examine the variety of purposes for which a given dataset is reused, and evaluate policies designed to encourage data sharing and reuse.
While the DRR may not be a comprehensive solution, we believe it represents a starting place for finding solutions to the important problem of evaluating, encouraging, and rewarding data sharing and reuse.
HP is supported by NLM training grant 5T15-LM007059-19 and WC is funded through NLM grant 1 R01LM009427-01.
1. Compete, collaborate, compel. Nat Genet. 2007;39(8).
2. Ball CA, Sherlock G, Brazma A. Funding high-throughput data sharing. Nat Biotechnol. 2004 Sep;22(9):1179-83.
3. Piwowar HA, Chapman WW. Identifying data sharing in the biomedical literature. Submitted to the AMIA Annual Symposium 2008.
[This DRR summary has been submitted as a poster description to AMIA 2008]
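As a concrete sketch of the entry records described in the poster summary above: each DRR entry would link a citable identifier to the reusing paper and the datasets it drew on. The field names here are hypothetical, not a published DRR schema.

```python
from dataclasses import dataclass, field

@dataclass
class DRREntry:
    """One registry record linking a reuse study to the datasets it reused.

    Field names are illustrative only, not a published DRR schema.
    """
    entry_id: str                 # citable DRR identifier given to the authors
    reuse_pmid: str = ""          # PubMed ID, filled in once the paper is mined
    accession_numbers: list = field(default_factory=list)  # e.g. GEO GSE ids
    source_pmids: list = field(default_factory=list)  # papers that originated the data

def reuse_counts(entries):
    """Tabulate how often each dataset accession appears across reuse studies,
    i.e. the reuse impact searchable by investigator, institution, or database."""
    counts = {}
    for e in entries:
        for acc in e.accession_numbers:
            counts[acc] = counts.get(acc, 0) + 1
    return counts
```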

March 20, 2008

A review of journal policies for sharing research data

Filed under: MyResearch — Heather Piwowar @ 1:00 pm

Inspired by the reception to this blog post, I systematically reviewed journal data sharing policies with gene expression microarray data as a use case. The brief and extended abstracts are below. Supplementary information is here. Full paper to be written prior to presentation in Toronto this June. I’m planning to finish writing the paper in the open, so I’d love to hear your comments.

ETA: Now up at Nature Precedings. (ps mom: ETA = edited to add)

Piwowar HA, Chapman WW (2008) A review of journal policies for sharing research data. Accepted to ELPUB2008 (International Conference on Electronic Publishing): Open Scholarship: Authority, Community and Sustainability in the Age of Web 2.0

Background: Sharing data is a tenet of science, yet commonplace in only a few subdisciplines. Recognizing that a data sharing culture is unlikely to be achieved without policy guidance, some funders and journals have begun to request and require that investigators share their primary datasets with other researchers. The purpose of this study is to understand the current state of data sharing policies within journals, the features of journals which are associated with the strength of their data sharing policies, and whether the strength of data sharing policies impact the observed prevalence of data sharing.
Methods: We investigated these relationships with respect to gene expression microarray data in the journals that most often publish studies about this type of data. We measured data sharing prevalence as the proportion of papers with submission links from NCBI’s Gene Expression Omnibus (GEO) database.
We conducted univariate and linear multivariate regressions to understand the relationship between the strength of data sharing policy and journal impact factor, journal subdiscipline, journal publisher (academic societies vs. commercial), and publishing model (open vs. closed access).
Results: Of the 70 journal policies, 18 (26%) made no mention of sharing publication-related data within their Instructions to Authors statements. Of the 42 (60%) policies with a data sharing policy applicable to microarrays, we classified 18 (26% of 70) as weak and 24 (34% of 70) as strong.
Existence of a data sharing policy was associated with the type of journal publisher: half of all commercial publishers had a policy compared to 82% of journals published by an academic society. All four of the open-access journals had a data sharing policy. Policy strength was associated with impact factor: the journals with no data sharing policy, a weak policy, and a strong policy had respective median impact factors of 3.6, 4.5, and 6.0. Policy strength was positively associated with measured data sharing submission into the GEO database: the journals with no data sharing policy, a weak policy, and a strong policy had median data sharing prevalence of 11%, 19%, and 29% respectively.
Conclusion: This review and analysis begins to quantify the relationship between journal policies and data sharing outcomes and thereby contributes to assessing the incentives and initiatives designed to facilitate widespread, responsible, effective data sharing.
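For illustration, the univariate piece of the analysis above can be sketched as an ordinary least-squares fit of impact factor on policy strength. The coding (0 = none, 1 = weak, 2 = strong) follows the abstract, but the numbers below are made up, not the study’s data.

```python
def ols_slope(x, y):
    """Univariate least-squares slope and intercept (pure-Python sketch)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return slope, my - slope * mx

# Hypothetical data: policy strength 0 = none, 1 = weak, 2 = strong,
# paired with journal impact factors (made-up numbers, not the study's).
strength = [0, 0, 1, 1, 2, 2]
impact = [3.2, 4.0, 4.3, 4.7, 5.8, 6.2]
slope, intercept = ols_slope(strength, impact)
```

A positive slope here corresponds to the reported pattern of higher median impact factors at each step up in policy strength; the full analysis also used multivariate regression over publisher type and access model.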

Extended abstract:

