Research Remix

October 8, 2010

A look into the revised NSF data sharing policy

Filed under: Notes, Policies — Tags: , , , , — Heather Piwowar @ 6:09 am

Curious about details on the NSF’s revised policy on Dissemination and Sharing of Research Results?  I’ve been digging into the documents released by the NSF and its Directorates.  Here are my notes, in case they are useful for someone:  excerpts from the docs, grouped by topic.


  • [SES] Division of Social and Economic Sciences
  • [EAR] Division of Earth Sciences
  • [ENG] Engineering Directorate
  • [OCE] Division of Ocean Sciences
  • [IODP] Integrated Ocean Drilling Program
  • [MPS]  Mathematical and Physical Sciences Directorate

What is considered “data”/research results covered by this policy?

  • “may include, but is not limited to: data, publications, samples, physical collections, software and models”  FAQ
  • “Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants”  AAG
  • “Investigators and grantees are encouraged to share software and inventions created under the grant or otherwise make them or their products widely available and usable.”  AAG
  • “Qualitative resources.   If it is appropriate for other researchers to have access to them, the investigators should specify a time at which they will be made generally available, in an appropriate form and at a reasonable cost.”  SES
  • “In addition, complete information on how an experiment was conducted and any unusual stimulus materials should be made available, so that failures to replicate will not turn out to depend on one scientist’s incomplete understanding of another’s procedure.”  SES
  • “Mathematical and computer models.  Investigators should plan to make these models available to others wanting to apply them to other data sets or experimental situations. In some cases, the descriptions in published articles are sufficient; more often, it will be necessary for investigators to prepare fully documented and robust versions of these models, typically on disk, so that they can be provided to others.”  SES
  • “Preservation of all data, samples, physical collections and other supporting materials needed for long- term earth science research and education”  EAR
  • “Experimental Research: In experimental research, individuals, be they people, animals, or objects, are subjected to preplanned conditions and their responses tabulated in some fashion. Investigators should plan to make these tabulated data available to other investigators requesting them” SES
  • “Data archives must include easily accessible information about the data holdings, including quality assessments”  EAR
  • “Archiving of both physical and digital data must be addressed in the plan”  ENG
  • “Under the following definitions, all data must be included in the DMP that result fully or in part from activities supported by ENG.”  ENG
  • “Research data are formally defined as “the recorded factual material commonly accepted in the scientific community as necessary to validate research findings” by the U.S. Office of Management and Budget (1999).”  ENG
  • “The basic level of digital data to be archived and made available includes (1) analyzed data and (2) the metadata that define how these data were generated…. Analyzed data are (but are not restricted to) digital information that would be published, including digital images, published tables, and tables of the numbers used for making published graphs.  Necessary metadata are (but are not restricted to) descriptions or suitable citations of experiments, apparatuses, raw materials, computational calculation input conditions”  ENG
  • “These are data that are or that should be published in theses, dissertations, refereed journal articles, supplemental data attachments for manuscripts, books and book chapters, and other print or electronic publication formats.”  ENG
  • “What data are not included at the basic level? The Office of Management and Budget statement (1999) specifies that this definition does not include “preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues.” Raw data fall into this category as “preliminary analyses.””  ENG
  • “Describe the types of data and products that will be generated in the research, such as images of astronomical objects, spectra, data tables, time series, theoretical formalisms, computational strategies, software, and curriculum materials.”  MPS-AST
  • “Particular attention should be paid to data sets that are products of well-defined surveys.”  MPS

Where?  Public repository?  other?

  • “There is no public database for my type of data. What can I do to provide data access? Contact the cognizant NSF Program Officer for assistance in this situation.”  FAQ
  • “Quantitative Social and Economic Data Sets.  This may be the Inter-University Consortium for Political and Social Research (ICPSR) at the University of Michigan, but other public archives are also available.”  SES
  • “institutional archives that are standard for a particular discipline (e.g. IRIS for seismological data, UNAVCO for GPS data)” EAR
  • “Experimental research. SES will work with the research community to identify and resolve problems with developing and establishing centralized archives.”  SES
  • “to other investigators requesting them” SES
  • “Where no data or sample repository exists for the collected data or samples, metadata must be prepared and made available. The Principal Investigator (PI) is required to address alternative strategies for complying with the general philosophy of sharing research products and data as described above”  OCE
  • “The PI is invited to discuss this issue with NSF Program Officers in advance of submitting proposals.”  OCE
  • “for most ocean data there are designated National Data Centers where data must be deposited… Appendix I. National Data Centers”  OCE
  • “For some special programs and focused community initiatives, alternative database activities exist… Principal Investigators are encouraged to submit their data to these databases when appropriate. Since such databases may not provide long-term archival capabilities, such submission will satisfy the Principal Investigator’s obligations only if the database submits the data to one of the National Data Centers….  Appendix III: Other Database Activities…. Appendix IV. Sample Repositories”  OCE
  • Experimental Research:  “at a minimum along the lines suggested by Geoffrey Loftus in his editorial in the January, 1993, issue of Memory and Cognition”  SES  [Loftus, G.R. (1993). Editorial Comment. Memory & Cognition, 21(1), 1-3.  pdf]
  • “Describe your plans, if any, for providing such general access to data, including websites maintained by your research group, and direct contributions to public databases (e.g., the Protein Data Bank, Cambridge Crystallographic Data Centre, Inorganic Crystal Structure Database in Karlsruhe, Zeolite Structure Database).”  MPS
  • “Finally, note as well any anticipated inclusion of your data into databases that mine the published literature (e.g., PubChem, NIST Chemistry WebBook).”

Who needs access:  researchers, educators, public?

  • “The National Science Foundation is committed to the principle that the various forms of data collected with public funds belong in the public domain.”  SES
  • But it is a bit confused.  Even within one paragraph:  “The National Science Foundation is committed to the principle that the various forms of data collected with public funds belong in the public domain. Therefore, the Division of Social and Economic Sciences has formulated a policy to facilitate the process of making data that has been collected with NSF support available to other researchers.”  SES
  • “for research and education” EAR
  • “Data inventories should be published or entered into a public database periodically and when there is a significant change in type, location or frequency of such observations.” EAR
  • “Policies for public access and sharing should be described”  ENG
  • “samples and data to research scientists (Science Party members and postmoratorium researchers), educators, museums, and outreach institutions” IODP
  • “interested parties” MPS


  • “The expectation is that all data will be made available after a reasonable length of time.” FAQ
  • “One standard of timeliness is to make the data or samples accessible immediately after publication.” FAQ
  • “However, what constitutes a reasonable length of time will be determined by the community of interest through the process of peer review and program management” FAQ
  • “Quantitative Social and Economic Data Sets: For appropriate data sets, researchers should be prepared to place their data in fully cleaned and documented form in a data archive or library within one year after the expiration of an award.”  SES
  • “For those programs in which selected principle investigators have initial periods of exclusive data use, data should be made openly available as soon as possible, but no later than two (2) years after the data were collected. This period may be extended under exceptional circumstances, but only by agreement between the Principal Investigator and the National Science Foundation. For continuing observations or for long-term (multi-year) projects, data are to be made public annually.”  EAR
  • “Publication delay policies (if applicable) must be clearly stated.  Investigators are expected to submit significant findings for publication quickly that are consistent with the publication delay obligations of key partners, such as industrial members of a research center.”  ENG
  • “Public release of data should be at the earliest reasonable time. A reasonable standard of timeliness is to make the data accessible immediately after publication, where submission for publication is also expected to be timely.”  ENG
  • “Principal Investigators are required to submit all environmental data collected to the designated National Data Centers (Appendix I) as soon as possible, but no later than two (2) years after the data are collected. Inventories (metadata) of all marine environmental data collected should be submitted to the designated National Data Centers within sixty (60) days after the observational period/cruise. For continuing observations, data inventories should be submitted periodically if there is a significant change in location, type or frequency of such observations.”  OCE
  • “Also describe your practice or policies regarding the release of data for access, for example whether data are posted before or after formal publication.”  MPS-AST

Data retention and preservation

  • “Minimum data retention of research data is three years after conclusion of the award or three years after public release, whichever is later.”  ENG
  • “Exceptions requiring longer retention periods may occur when data supports patents, when questions arise from inquiries or investigations with respect to research, or when a student is involved, requiring data to be retained a timely period after the degree is awarded.”  ENG
  • “Research data that support patents should be retained for the entire term of the patent”  ENG
  • “Longer retention periods may also be necessary when data represents a large collection that is widely useful to the research community. For example, special circumstances arise from the collection and analysis of large, longitudinal data sets that may require retention for more than three years. Project data-retention and data-sharing policies should account for these needs”  ENG
  • “If maintenance of a web site or database is the direct responsibility of your group, provide information about the period of time the web site or data base is expected to be maintained.”  MPS-AST
  • “Describe how data will be archived and how preservation of access will be handled. For example, will hardcopy notebooks, instrument outputs, and physical samples be stored in a location where there are safeguards against fire or water damage? Is there a plan to transfer digitized information to new storage media or devices as technological standards or practices change? Will there be an easily accessible index that documents where all archived data are stored and how they can be accessed?”  MPS-CHE
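The ENG minimum retention rule quoted above ("three years after conclusion of the award or three years after public release, whichever is later") is simple date arithmetic. Here is a minimal sketch of that one rule, in case it is useful when drafting a plan. The function names and the example dates are hypothetical, and it does not cover the exceptions ENG lists (patents, investigations, student degrees, large longitudinal collections).

```python
from datetime import date
from typing import Optional


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap target year.
        return d.replace(year=d.year + years, day=28)


def minimum_retention_end(
    award_end: date,
    public_release: Optional[date] = None,
    years: int = 3,
) -> date:
    """Earliest date the minimum retention period could end.

    Assumption: implements only the quoted ENG minimum, i.e. `years`
    after the award ends or after public release, whichever is later.
    """
    candidates = [add_years(award_end, years)]
    if public_release is not None:
        candidates.append(add_years(public_release, years))
    return max(candidates)


# Hypothetical award ending 2013-08-31 with data released 2014-02-15.
print(minimum_retention_end(date(2013, 8, 31), date(2014, 2, 15)))
```

In this hypothetical case the release date is the later of the two, so it drives the result. A real plan would still need to check the exceptions by hand.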

Program-specific additional requirements

  • Several documents note that some programs, institutions, and communities may have more stringent requirements.  A few (e.g., OCE) go into specifics.

Reporting, review, and consequences

  • “The Data Management Plan will be reviewed as an integral part of the proposal, coming under Intellectual Merit or Broader Impacts or both, as appropriate for the scientific community of relevance.”  GPG, MPS
  • “MPS Divisions will rely heavily on the merit review process in this initial phase to determine those types of plan that best serve each community and update the information accordingly.”  MPS
  • “NSF program management will implement these policies for dissemination and sharing of research results, in ways appropriate to field and circumstances, through the proposal review process; through award negotiations and conditions; and through appropriate support and incentives for data cleanup, documentation, dissemination, storage and the like.”  AAG
  • “Within the proposal review process, compliance with these data guidelines will be considered in the Program Officer’s overall evaluation of a Principal Investigator’s record of prior support.” EAR
  • “Efficiency and effectiveness of the DMP will be considered by NSF and its reviewers during the proposal review process.”  ENG
  • “After an award is made, data management will be monitored primarily through the normal Annual and Final Report process and through evaluation of subsequent proposals.  Subsequent proposals. Data management must be reported in subsequent proposals by the PI and Co-PIs under “Results of prior NSF support.””  ENG
  • “Strategies and eventual compliance with this policy will be evaluated not only by proposal peer review but also through project monitoring by NSF program officers, by division and directorate Committees of Visitors, and by the National Science Board.”  ENG
  • “Plans for the handling of data and other products will be considered in the review process.”  OCE
  • “Annual reports, required for all projects, should address progress on data and research product sharing. The Division of Ocean Sciences requires that final reports document compliance or explain why it did not occur. In cases where the final report is due before the required data or sample submission, the PI must report submission of metadata and plans for final submission. The PI should notify the cognizant Program Officer by e-mail after final data and/or sample submission.”  OCE
  • “Within the proposal review process, compliance with these data guidelines will be considered in the Program Officer’s overall evaluation of a Principal Investigator’s record of prior support.”  OCE
  • “Many of the proposals to DMS that require significant data management plans will be interdisciplinary submissions… DMS expects principal investigators to address the customary data practices of partner disciplines in their proposals’ data management plans, and reviewers are likely to be asked to comment on the suitability of those plans from the perspectives of the relevant disciplines.”  MPS-DMS


  • All documents recognize the special needs of sensitive data (e.g., human subjects data) and the need to protect intellectual property rights.
  • “A valid Data Management Plan may include only the statement that no detailed plan is needed, as long as the statement is accompanied by a clear justification.”  GPG
  • “It is acceptable to state in the Data Management Plan that the project is not anticipated to generate data or samples that require management and/or sharing.  PIs should note that the statement will be subject to peer review.”  FAQ
  • “legal rights to intellectual property [..] Such incentives do not, however, reduce the responsibility that investigators and organizations have as members of the scientific and engineering community, to make results, data and collections available to other researchers.”  AAG
  • “General adjustments and, where essential, exceptions to this sharing expectation may be specified by the funding NSF Program or Division/Office for a particular field or discipline to safeguard the rights of individuals and subjects, the validity of results, or the integrity of collections or to accommodate the legitimate interest of investigators.”  AAG
  • “For example, human subjects protection requires removing identifiers, which may be prohibitively expensive or render the data meaningless in research that relies heavily on extensive in-depth interviews.”  SES
  • “These guidelines are considered to be a binding condition on all EAR-supported projects” EAR
  • not peer review?  “Exceptions to these data guidelines require agreement between the Principal Investigator and the NSF Program Officer.”  EAR
  • “Some proposals may involve proprietary or other restricted data. For example, projects having proprietary information that will eventually lead to commercialization, such as [..].  In addition, membership agreements, contracts, involvement with other agencies, and similar obligations may place some restrictions on data sharing.  The proposal’s DMP would address the distinction between released and restricted data and how they would be managed.”  ENG
  • “Exceptions to the basic data-management policy should be discussed with the cognizant program officer before submission of such proposals.” ENG
  • “if you plan to provide data and images on your website, will the website contain disclaimers, or conditions regarding the use of the data in other publications or products? If the data or products (e.g., images) are copyrighted (by a journal, for example), how will this be noted on the website?”  MPS-AST


  • “Should the budget and its justification specifically address the costs of implementing the Data Management Plan?  As long as the costs are allowable in accordance with the applicable cost principles, and necessary to implement the Data Management Plan, such costs may be included (typically on Line G2) of the proposal budget, and justified in the budget justification.”  FAQ
  • “It is NSF’s strong expectation that investigators will share with other researchers, at no more than incremental cost”  FAQ.  Note that “no more than incremental cost” means that investigators can charge researchers to recover costs.
  • “These plans should cover how and where these materials will be stored at reasonable cost, and how access will be provided to other researchers, generally at their cost.”  SES

What to put in the data management plan?

  • “This supplement may include types of research output expected to be created, standards to be used, policies for sharing, provisions for reuse, and plans for preservation.”  GPG with emphasis added
  • “The DMP should clearly articulate how “sharing of primary data” is to be implemented…. The DMP should describe the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project. It should then describe the expected types of data to be retained…. The DMP should describe the period of data retention… The DMP should describe the specific data formats, media, and dissemination approaches that will be used to make data available to others, including any metadata.”  ENG
  • “It should outline the rights and obligations of all parties as to their roles and responsibilities in the management and retention of research data. It must also consider changes to roles and responsibilities that will occur should a principal investigator or co-PI leave the institution.”  ENG
  • “Any costs should be explained in the Budget Justification pages.”  ENG
  • “requires that proposal Project Descriptions outline plans for preservation, documentation, and sharing of data, samples, physical collections, curriculum materials and other related research and education products”  OCE
  • “DMR PIs should include in their Data Management Plan those aspects of data retention and sharing that would allow them to respond to a question about a published result.”  MPS-DMR
  • “Due to the diverse communities supported by DMR, the Division is not in a position to recommend a Division-specific single data sharing and archiving approach.”  MPS-DMR
  • “The Physics Division is not in a position to recommend a Division-specific single data sharing and archiving approach applicable to the disparate communities supported through the Division. The Division will rely on the process of peer review to allow each of these communities to identify best practices.”  MPS-PHY

Other notes:

  • I didn’t go into detail extracting info from the IODP doc.  Useful, clear, lengthy doc!
  • Looking forward to hearing from the Biology Directorate.  Others?
  • “Goal: Provide for clear, effective, and transparent implementation of NSF policy for data management and dissemination”  ENG.  Awesome.
  • “Where data are stored in unusual or not generally accessible formats, explain how the data may be converted to a more accessible format or otherwise made available to interested parties. In general, solutions and remedies should be provided.”  MPS
  • “Ensure that dissemination of the scientific findings of all IODP drilling projects/expeditions are planned so as to gain maximum scientific and public exposure”  IODP.
  • A lot of emphasis on “other researchers.”  Obligations to share data with commercial researchers are not clear, except where the language emphasizes “public”
  • Overall, I’m pretty impressed by all of this.  I was hesitant about the new NSF policy based on preliminary info:  it felt like too small a step.  But the Directorates have stepped up and given it meat and a backbone.  Nice work.  NIH, your turn again.

Reference docs on current policy

  • [SES] Division of Social and Economic Sciences
  • [EAR] Division of Earth Sciences
  • [ENG] Engineering Directorate
  • [OCE] Division of Ocean Sciences
  • [IODP] Integrated Ocean Drilling Program

Related documents

  • Committee on Strategy and Budget Task Force on Data Policies Charge and timeline (Draft final report expected first half of 2011)
  • NSF Press Release 10-077:  Scientists Seeking NSF Funding Will Soon Be Required to Submit Data Management Plans (May 10 2010)

ETA:  added MPS guidelines

August 13, 2010

Supplementary materials is a stopgap for data archiving

Filed under: Policies — Tags: , , , , — Heather Piwowar @ 11:23 am

The Journal of Neuroscience has issued a new policy on supplementary materials:

Beginning November 1, 2010, The Journal of Neuroscience will no longer allow authors to include supplemental material when they submit new manuscripts and will no longer host supplemental material on its web site for those articles

I think this will benefit the reporting of methods and exploratory analyses. I am thrilled that citations will no longer be lost in supplementary materials, assuming the additional citations make it into the main references list rather than being omitted.

But what about data?

A journal’s supplementary material section is not a great place for data. Limitations include:

  • not good for data formatting and reporting standards
  • not good for discoverability
  • not good for truly permanent storage
  • not good for machine retrievability
  • not good for journals sticking to core competencies
  • not good for journal planning, efficiency
  • not good for free access (in subscription journals)
  • not good for open access (or at least conveying openness clearly)
  • not good for lots of other things that I don’t know about and publishers don’t know about but repository professionals do know about

Most people would agree that well-designed, well-supported data repositories are the best place for data. The problem is, such repositories are few and far between. All is well and good if an experiment is in a discipline or produces a datatype for which a best-practice repository exists: the data should go there. All may be good if the authors are in an institution with an institutional repository that is well-equipped to handle scientific data, though these are uncommon. Otherwise where can investigators put their datasets?

Supplementary information is not a perfect home. It is not even a very good one, but it is better than hosting data on a lab website or sharing it by email on demand. It is a useful stopgap while more discipline-based repositories and institutional repositories rise to fill the need.

By removing this stopgap, in my opinion (and with the important caveat that I know very little about the journal or its discipline), The Journal of Neuroscience has sent three messages with its new policy:

1. They don’t consider archiving data to be their responsibility

This was already clear from their lackluster policy on data archiving:

Policy Concerning Availability of Materials
It is understood that by publishing a paper in The Journal of Neuroscience the author(s) agree to make freely available to colleagues in academic research any clones of cells, nucleic acids, antibodies, etc. that were used in the research reported and that are not available from commercial suppliers.

Policy on DNA Sequences
[…] By the time a paper is sent to press, sequences must be deposited in a database generally accessible to the neuroscience community; the sequence accession number should be provided. Exceptions to this policy may be considered on an individual basis.

That’s it. Compare this to the comprehensive policies of other journals, particularly their statements of motivation. For example, in Science:

After publication, all data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.

And in Stem Cells (similar in Cell):

Stem Cells supports the efforts of the National Academy of Sciences (NAS) to encourage the open sharing of publication-related data. Stem Cells adheres to the beliefs that authors should include in their publications the data, algorithms, or other information that is central or integral to the publication, or make it freely and readily accessible; use public repositories for data whenever possible; and make patented material available under a license for research use.

The Journal of Neuroscience has said that it wants to “maintain its leading position.” For what it is worth, evidence suggests that the highest impact journals have the strongest data sharing policies.

2. They don’t consider archiving data important

Based on the policy and the wording of its announcement, I was left with the impression that the Journal doesn’t consider data archiving important. In particular, stating that “supplemental material is inherently inessential” and “We should remember that neuroscience thrived for generations without any online supplemental material” belittles data sharing, given that much data is currently shared in supplementary materials for lack of a better place to put it.

The policy has left investigators with fewer better-than-nothing places to share data. I hope the next journal that is tempted to eliminate supplementary material will consider these alternative approaches to address its problems while supporting data archiving:

  • Fix rather than eliminate supplemental material policies: clearly specify that supplemental info is not peer-reviewed, specify that suppl info is only for data (for example), remind reviewers and authors that suppl info is not for defensive material, etc.

    One example is the thoughtful response by Cell to its problems with supplemental material, a solution of defining what should and shouldn’t be included:

    “One of the first issues we confronted in thinking about structuring supplemental material was one of setting limits. Limits of course have both positives and negatives. On the plus side, it seems in the best interest of everyone in the scientific community that the concept of a ‘‘publishable story’’ be at least roughly defined. […] strict overall length limits struck us as somewhat arbitrary, and we instead focused on a more conceptual organization.”

  • Or, if you do indeed want to eliminate supplementary materials, recommend and in fact require that links to supplementary information elsewhere are either to established repositories or to resources archived through one of the many mechanisms for url permanence.
  • Or, engage with Dryad or another discipline-based repository to find a win-win solution
  • And please commit to participating with the community to find solutions, rather than vaguely suggesting, “It is conceivable that removing supplemental material from articles might motivate more scientific communities to create repositories for specific types of structured data, which are vastly superior to supplemental material as a mechanism for disseminating data.”

3. Change is needed

I completely agree with them here. Change is needed. I also applaud the Journal for taking a bold step, even if I disagree with its particulars. I think it will motivate, inspire, and induce change. Bring on the market disruption… although it is a real shame if we lose a bunch of (expensive) (irreplaceable) data (forever) in the process.

A follow-up post with references on supplementary material.

Other blogosphere commentary:

ETA: link to followup post

August 5, 2010

Sharing data makes our shoulders broader

Filed under: Uncategorized — Tags: , , , , — Heather Piwowar @ 11:27 pm

Here is a presentation intro I’ve used recently in talks about data sharing and reuse.  I post it as a reference to thank those who share their Flickr photos under Creative Commons licenses.
Feel free to reuse! (obviously with Flickr attributions intact and license choices respected)

If I have seen farther it is by standing on the shoulders of giants, said Isaac Newton and others before him.

While historians speculate that Isaac Newton was actually being sarcastic,

most of us would agree that science progresses by standing on the shoulders of those who came before. Or by kneeling on their backs.  Or clambering up their work any other way we can.

photo credit: jsmjr on Flickr

I suggest that when we share our research output, not only as published research descriptions, but also in the form of open datasets and methods, we are, in effect, making our shoulders broader.

It's Hot Outside

photo credit: camilleharrington on Flickr

All of a sudden, a lot more people can build on our work.


photo credit: rkuhnau on Flickr

Researchers can climb higher than otherwise possible,

photo credit: conform on Flickr

and jump up and down on our findings to make sure they are really stable.


photo credit: rkuhnau on Flickr

It allows contributions from places we may never have expected,

Monkey shoulder, and other fine single malts

photo credit: Zemlinki! on Flickr

and investigators can explore places they never could have on their own.


photo credit: Matthew and Tracie on Flickr

In short, our broad-shouldered research can make a contribution that far exceeds its original role.

Four Poles

photo credit: druclimb on Flickr

Edited 2010/08/09 to add concluding photo.

July 12, 2010

Recap of iEvoBio BoF on open science, data sharing & reuse, credit.

Filed under: conferences — Tags: , , , , — Heather Piwowar @ 10:52 am

The organizers of the recent iEvoBio meeting have asked for a summary of the Birds-of-a-Feather session.  I didn’t take notes, but here is a start:

About 10 people participated in the BoF that merged the three sign-up topics “open notebook science”, “data sharing and reuse”, and “data citations and a culture of credit.”

We had an energetic and wide-ranging discussion that included participation from people with diverse backgrounds, perspectives, and opinions.  A few of the topics included:

  • the variants of open notebook science and how they are supported (or undersupported, in some cases) by Open Wet Ware
  • the need to publish minimal data slices to prevent scooping, particularly for some datatypes, and how it can lead to misinterpretation of the data by others
  • whether data-producing authors should be contacted as collaborators for reuse
  • the fact that credit is essential, yet so is remembering that our jobs are fundamentally to contribute to scientific progress
  • support for a dynamic CV that includes up-to-date reuse metrics for articles, data, and nontraditional outputs.

If you were there, do you have things to add?  Respond in the comments or on twitter with #ievobioBof .

I learned a lot from the perspectives of others in the discussion:  looking forward to more conversations at future meetings.

August 15, 2008

Participation statement for SIG USE 2008

Filed under: conferences — Heather Piwowar @ 10:52 am

The theme of the SIG USE 8th Annual Research Symposium at ASIST 2008 is “Future Directions: Information Behavior in design & the making of relevant research.”

It will be held Saturday, October 25, 2008, from 1:00 pm to 6:00 pm at the Hyatt Regency, Columbus, OH.

The organizers of the symposium are asking (details) for a Participation Statement in advance… a one-pager that addresses the topic of “communicating the significance of information behavior research to designers of products, systems and services” through four provided questions. The statement is due today. I just finished my draft. I welcome comments on these thoughts, either before or after I submit it this afternoon :)

How does our research address the transformative relationship between people and information?

I study the sharing and reuse of scientific datasets. When scientists make their collected data openly available (often at personal cost in time and opportunity), they increase the information resources of the scientific community. Other scientists may then choose to examine, critique, aggregate, refine, and repurpose these datasets to achieve efficient scientific progress.

Understanding the behaviours of people and information in this complex system is crucial if we wish to develop and refine policies, tools, and practices for effective research.

What fundamental questions should we be looking at in our research?
  • What motivates scientific researchers to share data?
  • Are current incentives for voluntary data-sharing effective? How can they be improved?
  • Are current data-sharing mandates effective? How can they be improved?
  • Does the quality of shared data suffer when the act of sharing is mandated?
  • Sharing information transforms Research – does sharing information transform Researchers?
  • What motivates researchers to reuse scientific datasets? What obstacles do they encounter?
  • Does reusing data in fact lead to more efficient, focused research progress? With what caveats?
  • Are the costs of sharing data worth the benefits?

How are we to move towards making a greater impact on organizations and designers?

Three ingredients allow research to make a difference in the real world: we need to make our research relevant, actionable, and accessible.

Relevant: Choose research settings that are as concrete and realistic as possible. Don’t just survey: measure demonstrated behaviour. Don’t just mock-up: observe the users with their native applications. Don’t just invent clean tasks: study users doing the work they really do.

Actionable: Study issues where results can be directly translated into change. For example, focus on how funder and journal policies impact data sharing behaviour, rather than the correlation between data sharing behaviour and the number-of-paper-authors.

Accessible: Publish open access (OA journals, or self-archive on the web). Publish in the journals and conferences of the intended audience. Write without jargon. Organize tutorials, workshops, and birds-of-a-feather sessions at audience conferences. Send letters to journal editors. Volunteer for policy committees and design teams. Blog. Film a video. Share your data. Encourage others to do the same.

How can or should information behavior research be presented to translate effectively into the language of other information research communities?

As a new member of the information research community, I look forward to learning from the conversation!

Does sharing information transform Researchers?

Filed under: conferences — Heather Piwowar @ 10:35 am

As you can imagine, a few months and an untold number of conferences make for a lot of blogging fodder. The posts requiring new thought will have to wait, since I’m in the middle of a pilot-project crunch. But to get the ball rolling….

The ASIS&T conference looks very relevant to my research interests of data sharing and reuse. The special interest group SIG USE focuses on Information Needs, Seeking and Use… definitely something I should check out. Consequently, I decided to apply for their PhD Student Conference Travel Award. The application required a statement that addresses “an issue relating to the current year’s conference theme in relation to information behavior research (how people need, seek, manage, give, and use information in different contexts).” This year’s ASIS&T conference theme is People Transforming Information – Information Transforming People. My entry statement is below, in case it is of interest.

People Transforming Information – Information Transforming People

Sharing information transforms Research –
Does sharing information transform Researchers?

Research advances when investigators build upon the results of others. This is possible whenever research results are shared at a sufficient level of detail to allow others to understand, replicate, critique, and expand on analyses; leverage is greatest when raw research data can be explored. Unfortunately, although shared raw research data has many benefits for the general research community, as yet there are few demonstrated benefits for the original investigators who bear the costs of making their datasets available.

I wish to evaluate the transformative effect on researchers when they share their data. Do they, as one might guess, receive more citations because their published results have an expanded usefulness? Do they become more aware of other available datasets, and thus more likely to reuse shared data in the future? Do they become more likely to embrace other mechanisms, such as publishing through open access models, for making their research results widely available?

I am engaged in a long-term effort to identify instances of research data sharing and reuse. I plan to address the above questions, and others, by looking at publication patterns, covariates, and data-sharing behavior of various research communities. Through this work, I hope to quantify incentives for researchers to share their data, identify synergies between open access and open data, and highlight the need to evaluate policies and behaviors to realize the full potential of our research activities.

Travel award winners to be announced at the SIG USE symposium in October.

March 25, 2008

Identifying Data Sharing in Biomedical Literature

Filed under: MyResearch — Heather Piwowar @ 12:04 pm

I emailed AMIA again to ask for clarification on their preprint policy, and quickly received this encouraging response: “Preposting is fine so long as the other sites don’t formally publish the work.” Great news, thanks AMIA.

Note: this brings my blog up-to-date on the research I’ve been doing, with the exception of one paper under review at PLoS Medicine. That one is a complex collaboration. Despite some attempts there isn’t consensus about making it open at this point.

Here is the paper we submitted to the AMIA 2008 Annual Symposium. AMIA=American Medical Informatics Association. Nature Precedings link to appear once it has been posted.

Identifying Data Sharing in Biomedical Literature
Heather A. Piwowar and Wendy W. Chapman

Many policies and projects now encourage investigators to share their raw research data with other scientists. Unfortunately, it is difficult to measure the effectiveness of these initiatives because data can be shared in such a variety of mechanisms and locations. We propose a novel approach to finding shared datasets: using natural language processing (NLP) techniques to identify declarations of dataset sharing within the full text of primary research articles. Using regular expression patterns and machine learning algorithms on open access biomedical literature, our system was able to identify 61% of articles with shared datasets with 80% precision. A simpler version of our classifier achieved higher recall (86%), though lower precision (49%). We believe our results demonstrate the feasibility of this approach and hope to inspire further study of dataset retrieval techniques and policy evaluation.

Full text


My inspiration for this work was the idea of a Data Reuse Registry and associated research. As discussed, a DRR would benefit from automatic identification of data reuse in the biomedical literature. Unfortunately, automatic identification of data reuse is a tough place to start my NLP (natural language processing) journey because I haven’t found any large, pre-existing gold standards of data reuse to use for evaluating such a system (this list of GEO “third party” data reuse papers is a start).

Identifying data sharing is easier: there are available gold standards via database links, and authors tend to use more uniform language in describing sharing than reuse. Automatically detecting data sharing could be useful to my research in other ways as well, down the road, as I look towards further sharing policy evaluation.

This data sharing identification system used very simple NLP techniques. Hope to (and will probably need to) dig into some more complex approaches as I tackle data reuse identification.
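To make the regular-expression idea concrete, here is a minimal Python sketch. The patterns, the example sentences, and the function names are all invented for illustration; they are not the patterns or data used in the paper. The sketch flags text that mentions a GEO- or ArrayExpress-style accession number alongside deposit-style wording, then scores those flags against hand labels using precision and recall.

```python
import re

# Hypothetical patterns, for illustration only (not the paper's actual patterns).
ACCESSION = re.compile(r"\b(GSE\d+|E-[A-Z]{4}-\d+)\b")
DEPOSIT = re.compile(r"\b(deposited|submitted|available|accessible)\b", re.IGNORECASE)


def declares_sharing(text):
    """Flag text that mentions an accession number and deposit-style wording."""
    return bool(ACCESSION.search(text)) and bool(DEPOSIT.search(text))


def precision_recall(predicted, actual):
    """Return (precision, recall) for two equal-length lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented example sentences with hand-assigned labels (True = shares data).
examples = [
    ("The microarray data have been deposited in GEO under accession GSE1234.", True),
    ("Raw data are available from ArrayExpress (E-MEXP-567).", True),
    ("Our expression data are available at the lab website.", True),
    ("The previously published dataset GSE9999 is available from GEO.", False),
    ("Expression was measured with oligonucleotide arrays.", False),
]

predicted = [declares_sharing(text) for text, _ in examples]
actual = [label for _, label in examples]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Even this toy set shows the two failure modes: sharing through a lab website is missed because there is no accession number, and a sentence about reusing someone else's dataset is wrongly flagged as sharing.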

If anyone knows of other resources that list specific instances of data reuse, I’d love to hear about them!

March 21, 2008

Eating my own dogfood

Filed under: sharingdata — Heather Piwowar @ 8:51 am

I guess eating dogfood really refers to companies who use their own software, rather than researchers who apply their research topics to their own research. “Practice what I preach” is more accurate, but less fun. And more, well, preachy.

ANYWAY, the point is, as I’m doing all of this research into data sharing behaviour, I’m making a point of sharing my own data. I’m not sure that anyone will ever want to use it for anything, but who knows? Maybe. From an editorial in Nature Neuroscience [doi:10.1038/nn0807-931]:

Does anyone want your data? That’s hard to predict, but the easier it becomes to request data and to receive credit for sharing it, the more likely people are to ask. After all, no one ever knocked on your door asking to buy those figurines collecting dust in your cabinet before you listed them on eBay. Your data, too, may simply be awaiting an effective matchmaker.

It also lets me experience what it feels like to share data. It isn’t the same, I know, as sharing data from a multi-year, career-making, blood sweat and tears project, but it is something.

Sharing data is indeed hard. Specifically:

  • time consuming
  • decision-intensive (where to put it? what to share? what format to share it in?)
  • scary (what if someone finds a mistake?)
  • embarrassing (the data isn’t nearly as X as I wish I had the time to make it)

I also get to experience some of the first-hand benefits:

  • it forces additional organization
  • it helps me find my own data again later, from any computer!
  • it makes me feel proud to have made my science transparent (albeit after the fact, rather than as open notebook science)

I’m a firm believer in continual improvement. That means that I’ve shared my data now, in the best way that I have time for, rather than waiting until I can share it the way that I’d ideally like to. There are lots of things I’d like to improve:

  • Put it somewhere central and permanent (not clear where, for the esoteric dataset types that I have, but there are some neat possibilities)
  • Put it in a semantic format (!!!)
  • Document it better
  • Tag it so people can find it
  • ….

I’ll keep exploring and implementing these things as I get a chance.
If you want to put your data up but have hesitations about it, I say do it to the best of your ability right now given your current constraints. It isn’t perfect? I know, but perfect is the enemy of good enough.


  • Ditto for statistical scripts, but that’s another post.
  • Blog as data: bbgm used Dapper as a way to Semantify [the bbgm] site. Sounds fun, I’d like to try when I get a minute.
  • Have you heard this joke? “Before you criticize someone, you should walk a mile in their shoes. That way, when you criticize them, you’re a mile away and you have their shoes.” I love that one :)

March 20, 2008

Prevalence and Patterns of Microarray Data Sharing

Filed under: research — Heather Piwowar @ 1:13 pm

This poster was presented at the Pacific Symposium on Biocomputing in January (I wasn’t able to go to Hawaii to stand beside it, unfortunately!)  Data available here.  Looking forward to turning the work into a paper in the next few months.   Comments and suggestions are very welcome.

Piwowar HA, Chapman WW (2008) Prevalence and Patterns of Microarray Data Sharing. Poster at PSB 2008.


Sharing research data is a cornerstone of science. Although many tools and policies exist to encourage data sharing, the prevalence with which datasets are shared is not well understood. We report our preliminary results on patterns of sharing microarray data in public databases.

The most comprehensive method for measuring occurrences of public data sharing is manual curation of research reports, since data sharing plans are usually communicated in free text within the body of an article. Our early findings from manual curation of 100 papers suggest that 30% of investigators publicly share their full microarray datasets. Of these, 70% of the datasets are deposited at NCBI’s Gene Expression Omnibus (GEO) database, 20% at EBI’s ArrayExpress, and 10% in smaller databases or lab or publisher websites.

Next, we supplemented this manual process with a rough automated estimate of data sharing prevalence. Using PubMed, we identified research articles with MeSH terms for both “Gene Expression Profiling” and “Oligonucleotide Array Sequence Analysis” and published in 2006. We then searched GEO and ArrayExpress for links to these PubMed IDs to determine which of the articles had been credited as an originating data source.
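The automated estimate boils down to a PubMed query plus a set intersection, which can be sketched in Python. The MeSH terms and year come from the paragraph above, and `[MeSH Terms]` and `[dp]` (date of publication) are standard PubMed search field tags. The PubMed IDs are made up for illustration, and the exact query used for the poster may have differed.

```python
# MeSH terms and publication year taken from the text above.
mesh_terms = ["Gene Expression Profiling", "Oligonucleotide Array Sequence Analysis"]
year = 2006

# Build a PubMed query string using the [MeSH Terms] and [dp] field tags.
query = " AND ".join(f'"{term}"[MeSH Terms]' for term in mesh_terms)
query += f" AND {year}[dp]"
print(query)

# Hypothetical PubMed IDs standing in for the real search and database results.
pubmed_hits = {"101", "102", "103", "104", "105"}
geo_linked = {"101", "102", "999"}
arrayexpress_linked = {"102", "104"}

# Articles from the search that either database credits as a data source.
linked = pubmed_hits & (geo_linked | arrayexpress_linked)
# Articles credited by both databases.
in_both = pubmed_hits & geo_linked & arrayexpress_linked

print(f"{len(linked)} of {len(pubmed_hits)} articles linked ({len(linked) / len(pubmed_hits):.0%})")
print(f"{len(in_both)} linked from both databases")
```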

Of the 2503 articles, 440 (18%) had links from either GEO or ArrayExpress. Of these 440 articles, 70% had links from GEO and 30% from ArrayExpress, with an overlapping 12% from both GEO and ArrayExpress.

Interestingly, studies with free full text at PubMed were twice as likely (odds ratio=2.1; 95% confidence interval: [1.7 to 2.5]) to be linked as a data source within GEO or ArrayExpress as those without free full text. Studies with human data were less likely to have a link (OR=0.8 [0.6 to 0.9]) than studies with only non-human data. The proportion of articles with a link within these two databases has increased over time: the odds of a data-source link were 2.5 [2.0 to 3.1] times greater for studies published in 2006 than for those published in 2002.
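For readers unfamiliar with odds ratios, here is a minimal Python sketch of how one is computed from a 2x2 table, along with a Wald 95% confidence interval. The counts are hypothetical, since the underlying tables are not given here, and the Wald interval is an assumption: the text does not say which method produced the reported intervals.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a, b: linked / not linked in the group of interest
    c, d: linked / not linked in the comparison group
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only.
or_, lo, hi = odds_ratio_ci(a=20, b=80, c=10, d=90)
print(f"OR={or_:.2f} [{lo:.2f} to {hi:.2f}]")
```

An interval that excludes 1, like the ones reported above, suggests the difference between the two groups is unlikely to be due to chance alone.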

As might be expected, studies with the fewest funding sources had the fewest data-sharing links: only 28 (6%) of the 433 studies with no funding source were listed within GEO or ArrayExpress. In contrast, studies funded by the NIH, the US government, or a non-US government source had data-sharing links in 282 of 1556 cases (18%), while studies funded by two or more of these mechanisms were listed in the databases in 130 out of 514 cases (25%).
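The counts in this paragraph can be checked against the totals reported earlier (440 linked articles out of 2503) with a few lines of Python. Only the numbers come from the text; the dictionary layout and group labels are illustrative.

```python
# (studies with a data-sharing link, total studies), as reported in the text.
funding_groups = {
    "no funding source": (28, 433),
    "NIH, US government, or non-US government": (282, 1556),
    "two or more of these mechanisms": (130, 514),
}

total_linked = sum(linked for linked, _ in funding_groups.values())
total_studies = sum(total for _, total in funding_groups.values())

# Proportion of studies with a data-sharing link in each funding group.
for group, (linked, total) in funding_groups.items():
    print(f"{group}: {linked}/{total} = {linked / total:.0%}")

print(f"all studies: {total_linked}/{total_studies} = {total_linked / total_studies:.0%}")
```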

In summary, our initial manual approach for identifying studies which shared their data was comprehensive but time-consuming; natural language processing techniques could be helpful. Our subsequent automated approach yielded conservative estimates for total data sharing prevalence, nonetheless revealing several promising hypotheses for data sharing behavior.

We hope these preliminary results will inspire additional investigations into data sharing behavior, and in turn the development of effective policies and tools to facilitate this important aspect of scientific research.

Poster PDF.

A review of journal policies for sharing research data

Filed under: MyResearch — Heather Piwowar @ 1:00 pm

Inspired by the reception to this blog post, I systematically reviewed journal data sharing policies with gene expression microarray data as a use case. The brief and extended abstracts are below. Supplementary information is here. Full paper to be written prior to presentation in Toronto this June. I’m planning to finish writing the paper in the open, so I’d love to hear your comments.

ETA: Now up at Nature Precedings. ps mom ETA = edited to add

Piwowar HA, Chapman WW (2008) A review of journal policies for sharing research data. Accepted to ELPUB2008 (International Conference on Electronic Publishing): Open Scholarship: Authority, Community and Sustainability in the Age of Web 2.0

Background: Sharing data is a tenet of science, yet commonplace in only a few subdisciplines. Recognizing that a data sharing culture is unlikely to be achieved without policy guidance, some funders and journals have begun to request and require that investigators share their primary datasets with other researchers. The purpose of this study is to understand the current state of data sharing policies within journals, the features of journals which are associated with the strength of their data sharing policies, and whether the strength of data sharing policies impacts the observed prevalence of data sharing.
Methods: We investigated these relationships with respect to gene expression microarray data in the journals that most often publish studies about this type of data. We measured data sharing prevalence as the proportion of papers with submission links from NCBI’s Gene Expression Omnibus (GEO) database.
We conducted univariate and linear multivariate regressions to understand the relationship between the strength of data sharing policy and journal impact factor, journal subdiscipline, journal publisher (academic societies vs. commercial), and publishing model (open vs. closed access).
Results: Of the 70 journal policies, 18 (26%) made no mention of sharing publication-related data within their Instruction to Author statements. Of the 42 (60%) policies with a data sharing policy applicable to microarrays, we classified 18 (26% of 70) as weak and 24 (34% of 70) as strong.
Existence of a data sharing policy was associated with the type of journal publisher: half of all commercial publishers had a policy compared to 82% of journals published by an academic society. All four of the open-access journals had a data sharing policy. Policy strength was associated with impact factor: the journals with no data sharing policy, a weak policy, and a strong policy had respective median impact factors of 3.6, 4.5, and 6.0. Policy strength was positively associated with measured data sharing submission into the GEO database: the journals with no data sharing policy, a weak policy, and a strong policy had median data sharing prevalence of 11%, 19%, and 29% respectively.
Conclusion: This review and analysis begins to quantify the relationship between journal policies and data sharing outcomes and thereby contributes to assessing the incentives and initiatives designed to facilitate widespread, responsible, effective data sharing.

Extended abstract:

