Research Remix

January 11, 2013

Process behind a Nature Comment

Filed under: Uncategorized — Heather Piwowar @ 8:55 am

Publishing a Comment in Nature involved a process unlike any I’ve experienced to date, so I figured I’d document it (Comment itself is here).   I wish more people would document the story behind their papers (and #OverlyHonestMethods :) ), and also the process behind their scientific communication to help us all peek behind the curtains.  Or, yknow, take down the curtains.


I received an email from a Nature editor on November 1:

[..] I’m an editor in the Comment section at Nature, which features opinions by scientists. [..] I’m writing because a few issues have popped up that we thought you might have some insights on [..]

We’re interested in exploring a piece about the NSF’s decision to change “papers” to “projects” in scientists’ list of achievements. [..]

Does this topic spark any interest? If so, let’s chat – we’d want to time something to the first of the year, when the NSF change goes into effect. [..]

(I won’t name the editor because I don’t want to catch her unawares…. I’m not sure if it is appropriate to name her, so I’ll err on the side of not.  She was very skilled and pleasant to work with, fwiw!)

Needless to say, the proposed topic is of interest to me and a Comment in Nature seemed like a great way to reach a broad and “traditional” audience with my thoughts on where this is going.  We set up a phone call for later in November.

Writing and Editing

The editor and I had a 15 minute call about my thoughts on the topic, and also about how Comments work.

I mentioned that I’d recently given a brief talk about implications of the NSF Biosketch policy change.  She suggested I send that along to her, and she’d reply with a paragraph-by-paragraph suggestion on how I compose a first draft of the comment.

The editor sent me a reply that had a surprisingly detailed outline:

Starting with the text you sent about your talk is great – it’s a good tone and level for our readership. We can just build on that. [..]

First paragraph: “hook” the reader. Like feature and news stories, or even editorials in a newspaper (which is really our model here), we need something that will “grab” the reader, make them want to [..]

Next 1-2 paragraphs: Describe the NSF change in policy, for readers who aren’t familiar with it [..]

Third paragraph: Present the crux of your argument: I think this change in NSF policy, along with other examples mentioned, indicate X [..]

Background, 2-3 paragraphs: Present examples of the changes [..]

Next 2-3 paragraphs: Explain more why these changes are so significant for science. Here is where you’ll put [..]

Final paragraphs: Here, we present “solutions.” How should things change further? What direction would [..]

Wow!  ok, sure, if that is how it works, I can do that.  So I pulled together a first draft, which I’ve posted here.  That’s when it got intense.  I’ve never had anything so heavily edited.  In addition to emailing drafts back and forth, we had two (or three? I forget) quick phone calls where the editor asked me clarification questions, then she’d send me another draft.  It took five revisions till it was time for her to pass it to her boss and the subeditors.

The subeditor was also great.  She sent me a revised version, at this point laid out as a PDF, and I replied with a list of changes needed to maintain accuracy given the new edits.  There were about 3-4 more versions after this, with small changes.

Overall, I’d say this whole process made the resulting paper much more readable than it started.  It also changed the focus a bit, to having a stronger altmetrics focus, rather than being primarily about the alternative products.  I’m ok with that, though I do mourn some of the details in the original draft that didn’t make it into the final version.  I do kinda feel like the editor should be a coauthor, for what it is worth…. I think we’ve all had coauthors who did less than she did!  It feels a little strange that there is so much behind-the-scenes help in crafting these articles and that it isn’t transparent at all.

One area where I had no clear say was the title and subheading.  These went through 2-3 titles and 4-5 subheading phrasings and placements in the versions I saw.  I did object to one of the versions (“creeping changes”), but in general it wasn’t clear that the title was my decision.  I didn’t know that the title in the HTML version and CrossRef was going to be prefixed with “Altmetrics:”, because the PDF copy I saw was simply titled “Value all research products”.  I’m a little unhappy about the leading “Altmetrics:” because I think it complicates the main thrust of the piece and makes it easy for people to get tangled up, for example about whether blog posts are alt-products or sources of altmetrics (answer: both).  Oh well, that’s ok: altmetrics is sexy, it makes sense to lead with it, and I’m certainly a big believer!


Because the article was due out just after the holiday break, with a fixed publication date to coincide with the new policy implementation on Jan 14, the turn-around time I had for many of these revisions was very short (10 days for the initial draft, a few days for revisions, and near the end less than a day for final revisions).  This was fine with me; I just note it so that others will know what they are getting into.

Copyright and Paywall

The other point I want to mention here is how copyright works with Comments.   I admire Nature’s copyright policy for research articles, given that it is not an open-access journal:  they do not require that authors sign away their copyright; instead they ask that authors grant Nature an exclusive license to publish.

Nature has a different policy for Comments: you have to sign your copyright over to Nature.  As a huge proponent of Open Access, I thought long and hard about whether I was ok with this.  I decided that for this editorial content I was.  Happy to discuss :)

Here is the form that I signed: UK Comment CA.  I uploaded it because I did not sign an NDA, and I know that I would have liked to find it online when Nature first contacted me, to help me understand the details of the agreement I’d be entering.

The first editor who contacted me knew that I am a strong supporter of OA.  Though she said that it would not be possible to make this comment OA, she said that we could nominate it to be one of the “free” articles.  I held fast to this, and in mid December followed up with her to make sure the nomination had indeed been made.  She was happy to do so, and asked me for a one-sentence justification for why this paper should be freely available, to be used by the group that makes the decisions.  Not sure I knocked this out of the park, but fwiw here’s what I sent:

People will likely circulate this article outside academia, since altmetrics is about valuing broad contributions to science, and broad interactions with science — high school viewers of wetlab YouTube videos, silicon valley dotcom contributors to science source code repositories, etc.

The good news is that they did decide to make my article free for “at least a week.”  It wasn’t free when it first went up, interestingly, but the paywall page stopped appearing within 12 hours.

One more thing for completeness.  I’ve heard some people are paid a small amount for Comments?  I’m not sure if that is true or not.  In any event:  money was never mentioned to me, and I wasn’t paid anything.

So there ya go.  Now you know everything I know about how Nature Comments work.

January 10, 2013

First draft of just-published Value all Research Products

Filed under: Uncategorized — Heather Piwowar @ 9:00 am

The copyright transfer agreement (arg) I signed for the Comment in Nature included restrictions on where I may post a copy of the article:

Although ownership of all Rights in the Contribution is transferred to NPG, NPG hereby grants to the Authors a licence […]
c) To post a copy of the Contribution as accepted for publication after peer review (in Word or Tex format) on the Authors’ own web site, or the Authors’ institutional repository, or the Authors’ funding body’s archive, six months after publication of the printed or online edition of the Journal, provided that they also link to the Journal article on NPG’s web site (eg through the DOI).

The article is available for free for a week or two on Nature’s site, and I’ll post the text here as soon as I can, six months from now.

In the meantime, as per contract lingo above, I may post the first draft that I sent the Nature editors.  So here is the first draft, for the benefit of those who are looking for a free version in the first half of 2013, and for anyone who cares to compare the first draft to the final draft :)   [Hint: there were MANY rounds of editing.  more on that in next post…. ]

NSF policy welcomes alt-products, increases need for altmetrics

(or perhaps NSF welcomes bragging about software, datasets in proposals)

Research datasets and software no longer have to masquerade as research papers to get respect.  Thanks to an imminent policy change at the NSF, non-traditional research products will soon be considered first-class scholarly products in their own right, and worth bragging about.  This policy change will prove a key incentive to produce and disseminate alternative products, and have far-reaching consequences in how we assess research impact.

Starting January 14th, the NSF will begin to ask Principal Investigators to list their research Products rather than Publications in the Biosketch section of funding proposals.  Datasets and software are explicitly mentioned as acceptable products in the new policy, on par with research articles.

The policy update reflects a general increase in attention to alternative forms of scholarly communication.  Policies, repositories, tools, and best practices are emerging to support an anticipated increase in dataset publication, spurred, in part, by now-required NSF data management plans.  Tools for literate programming, reproducible research, and workflow documentation continue to improve, highlighting the need for shared software.  Open peer review, online lab notebooks, post-publication discussion — as it gets easier to “publish” a wide variety of material online it becomes easy to recognize the breadth of our intellectual contributions.

I believe in the long run this policy change from Publications to Products will do much more than just reward an investigator who has authored a popular statistics package.  It is going to change the game, because it is going to change how we assess research impact.

The change starts by welcoming alternative products.  The new policy welcomes datasets, software, and other research output types in the same breath as publications: “Acceptable products must be citable and accessible including but not limited to publications, data sets, software, patents, and copyrights. Unacceptable products are unpublished documents not yet submitted for publication, invited lectures, and additional lists of products.”  In contrast, previous versions of the Biosketch instructions allowed fewer types of acceptable products (“Patents, copyrights and software systems”) and considered their inclusion a “substitution” for the main task of listing research paper publications.

The next step will become apparent when we consider what peer reviewers will want to know when they see these alternative products in a Biosketch.  What is this research product?  Is it any good?  What is the size and type of its contribution?  We often assess the quality and impact of a traditional research paper based on the reputation of the journal that published it.  In fact the UK Engineering and Physical Sciences Research Council makes this clear in its fellowship application instructions: “You should include a paragraph at the beginning of your publication list to indicate … Which journals and conferences are highly rated in your field, highlighting where they occur in your own list.”

Including alternative products will change this: it necessitates a move away from assessment based on journal title and impact-factor ranking.  Data and software can’t be evaluated with a journal impact factor — repositories seldom select entries based on anticipated impact, they don’t have an impact factor, and we surely don’t want to calculate one and propagate the poor practice of judging the impact of an item by the impact of its container.  For alternative products, item-level metrics are going to be key evidence for convincing grant reviewers that a product has made a difference.  The appropriate metrics will be more than just citations in research articles: because alternative products often make impact in ways that aren’t fully captured by established attribution mechanisms, alternative metrics (altmetrics) will be useful to get a full picture of how research products have influenced conversation, thought, and behaviour.

The ball will bounce further.  Once altmetrics and item-level metrics become expected evidence to help assess the impact of alternative products, the use of item-level altmetrics will bounce back to empower innovations in the publication of traditional research articles.  Starting a new or innovative journal is risky: many authors are hesitant to publish their best work somewhere unusual, somewhere without a sky-high impact factor.  When research is evaluated based on its individual post-publication reception, innovative journals become attractive, perhaps competitively more attractive than staid, run-of-the-mill alternatives.  Reward for innovative journals will result in more innovations in publishing.  Heady stuff!

A few large leaps are needed to realize this future, of course.  First, this one policy change hardly represents a consistent message across the NSF.  Accomplishment-Based Renewals are still based on “six reprints of publications”, with no mention of alternative products.  Even in the Grant Proposal Guide, the same document that houses the new Products policy, the instructions for the References Cited section are written as if only research articles would be cited in a grant proposal.  What about preliminary data on figshare, or supporting software on RunMyCode, or a BioStar Q&A solution, or a patent, or a blog post, or, for that matter, an insightful tweet?  If we think these products are potentially valuable, the NSF should welcome and encourage their citation anywhere it might be relevant.

The second hurdle is that a policy welcoming the recognition of alternative products is not yet common outside the NSF.  A brief investigation suggests that many other funders — including the NIH, HHMI, Sloan, and the UK MRC — still explicitly ask for a list of research papers rather than products.  A few, like the Wellcome Trust and UK BBSRC, just seem to ask broadly for a CV, leaving the decision about its contents to the investigator.  This could be good, but because investigators are not used to considering alternative products to be first-class citizens, explicit welcoming is important to drive change.

The third challenge between us and this new future brings us to an exciting area under active development.  When products without journal-title touchpoints start appearing in Biosketches, how will reviewers know whether they should be impressed?  Reviewers can (and should!) investigate each research product itself and evaluate it with their own domain expertise.  But what if an object is in an area outside their expertise?  They need a way to tap into the opinion of experts in that domain.  Furthermore, beyond the intrinsic quality of the work, how will reviewers know whether the Intellectual Merit has indeed been impactful on scholarship and the world, and thus should lend credence to the proposal under consideration?

Many data and software repositories keep track of citations and download statistics.  Some repositories, like ICPSR, go a step further and provide anonymous demographic breakdowns of usage to help us move beyond “more is better” to an understanding of the flavour of the attention.  This context will become richer as more types of engagement are added:  is the dataset being bookmarked for future use?  Who is cloning and building on the open software code?  Are blog posts being written about the contribution?  Who is writing them and what do they say?

Tools are available today to collect and display this evidence of impact.  Thomson Reuters’ Data Citation Index aggregates citations to datasets that have been identified by data repositories.  Another service identifies blog posts, tweets, and mainstream media attention for datasets with a DOI or handle: try it out using their bookmarklet.  The nonprofit organization ImpactStory tracks the impact of datasets, software, and other products, including blog and twitter commentary, download statistics, and attribution in the full text of articles: give it a try.  I’m a cofounder of ImpactStory: we as scientists need to go beyond writing editorials on evaluation and actually start building the next generation of scholarly communication infrastructure.  We need to create business models for infrastructure that support open dissemination of actionable, accessible and auditable metrics for research and reuse.

Finally, the practice shift to value broad impact will be more rapid and smooth if funders and institutions explicitly welcome broad evidence of impact.  Principal investigators should be tasked with making the case that their research has been impactful.  Most funders, including the NSF, do not currently ask for evidence of impact.  This may be changing: the NIH issued an RFI earlier this year on BioSketch changes that would include documenting significance.  In the meantime, the lack of an explicit welcome hasn’t stopped cutting-edge investigators from augmenting their free-form CVs and annual reviews to mention that their work has been “highly accessed” or received a F1000 review.  This — and next generation evidence with context — should be explicitly welcomed.

Despite these hurdles, the future is not far away.  You and I can start now.  Create research products, publish them in their natural form without shoehorning everything to look like an article, make citation information clear, track impact, and highlight diverse contributions when we brag about our research.  We’re on our way to a more useful and nimble scholarly communication system.

Just published: Value all research products

Filed under: Uncategorized — Heather Piwowar @ 8:05 am

A Nature editor contacted me in November, asking if I’d like to write a Comment about the upcoming NSF policy change in Biosketch instructions.  It sounded like a great chance to talk about the value of alternative research products with a wide audience, so I agreed.  The comment was published yesterday and is now available here:

Piwowar H. (2013). Value all research products. Nature, 493(7431), 159. DOI:

Because of Nature’s policies about copyright assignments for Comments, the comment is not open access and it is behind a paywall.  Arg.  That said, I requested that it be one of their “free” articles and they agreed, so it will be freely available at the above link for a week or two.  I will post the text up on my website as soon as I am able, 6 months from now.

Working on a blog post about the process behind the scenes, because it was certainly unlike anything else I’ve published to date!

Questions about the piece, or thoughts or opinions?  Welcome below, or on twitter to @researchremix.

July 16, 2012

Many datasets are reused, not just an elite few

Filed under: Uncategorized — Heather Piwowar @ 8:04 am

I’ve recently collected new data on data reuse.  Using the same methods as our Nature letter-to-the-editor analysis, I’ve looked for reuse of gene expression microarray data in PubMed Central by searching for dataset ID numbers in the full text of studies.  Studies that mention a dataset accession number but share author last names with those who deposited the dataset are excluded.
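The detection logic described above can be sketched roughly like this (a minimal sketch: the function name, the accession-number pattern, and the last-name overlap rule are my illustrative assumptions, not the study's actual code):

```python
import re

# GEO accession numbers for series/datasets look like GSE12345 or GDS123.
ACCESSION_RE = re.compile(r"\b(?:GSE|GDS)\d+\b")

def find_candidate_reuses(fulltext, citing_last_names, depositors_by_accession):
    """Return accession numbers mentioned in a paper's full text,
    excluding datasets whose depositors share a last name with the
    citing paper's authors (likely self-reuse, not third-party reuse)."""
    citing = {name.lower() for name in citing_last_names}
    reuses = []
    for accession in set(ACCESSION_RE.findall(fulltext)):
        depositors = {n.lower() for n in depositors_by_accession.get(accession, [])}
        if citing & depositors:
            continue  # shared last name: treat as the original team, skip
        reuses.append(accession)
    return sorted(reuses)
```

Run over every full-text article in the corpus, this yields the conservative, attributed-by-accession-number reuse counts discussed below.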

The new results look at datasets deposited into the Gene Expression Omnibus (GEO) repository between 2001 and 2009.

Results for the middle years are particularly important, since by then GEO had a lot of datasets, and between then and now there has been enough time for reuse to accumulate.  We observed reuse of more than 20% of the datasets deposited in 2003 and 17% of datasets deposited in 2007.

Note: the method used to detect reuse here is VERY CONSERVATIVE so these are minimum estimates.  It only finds reuses by papers that are in PubMed Central, and only those that are attributed by mentioning the accession number (it misses those attributed by citation to the article, for example).  Nonetheless, it does serve as a lower bound.

Analysis of the accession number mentions revealed that data reuse was driven by a broad base of datasets: about 20% of the datasets deposited between 2003 and 2007 have been reused by third parties. We note these proportions are gross underestimates since they only include reuses we observed as accession number mentions in PubMed Central; no attempt has been made to extrapolate these distribution statistics to all of PubMed, or to reflect attributions through citations. Further, many important instances of data reuse do not leave a trace in the published literature, such as those in education and training. Nonetheless, even these conservative estimates suggest that reuse finds value in a wide range of datasets, not simply a “very reusable” elite.

(manuscript-in-progress with co-author Todd Vision)

July 13, 2012

Concrete options for a society journal to go OA

Filed under: Uncategorized — Heather Piwowar @ 10:59 am

AMIA‘s society journal, JAMIA (the Journal of the American Medical Informatics Association), is considering going Open Access. I’ve been invited to be part of the OA explorations task force.

All taskforce members agreed I could blog our process. In fact, they look forward to hearing suggestions from all of you! So here goes, first installment. Our report is due in September.

My main job on the task force is to outline the available alternatives. Below are my getting-started notes.

What options am I missing? Does anyone already have details for any of these options? Advice for JAMIA if you have been here, done this?


well-defined alternatives:

Three major options seem to be: publish JAMIA with an existing publisher of OA journals, run it independently through a self-hosted journal management system, or run it through a third-party hosted journal management system.

For reference, a SPARC review of scholarly OA journals in 2011 found that Springer published 9 society OA journals, Copernicus published 15, WASET published 21, BioMed Central published 33, and MedKnow published 64. I’m not sure what proportion ran on a self-hosted or externally hosted platform, but OJS lists many journal users.

Links are to the “contact us about your society journal” pages:


We were told to think out of the box.  Excellent!  So, perhaps JAMIA could publish in an OA megajournal within a JAMIA Collection or tag, or ask for modification of terms of a well-defined option?

perhaps out of scope:

related issues


license

  • license that facilitates most reuse is CC-BY.
  • compromises would decrease use but potentially facilitate reprint revenue (ie BMJ)


embargo

  • immediate access facilitates most reuse.
  • compromises would decrease use but potentially facilitate subscription revenue (ie RUP)

editorial content

  • editorials available as OA facilitates most reuse
  • compromises would decrease visibility but potentially facilitate subscription revenue (ie BMJ)


advertising

  • could host advertisements and get partial revenue (ie BMC)
  • could charge readers to view without advertisements


author-side fees

  • many OA journals have an automatic waiver or subsidy for authors from low-income countries (ie BMC, BMJ Open). Some also offer a subsidy or waiver upon request (ie BMC). At least one offers a guaranteed waiver for those who cannot afford to pay (PLoS).
  • the majority of society publishers do not charge any author-side fees


print

  • many OA journals are online-only.  JAMIA is currently available online and in print.  Is print needed? Are there options available for print-on-demand?

publishing-charge subsidy

  • as an AMIA membership benefit, could offset article processing charges (ie BMC)

other related revenue possibilities

  • could release openly, have HTML available for free, but charge per-article or membership fee for PDF access (ie JMIR)
  • could charge for expedited peer-review (ie JMIR)
  • could charge submission fees in general
  • could charge for iphone apps, etc

info so far



  • OJS, includes hours/week survey results

open questions

Many. A few:

  • How much of the back content could become OA? Is the copyright currently AMIA’s or BMJ’s? Answer: AMIA’s.

thoughts and observations

  • The OASPA resources section is a little light, and the blog was last updated in 2011. I’d say there have been a few OA events of note since then :) Upcoming conference in Hungary in September.
  • This is a less well-trodden path than I thought… I’ve made initial contact with most of the organizations above, and none of them immediately zoomed me a how-to sales package (one or two were quick, but for most of the publishers I’ve contacted it has been 4 days with no response yet).
  • AMIA could join SPARC as an affiliate society: $5,710 contribution per calendar year. SPARC is active in advocating for funder mandates for OA, which would likely bring about greater funder support for processing charges.
  • this is timely: two recent blog posts about OA and societies.  One by Mike Taylor, one by the Scholarly Kitchen.  There are other white papers etc also.  I’ll hopefully get a chance to recap them in a future post.

Edited July 16 to add a few things

July 9, 2012

makingdatacount: Outline #draftInProgress

Filed under: Uncategorized — Heather Piwowar @ 11:25 am

I’ve got a few manuscripts on the go this month.  One of them is on the state of Data Citation Tracking, making the same points as my IDCC talk last year and the recent DataCite presentation by Scott Edmunds of GigaScience (tracking stuff starts at slide 45).

Here’s the draft outline.  Obvious things missing?

Making data citation Count

  • 1. Why it matters
    • Encouraging more data archiving
    • Rewarding production and dissemination of useful data
    • Enabling fine-grained reward for all contributors
    • Discovering associated datasets and researcher communities
    • Filtering for frequently used — or neglected! — datasets
    • Correcting analyses based on erroneous data
    • Avoiding harmful shoehorning
    • Driving policy, funding, and tool requirements based on evidence
  • 2. Obstacles
    • Awareness
    • Encouragement and expectation
    • Agreement on best practices
    • Existing problematic policies
    • Tracking tools
    • Access to the literature to build tracking tools
  • 3. What we want to Count
    • Dataset-level metrics
    • Project-level metrics
    • Repository impact story rather than Repository impact factor
    • Reuses from outside the literature
    • Reuses from outside academia
    • Reuses of the reuses
    • Impact flavour
  • 4. Conclusion

July 3, 2012

Citation11k: Method section — access to citation data #draftInProgress

Filed under: Uncategorized — Heather Piwowar @ 8:27 am

The next installment in my #draftInProgress series on Open Data citation.

I’m not sure this section will make it into the paper in its entirety, though I do think it is important to highlight the serious hurdles in getting access to data for research on research.

This step of the methods was certainly the most time-consuming part of the study!

Methods: citation data

This study required citation counts for thousands of articles identified through PubMed IDs. At the time of data collection, neither Thomson Reuters’ Web of Science nor Google Scholar supported this type of query. It was (and is) supported by Elsevier’s Scopus citation database. Alas, none of our affiliated institutions subscribed to Scopus. Scopus does not offer individual subscriptions, and a personal email to a Scopus Product Manager went unanswered.

One author (HAP) attempted to use the British Library’s walk-in access of Scopus on its Reading Room computers during a trip overseas. Unfortunately, the British Library did not permit any method of electronic transfer of our PubMed identifier list onto the Reading Room computers, including internet document access, transferring a text file from a USB drive, or using the help desk as an intermediary (see related policies). The Library was not willing to permit an exception in this case, and we were unwilling to manually type ten thousand PubMed identifiers into the Scopus search box in the Reading Room.

HAP eventually obtained Scopus access through a Research Worker agreement with Canada’s National Science Library (NRC-CISTI), after being fingerprinted to obtain a police clearance certificate (required because she’d recently lived in the USA for more than six months).

At the time of data collection the authors were not aware of any way to retrieve Scopus data through researcher-developed computer programs, so we queried and exported Scopus citation data manually through the Scopus website. The website limited both the length of a query and the number of citations that could be exported at once. To work within these restrictions we concatenated up to 500 PubMed IDs at a time into 22 queries, where each query took the form “PMID(1234) OR PMID(5678) OR …”
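The chunked query construction is simple to sketch (a sketch only — the actual study built and ran these queries by hand through the Scopus website, and the function name here is hypothetical):

```python
def build_scopus_queries(pmids, chunk_size=500):
    """Concatenate PubMed IDs into OR'd Scopus advanced-search queries,
    at most chunk_size IDs per query, to stay under the query-length limit."""
    queries = []
    for start in range(0, len(pmids), chunk_size):
        chunk = pmids[start:start + chunk_size]
        queries.append(" OR ".join(f"PMID({p})" for p in chunk))
    return queries
```

With the 10,694 PubMed IDs in this study, chunks of 500 give 22 queries, matching the count above.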

Citation counts for 10,694 papers were gathered from Scopus in November 2011.

July 2, 2012

Citation11k: Method section — study attributes #draftInProgress

Filed under: Uncategorized — Heather Piwowar @ 6:37 pm

The next installment in my #draftInProgress series on Open Data citation advantage.  I think this section can be short and sweet.


Methods: study attributes

Piwowar 2011 collected 124 attributes for each of the gene expression microarray studies in our sample.  The subset of attributes previously shown or suspected to correlate with citation rate were included in the current analysis:

  • date of publication
  • journal
  • journal impact factor (2008)
  • journal open access status
  • size of the journal
  • number of authors
  • years since first publication by the first and last author
  • number of papers published by first and last author
  • number of PubMed Central citations received by first and last author
  • country of corresponding author
  • institution of corresponding author
  • institution mean citation score
  • study topic (human/animal study, cancer/not cancer, etc.)
  • NIH funding of the study, if applicable


Citation11k: Method section — assessment of data availability #draftInProgress

Filed under: Uncategorized — Heather Piwowar @ 5:35 pm

The third installment in my #draftInProgress series on Open Data citation advantage.  I reread the methods description in my Who Shares paper and decided to just excerpt it directly for the method details (related thoughts on self-plagiarism and OA).

Methods: assessment of data availability

The independent variable of interest in this analysis is the availability of gene expression microarray data.  Data availability had been previously determined for our sample articles in Piwowar 2011, so we directly reused that dataset [Piwowar Dryad 2011].  This study limited its data hunt to just the two predominant gene expression microarray databases: NCBI’s Gene Expression Omnibus (GEO), and EBI’s ArrayExpress.

“An earlier evaluation found that querying GEO and ArrayExpress with article PubMed identifiers located a representative 77% of all associated publicly available datasets [Piwowar 2010]. [We] used the same method for finding datasets associated with published articles in this study: [we] queried GEO for links to the PubMed identifiers in the analysis sample using the “pubmed_gds [filter]” and queried ArrayExpress by searching for each PubMed identifier in a downloaded copy of the ArrayExpress database. Articles linked from a dataset in either of these two centralized repositories were considered to have [publicly available data] for the endpoint of this study, and those without such a link were considered not to have [available] data.” [Piwowar 2011]


Piwowar H, Chapman W (2010) Recall and bias of retrieving gene expression microarray datasets through PubMed identifiers. J Biomed Discov Collab 5: 7–20.

Piwowar HA (2011). Data from: Who shares? Who doesn’t? Factors associated with openly archiving raw research data. Dryad Digital Repository :

Piwowar HA (2011). Who shares? Who doesn’t? Factors associated with openly archiving raw research data. PLoS ONE, 6 (7) :

Citation11k: Method section — which studies? #draftInProgress

Filed under: Uncategorized — Heather Piwowar @ 4:53 pm

The second installment in my #draftInProgress series on Open Data citation advantage.  About one fourth of the methods section:

Methods: Which studies?

The primary analysis in this paper examines the citation count of a gene expression microarray experiment, relative to availability of the experiment’s data.

The sample of microarray experiments used in the current analysis was previously determined (Piwowar 2011 PLoS ONE, data from Piwowar 2011 Dryad).  Briefly, a full-text query uncovered papers with keywords associated with relevant wet-lab methods.  The full-text query had been characterized with high precision (90%, 95% confidence interval 86% to 93%) and a moderate recall (56%, 52% to 61%) for this task.  Running the query in PubMed Central, HighWire Press, and Google Scholar revealed 11,603 distinct gene expression microarray papers.  The papers were published between 2000 and 2009.
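For anyone reproducing intervals like those quoted above, the Wilson score interval is one standard choice for a binomial proportion (a sketch; the sample size behind the published intervals isn't restated here, so the example counts below are purely illustrative):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion
    (z=1.96 gives an approximate 95% interval)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half
```

For example, 90 correct calls out of 100 gives an interval of roughly 0.83 to 0.94.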

The current analysis retained papers published between 2001 and 2009.

Piwowar HA (2011). Who shares? Who doesn’t? Factors associated with openly archiving raw research data. PLoS ONE, 6 (7) :

Piwowar HA (2011). Data from: Who shares? Who doesn’t? Factors associated with openly archiving raw research data. Dryad Digital Repository :

