Research Remix

July 2, 2012

citation11k: intro #draftInProgress

Filed under: Uncategorized — Heather Piwowar @ 3:18 pm

Writing a blog post is waaaaay easier than writing a paper.  I know this because I’ve been trying to write a paper all day.  Well, for months actually, around other things.  But today alone I have tried cajoling myself through early morning silence, lattes, biscotti, email breaks, no email breaks, twitter, no twitter, stern talking tos, walks, timers, and most recently a mocha with whipped cream.  Tellin ya, I’m pulling out all the stops.  But still my progress on the methods section is sad.  My new plan: blog it as I write it!

So, dear readers, here are my draft sections, as I write them.  I’m going to write them for you, because writing for you is way more fun than writing for the peer reviewer in my head.  Then I can edit them into proper paper-speak, and improve them through additional feedback from my co-author.  (Hi Todd!)  Note the ideas here already include contributions from Todd Vision, but all the messy thoughts and expressions are mine.

Needless to say: feedback very welcome!

With no further ado.  The working title and introduction:

Data Reuse and the Open Data citation advantage

“Sharing information facilitates science. Publicly sharing detailed research data–sample attributes, clinical factors, patient outcomes, DNA sequences, raw mRNA microarray measurements–with other researchers allows these valuable resources to contribute far beyond their original analysis. In addition to being used to confirm original results, raw data can be used to explore related or new hypotheses, particularly when combined with other publicly available data sets. Real data is indispensable when investigating and developing study methods, analysis techniques, and software implementations. The larger scientific community also benefits: sharing data encourages multiple perspectives, helps to identify errors, discourages fraud, is useful for training new researchers, and increases efficient use of funding and patient population resources by avoiding duplicate data collection.” [Piwowar, 2007]

Making research data publicly available also has costs. Data archives must be created and maintained. Data must be documented, formatted, and uploaded. Data-collecting investigators may be asked to answer questions when others try to use their data.

Scientists report that receiving more citations would be an important motivator for publicly archiving their data [Tenopir].

Several studies across several disciplines have found an association between data availability and number of citations received by a publication [cite studies below]. This evidence has been frequently referenced, including in new policies that encourage and require data archiving []. It is important, therefore, to continue to strive for an accurate estimate of possible citation benefit.

The present study aims to improve on previous estimates in several ways. First, it is large enough to include many key covariates that may have confounded estimates of citation boost in previous, smaller studies: number of authors, author publication experience, institution, open access availability, and subject area. Second, the current analysis estimates how citation boost levels may change over time. Third, the current analysis includes evidence on the number of citations that may be due to data reuse.

Clinical microarray data provides a useful environment for the investigation: despite being valuable for reuse [Dudley] and well-supported by data sharing standards and infrastructure [Barrett], fewer than half of the studies that collect this data make it publicly available [Ochsner, Piwowar 2011].

Studies of citation benefit:

  • Gleditsch, Nils Petter & Håvard Strand, 2003. ‘Posting Your Data: Will You Be Scooped or Will You Be Famous?’, International Studies Perspectives 4(1): 89–97.
  • Henneken, Edwin A and Accomazzi, Alberto. Linking to Data – Effect on Citation Rates in Astronomy. eprint arXiv:1111.3618 11/2011
  • Ioannidis et al. Repeatability of published microarray gene expression analyses. Nature Genetics 41, 149–155 (2009). doi:10.1038/ng.295
  • Pienta et al. The Research Data Life Cycle and the Probability of Secondary Use in Re-Analysis.
  • Amy M. Pienta, George Alter, Jared Lyle. The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data.
  • Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
  • Milia N, Congiu A, Anagnostou P, Montinaro F, Capocasa M, et al. (2012) Mine, Yours, Ours? Sharing Data on Human Genetic Variation. PLoS ONE 7(6): e37552. doi:10.1371/journal.pone.0037552

Other refs mentioned above:

  • Ochsner, S. A., Steffen, D. L., Stoeckert, C. J., & McKenna, N. J. (2008). Much room for improvement in deposition rates of expression microarray datasets. Nature Methods.
  • Tenopir C, Allard S, Douglass K, Aydinoglu AU, Wu L, et al. (2011) Data Sharing by Scientists: Practices and Perceptions. PLoS ONE 6(6): e21101. doi:10.1371/journal.pone.0021101
  • Piwowar HA (2011) Who Shares? Who Doesn’t? Factors Associated with Openly Archiving Raw Research Data. PLoS ONE 6(7): e18657
  • Dudley JT, Tibshirani R, Deshpande T, Butte AJ (2009) Disease signatures are robust across tissues and experiments. Molecular Systems Biology 5: 307
  • Tanya Barrett, Dennis B Troup, Stephen E Wilhite, Pierre Ledoux, Carlos Evangelista, Irene F Kim, Maxim Tomashevsky, Kimberly A Marshall, Katherine H Phillippy, Patti M Sherman, Rolf N Muertter, Michelle Holko, Oluwabukunmi Ayanbule, Andrey Yefanov, Alexandra Soboleva (2011) NCBI GEO: archive for functional genomics data sets–10 years on. Nucleic Acids Research 39 (Database issue): D1005-10

Posting this part was kinda cheating because it was already written.  Next, the first part of the methods section…. 


  1. Definitely the way to go. My co-blogger Matt Wedel and I recently wrote a series of six blog-posts responding in detail to a recently-published paper that we felt was badly flawed. Then we glued them together and … I’d like to say that we just called the result a manuscript, but the truth is we did a lot more work after that point than we intended to. In the event, nearly all the actual text got rewritten. But it was still waaay faster than it would have been to do it straight from brain to manuscript. The blogged version turned out to be a shortcut.

    BTW, here is the first of those six posts; it contains links to all the other parts.

    Comment by Mike Taylor — July 2, 2012 @ 3:32 pm

    • Thanks, Mike! I appreciate the feedback. This method seems to have gotten me unstuck: so far so good. I’ll have a look at your posts later when I fall off my roll :)

      Comment by Heather Piwowar — July 2, 2012 @ 6:41 pm

