Research Remix

August 17, 2011

Draft boilerplate text for journal policy on data citation

Filed under: Uncategorized — Heather Piwowar @ 9:19 am

A journal editor has asked me to supply boilerplate text on How To Cite Data, for adoption by his board later this week and subsequent integration into their Instructions for Authors and copyeditor guidelines.

Very few journals currently have such policies.  As part of the DataONE summer internship in 2010, Nic Weber reviewed the policies of 307 journals in Environmental Science and discovered that only 16 had instructions on how to cite data (description, raw data).  Of those with guidelines, very few recommended practices that translate to easily-trackable recognition for the data collecting investigators (= citing data in the references section, where it can be indexed by ISI Web of Science and Scopus).

Style guides often make it their business to lead best practices in citation, but there is still a lack of coverage and consensus as demonstrated by Newton, Mooney, and Witt in a 2010 IDCC poster.

Several libraries have stepped up to educate authors about how to cite data responsibly (e.g. Purdue, Michigan State, U of Virginia, Australian National Data Service, U of Cambridge, Data-PASS, International Polar Year).  This is very useful.  That said, journals have the luxury of being more prescriptive.  What is the simplest thing you would recommend if you were in charge?

Several initiatives are underway to gather community consensus (Data Citation Principles workshop at Harvard, CODATA citation workshop, etc).  Consensus, however, takes a while.  Can we come up with something reasonable for now, as a first pass, so that this enthusiastic journal and others like it can take a first step towards clarity for authors, reviewers, and copyeditors?  It is in our best interest to encourage recognition for data reuse As Soon As Possible, even if the detailed recommendations change slightly over the next few years.

Here is a quick first-pass draft.  Suggestions?  (Hat tip to Pensoft for one of the examples.  They have a very thorough data citation policy!)  (What do we want to suggest for Genbank, anyway?  The suggestion below is certainly a large change from the current community norm.)

Formal citations to data are important to recognize the contribution of investigators who collect data.  Data used in studies published by this journal must be attributed through mention in the study full text (usually in the methods section) and a citation in the references section.  Citations in supplementary information are not sufficient because they are not indexed and do not facilitate recognition.

The citation to the dataset should take a form similar to citations to journal articles, with the addition of a unique dataset identifier (accession number, DOI, etc) and version or access date if needed to uniquely determine the resource.  Publications that describe the data collection should also be cited when appropriate.  Here are some examples of appropriate attribution for data reuse:

Example 1: “We used data previously collected by Hill and Otto [23, 24]”

23.  Hill JA, Otto SP (2007) The role of pleiotropy in the maintenance of sex in yeast. Genetics 175: 1419-1427. doi:10.1534/genetics.106.059444

24.  Hill JA, Otto SP (2007) Data from: The role of pleiotropy in the maintenance of sex in yeast. Dryad Digital Repository. doi:10.5061/dryad.18

Example 2: “This paper uses data from the [name] data repository at http://dx.doi.org/***** (Jones et al. 2008a), first described in Jones et al. 2008b.”

Jones A, Bloggs B, Smith C (2008a). Title of data package. Repository name. doi:*****.
Jones A, Saul D, Smith C (2008b). Title of journal article. Journal Volume: Pages. doi:###.

Example 3: “We downloaded the following sequences from Genbank: AJ428578 [22], NC_004029 [23], X72004 [24].”

22. Arnason U. Eumetopias jubatus complete mitochondrial genome. Genbank. AJ428578.2

23. Arnason U. Odobenus rosmarus rosmarus mitochondrion, complete genome. Genbank. NC_004029.2

24. Arnason U. Halichoerus grypus complete mitochondrial genome. Genbank. X72004.1
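
For anyone wiring this format into a submission checker or reference manager, here is a minimal Python sketch of the dataset reference pattern used in Example 1 (authors, year, title, repository, identifier, plus an optional version or access date). The `DatasetCitation` class and `format_citation` function are hypothetical names made up for illustration, not part of any existing tool, and the punctuation is only one reading of the draft above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetCitation:
    """Hypothetical container for the fields named in the draft policy."""
    authors: List[str]
    year: int
    title: str
    repository: str
    identifier: str  # DOI, accession number, etc.
    version_or_accessed: Optional[str] = None  # only if needed to pin down the resource


def format_citation(c: DatasetCitation) -> str:
    """Render a reference in the style of entry 24 in Example 1 (assumed layout)."""
    ref = f"{', '.join(c.authors)} ({c.year}) {c.title}. {c.repository}. {c.identifier}"
    if c.version_or_accessed:
        # Append the version or access date after the identifier.
        ref += f". {c.version_or_accessed}"
    return ref


dryad = DatasetCitation(
    authors=["Hill JA", "Otto SP"],
    year=2007,
    title="Data from: The role of pleiotropy in the maintenance of sex in yeast",
    repository="Dryad Digital Repository",
    identifier="doi:10.5061/dryad.18",
)
print(format_citation(dryad))
```

Run as written, this prints the text of reference 24 from Example 1 (without the list number). The Genbank entries in Example 3 have no year, so they would need a variant of this layout.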

Are there any journal editors (or conference organizers, or dissertation-instruction-manual-writers) out there who plan to include this or something like it in their policies?  If so, I’d love to hear from you.

Note: the following issues are not in the scope of the text discussed here (but are clearly also worthy of well-considered journal policies):

  • recommendations or requirements on data archiving itself
  • guidelines on how to cite/link to data within the paper that collects the data (currently often handled by a sentence like “The data behind this study are available at…”, though it could/should instead be handled through a formal data citation)

Edited to add a few more links

17 Comments

  1. Did I miss your favourite library resource on data citation? If so, add it here!

    Comment by Heather Piwowar — August 17, 2011 @ 9:46 am

  2. Probably a naive question, but how do you cite data from more established collections where there are multiple “authors”? Is there a sensible equivalent to “editors” on dataset citations?

    I’m thinking of things stretching from e.g. Uniprot (which has its own citation guidelines based on citing a publication: http://www.uniprot.org/help/publications) to the many smaller datasets which have coordinators collating contributions.

    Comment by Neil P Chue Hong (@npch) — August 17, 2011 @ 9:56 am

  3. It is a great question. I don’t think there is consensus, nor am I familiar with a straw-man suggestion.

    Similarly, it isn’t clear who the authors should be when citing a Genbank sequence. Surely not just the original submitters, when others have made substantial updates and corrections. For example, the provenance for this sequence looks pretty complicated: http://www.ncbi.nlm.nih.gov/nuccore/NM_053055.4. How would a data-reuser know which authors to credit?

    I can see why people just use the accession number in these cases!

    (though I think for many types of data the attribution is not so complicated. My goal is to work towards attribution for these easy cases NOW, while we continue trying to figure out how to address the complicated cases. Is that currently reflected well enough in the policy text above? If not, how to improve it?)

    Comment by Heather Piwowar — August 17, 2011 @ 10:17 am

  4. I just realized one thing this draft is missing: instructing authors to honour the citation guidelines of the data repositories, when they exist.

    Comment by Heather Piwowar — August 17, 2011 @ 10:24 am

  5. One of the really tough things here is that the standards have been so out-dated. ISO 690-2:1997 (Information and documentation – Bibliographic references – Part 2: Electronic documents or parts thereof) was issued in 1997, when the Internet had barely got started. Z39.29 is barely better, with an emphasis still on citing CDs and such-like. For years I used to recommend the NLM guide (Patrias, K. (2001). National Library of Medicine Recommended Formats for Bibliographic Citation Supplement: Internet Formats. NLM. Retrieved from http://www.nlm.nih.gov/pubs/formats/internet2001.pdf), but I think that’s been changed to be much less useful and more generic. Both of the latter are (or were) openly accessible.

    I understand that ISO 690 has been released in a new version in 2010, but I no longer have access to ISO standards. Your library may have access (ours was hidden away in some library database, in our case British Standards Online I think).

    Comment by Chris Rusbridge — August 17, 2011 @ 10:43 am

  6. From the NLM guide (the best of the more accessible ones): you’re looking at section 3, and adapt as appropriate…

    Comment by Chris Rusbridge — August 17, 2011 @ 11:44 am

  7. I think a guy from Statistics Canada, Gaetan Drolet, was involved in the ISO-690 re-development; see for example https://ospace.scholarsportal.info/handle/1873/226. He might respond to an email? (Used to be at mailto:gaetan.drolet@arul.ulaval.ca)

    Comment by Chris Rusbridge — August 17, 2011 @ 12:36 pm

  8. Regarding the unique dataset identifier, I would suggest recommending a HTTP URI notation rather than the accession number or DOI notation. By which I mean listing http://dx.doi.org/… rather than doi:… (and hyperlinking the latter with http://dx.doi.org/…). This (a) is consistent with new CrossRef recommendations in this regard, (b) yields a uniform identifier notation irrespective of the underlying worldview of assigners of identifiers regarding the need for special-purpose “persistent identifiers” versus just (stable/cool) HTTP URIs, and (c) emphasizes the fact that – in the end – it is all about access to the data, not just about knowing its identifier.

    Comment by Herbert Van de Sompel — August 17, 2011 @ 12:56 pm

  9. […] are in the way?  We need policies to get authors to include data citations in reference lists (feedback wanted on this draft policy).  But how feasible is this, really?  What about cases when authors use many, many datasets, […]

    Pingback by Citations in Supplementary Material can be indexed! « Research Remix — August 17, 2011 @ 1:52 pm

  10. I would strongly advocate for including a dataset version and a download date in the citations. As I understand it, DOI is working on including versioning structures, but the current implementations would probably just point to the latest version of the dataset. Insert commentary here about concreteness of the reference and reproducibility, etc. etc.

    Comment by L. Wynholds — August 17, 2011 @ 3:18 pm

  11. At F1000, we are planning to include instructions on how best to cite data when we launch our new OA journal that will include specific articles associated with datasets, and so this is very helpful indeed, Heather. Has anyone given any thought to how to provide additional credit to data authors in subsequent papers that are based in full or in part on someone else’s dataset, i.e. beyond a simple citation and more to some level of data co-authorship of the new paper so their contributions can be more formally recognised? It might provide a significant incentive to get authors to provide access to their data if they know they will get some level of authorship credit on work that comes out of it, but how this works in practice needs some thought.

    Comment by Rebecca Lawrence — August 18, 2011 @ 2:22 am

  12. This is great, Heather, very helpful and concise. I would suggest using a URL in at least one of your examples, perhaps with a short note indicating to use a better identifier (e.g., DOI) when one is available. Unfortunately this is a necessary evil because many data can only be identified and dereferenced by URL.

    I’ll add Purdue’s to the list of resource guides on data citation: http://guides.lib.purdue.edu/datacitation, since you asked. Thanks!

    Comment by Michael Witt — August 21, 2011 @ 5:55 pm

  13. Maybe you’ve already heard this at the Harvard workshop, but this is what the effort looks like in astronomy: http://vo.ads.harvard.edu/dv/

    Comment by Edwin Henneken — August 23, 2011 @ 3:46 am

  14. Thanks for this post, Heather. It’s a worthwhile initiative and I’ll be following with interest. At BioMed Central, journals that have implemented the new ‘Availability of supporting data’ article section (http://blogs.openaccesscentral.com/blogs/bmcblog/entry/availability_of_supporting_data_crediting) now provide more information about citing and linking to data. For example:

    http://www.biomedcentral.com/bmcresnotes/ifora/#availability_of_supporting_data
    “Availability of supporting data

    BMC Research Notes encourages authors to deposit the data set(s) supporting the results reported in submitted manuscripts in a publicly-accessible data repository, when it is not possible to publish them as additional files. This section should only be included when supporting data are available and must include the name of the repository and the permanent identifier or accession number and/or persistent hyperlink(s) for the data set(s). The following format is required:

    “The data set(s) supporting the results of this article is(are) available in the [repository name] repository, [unique persistent identifier/link for dataset(s)].”

    Where all supporting data are included in the article or additional files the following format is required:

    “The data set(s) supporting the results of this article is(are) included within the article (and its additional file(s))”

    We also recommend that the data set(s) be cited, where appropriate in the manuscript, and included in the reference list.

    A list of available scientific research data repositories can be found here. A list of all BioMed Central journals that require or encourage this section to be included in research articles can be found here.”

    So citation is recommended if you’re linking to your data. However, I agree that more explicit information about data citation could be useful for general reference/style guides (e.g. http://www.biomedcentral.com/bmcresnotes/ifora/#references). My view is that the 2-paragraph preamble to the examples you propose may not be necessary for all journals, especially where they already have data policies (such as Research Notes). Instead we might want to modify the text:

    “Only articles and abstracts that have been published or are in press, or are available through public e-print/preprint servers, may be cited”

    to

    “Only articles, datasets and abstracts that have been published or are in press, or are available through public e-print/preprint servers, may be cited”

    and then add one or two additional examples to the list provided on http://www.biomedcentral.com/bmcresnotes/ifora/#references, specifically for datasets

    Hill JA, Otto SP: Data from: The role of pleiotropy in the maintenance of sex in yeast. Dryad Digital Repository 2007, doi:10.5061/dryad.18

    Would probably be more fitting with house style. The challenge for a broad-scope journal is being inclusive enough of all relevant repositories, as very many repositories and identifiers (handles, DOIs, accession numbers etc) could be relevant to a general biology and medical journal. This will need further discussion and I’ll certainly raise with our editors at their next internal policy forum.

    Comment by Iain Hrynaszkiewicz — August 23, 2011 @ 8:40 am

    • Hi Iain, note that DataCite and CrossRef recommend DOIs are displayed as permanent URLs (as mentioned by Herbert above), so the Hill citation would be better displayed as: Hill JA, Otto SP: Data from: The role of pleiotropy in the maintenance of sex in yeast. Dryad Digital Repository 2007, http://dx.doi.org/10.5061/dryad.18

      Comment by tomjpollard — September 10, 2011 @ 3:21 pm

  15. Good point — yes it’s intended for dataset identifiers/citations at BMC to be functional links (even if not in my example above). Thanks

    Comment by Iain Hrynaszkiewicz — September 11, 2011 @ 2:02 am

  16. Another resource:
    http://www.esds.ac.uk/international/access/citing.asp

    Feel free to add more in a comment!

    Comment by Heather Piwowar — November 1, 2011 @ 7:16 am

