A journal editor has asked me to supply boilerplate text on How To Cite Data, for adoption by his board later this week and subsequent integration into their Instructions for Authors and copyeditor guidelines.
Very few journals currently have such policies. As part of the DataONE summer internship in 2010, Nic Weber reviewed the policies of 307 journals in Environmental Science and discovered that only 16 had instructions on how to cite data (description, raw data). Of those with guidelines, very few recommended practices that translate to easily trackable recognition for the data-collecting investigators (that is, citing data in the references section, where it can be indexed by ISI Web of Science and Scopus).
Style guides often make it their business to lead best practices in citation, but there is still a lack of coverage and consensus, as demonstrated by Newton, Mooney, and Witt in a 2010 IDCC poster.
Several libraries have stepped up to educate authors about how to cite data responsibly (e.g. Purdue, Michigan State, U of Virginia, Australian National Data Service, U of Cambridge, Data-PASS, International Polar Year). This is very useful. That said, journals have the luxury of being more prescriptive. What is the simplest thing you would recommend if you were in charge?
Several initiatives are underway to gather community consensus (Data Citation Principles workshop at Harvard, CODATA citation workshop, etc). Consensus, however, takes a while. Can we come up with something reasonable for now, as a first pass, so that this enthusiastic journal and others like it can take a first step towards clarity for authors, reviewers, and copyeditors? It is in our best interest to encourage recognition for data reuse As Soon As Possible, even if the detailed recommendations change slightly over the next few years.
Here is a quick first-pass draft. Suggestions? (Hat tip to Pensoft for one of the examples. They have a very thorough data citation policy!) (What do we want to suggest for Genbank, anyway? The suggestion below sure is a large change from the current community norm.)
Formal citations to data are important to recognize the contribution of investigators who collect data. Data used in studies published by this journal must be attributed through mention in the study full text (usually in the methods section) and a citation in the references section. Citations in supplementary information are not sufficient because they are not indexed and do not facilitate recognition.
The citation to the dataset should take a form similar to citations to journal articles, with the addition of a unique dataset identifier (accession number, DOI, etc.) and a version or access date if needed to uniquely determine the resource. Articles that describe the data collection should also be cited when appropriate. Here are some examples of appropriate attribution for data reuse:
Example 1: “We used data previously collected by Hill and Otto [23, 24]”
23. Hill JA, Otto SP (2007) The role of pleiotropy in the maintenance of sex in yeast. Genetics 175: 1419-1427. doi:10.1534/genetics.106.059444
24. Hill JA, Otto SP (2007) Data from: The role of pleiotropy in the maintenance of sex in yeast. Dryad Digital Repository. doi:10.5061/dryad.18
Example 2: “This paper uses data from the [name] data repository at http://dx.doi.org/***** (Jones et al. 2008a), first described in Jones et al. 2008b.”
Jones A, Bloggs B, Smith C (2008a). Title of data package. Repository name. doi:*****.
Jones A, Saul D, Smith C (2008b). Title of journal article. Journal Volume: Pages. doi:###.
Example 3: “We downloaded the following sequences from Genbank: AJ428578, NC_004029, X72004 [22, 23, 24].”
22. Arnason U. Eumetopias jubatus complete mitochondrial genome. Genbank. AJ428578.2
23. Arnason U. Odobenus rosmarus rosmarus mitochondrion, complete genome. Genbank. NC_004029.2
24. Arnason U. Halichoerus grypus complete mitochondrial genome. Genbank. X72004.1
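For copyeditors or repositories that want to generate reference entries in this shape, here is a minimal Python sketch of the pattern used in the examples above: authors, optional year, title, repository name, a unique identifier, and a version or access date only when one is needed. The `DataCitation` class, its field names, and `format_data_citation` are hypothetical names made up for illustration; they are not part of any journal's policy or any existing tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    """Hypothetical container for the fields the draft policy asks for."""
    authors: str                       # e.g. "Hill JA, Otto SP"
    title: str                         # title of the dataset or data package
    repository: str                    # e.g. "Dryad Digital Repository"
    identifier: str                    # DOI, accession number, etc.
    year: Optional[int] = None         # omitted when the record has no clear year
    version_or_access_date: Optional[str] = None  # only if needed to pin down the resource


def format_data_citation(c: DataCitation) -> str:
    """Render one reference-list entry in the style of the examples above."""
    # Authors are followed by "(year)" when a year is known, otherwise by a period.
    head = f"{c.authors} ({c.year})" if c.year is not None else f"{c.authors}."
    text = f"{head} {c.title}. {c.repository}. {c.identifier}"
    # Append the version or access date only when one was supplied.
    if c.version_or_access_date:
        text += f" ({c.version_or_access_date})"
    return text


dryad = DataCitation(
    authors="Hill JA, Otto SP",
    year=2007,
    title="Data from: The role of pleiotropy in the maintenance of sex in yeast",
    repository="Dryad Digital Repository",
    identifier="doi:10.5061/dryad.18",
)
print(format_data_citation(dryad))
```

Under these assumptions the Dryad record reproduces reference 24 of Example 1. The exact punctuation is a choice made for this sketch, and a journal would adapt it to its own reference style.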
Are there any journal editors (or conference organizers, or dissertation-instruction-manual-writers) out there who plan to include this or something like it in their policies? If so, I’d love to hear from you.
Note: the following issues are not in the scope of the text discussed here (but are clearly also worthy of well-considered journal policies):
- recommendations or requirements on data archiving itself
- guidelines on how to cite/link to data within the paper that collects the data (currently often handled by a sentence like “The data behind this study are available at…”, though it could/should instead be handled through a formal data citation)
Edited to add a few more links