May 5, 2011

Links from the data collection article: Inline or in the bibliography?

In the circles in which I run, there is a general consensus recommendation for data attribution upon data reuse:  cite the dataset in the bibliography section of the paper that reuses the data (see Dryad recommendations, for example).

This solution is not perfect, but it is a pretty good recommendation in most cases.

There is less consensus on this question:  How should investigators link to their archived datasets in the paper that initially describes the data collection?  Inline, or in the bibliography?  This is relevant in cases where data is deposited before publication.

Currently, data accession numbers are usually mentioned in full text somewhere in the body of the article.  The location in the article varies: sometimes it is in the methods section, sometimes in the results, and sometimes the journal has a specific “Availability” section.

Is this what we would recommend?  Or should we recommend that the data collection investigators cite their datasets as first-class entities in the bibliography, mirroring the behaviour we suggest for investigators who later reuse the datasets?

Here are some of the advantages to the cultural norm of citing datasets in the bibliography of the *data collection* article:

  • gets investigators (and therefore funders, policy makers, etc) more used to seeing citations to datasets, so they don’t think it is weird
  • educates and gives models to readers for how to properly cite datasets they are going to reuse, so they are more likely to do it and do it properly when the time comes
  • archives can train depositing investigators on how to do it in their instance with hands-on cut and paste text… the investigators will probably then be more likely to do it in the future themselves upon data reuse
  • makes “here’s how to refer to your/any dataset” instructions a lot simpler
  • every dataset gets at least one citation ;)
  • and most importantly, it creates more explicit, unambiguous, best practice links between datasets and papers.  The links are in the bibliography which is often in front of paywalls, is certainly indexed more than full text, and is where convention has it that links go.  Better and consistent linked data and papers = win for everybody in many dimensions.

Here are some disadvantages:
  • Different from how people usually reference Genbank, PDB, etc. data, so it means blazing new ground
  • No [see Pensoft correction below! so instead I’ll say…] Few journals have standardized on this approach so far
  • Adding the data archiving reference comes late in the lifecycle of paper publication.  At this point is it more difficult to add another reference to the draft than a sentence in full text, especially for papers that have a maximum-number-of-citations rule?
  • It makes citation context more ambiguous, since a reference could be for sharing or reuse.  This really complicates my research-into-data-reuse-patterns life, but oh well!  Bring on CiTO :)


  1. Hi Heather,

    Great post. I’m curious what to do about data citation given the problem that many journals limit the maximum number of citations. I heard from one journal chief editor of a case where the authors had obtained data from many sources on the condition that they cite it (or the source paper) in the bibliography (for the reasons you mention, chiefly to be indexed as a citation by Web of Science, etc.), only to learn from the journal that this exceeded the journal limits. The other authors refused to have the data cited only in a supplement (as per agreement, since that would be unindexed). In this case the editor granted an exception, but was curious what a more general solution should look like.

    Comment by Carl Boettiger — May 5, 2011 @ 1:33 pm

  2. Good question, Carl. I don’t know of any good general-purpose solution to this yet, and it is indeed a real problem. There are lots of possible solutions, but they all require compromises, and I doubt compromise will be easy to reach. OK, I guess I have had one idea… a “citation registry”, basically like url-shortening but instead citation-shortening. Similar to this description, but tweaked to be a bit more like how url-shortening works:
    Then all (hmmm) we have to do is get the citation databases to index it :)

    Comment by Heather Piwowar — May 5, 2011 @ 2:49 pm

  3. I do have a bit of evidence to quantify the magnitude of this problem in one area. I did a quick analysis of papers that reuse GEO gene expression data and found that 3% of articles that reuse data actually reuse >= 10 datasets. This is indeed a problem when publishing in a journal that only allows 30 references! For a graph of this preliminary analysis, see slide 33 here:
    I’ll try to blog it soon too….

    Comment by Heather Piwowar — May 5, 2011 @ 3:02 pm

  4. Great post. Similar to one of your items, Gary King argues that giving the data set a citation would be an extra incentive for the researcher to make it publicly available in the first place.

    Comment by Ricardo Pietrobon — May 5, 2011 @ 6:35 pm

  7. You say “No journals have standardized on this approach so far”. However, Pensoft Journals, which specializes in publishing biodiversity and biological systematics papers, recently published Data Publishing Policies and Guidelines for Biodiversity Data [1], which has a three-page section on how to cite data in Pensoft Journals.

    While allowing that citations of Genbank data and similar are by custom made by placing the accession number somewhere in the text, we make the following generic recommendation:

    “Data citations may relate either to the author’s own data, or to data created and published by others (“third-party data”). In the former case, the dataset may have been previously published, or may be published for the first time in association with the article that is now citing it. All these types of data should, for consistency, be cited in the same manner.

    “As is the norm when citing another research article, any citation of a data publication, including a citation of one’s own data, should always have two components:

    “• An in-text citation statement containing an in-text reference pointer that directs the reader to a formal data reference in the paper’s reference list.
    • A formal data reference within the article’s reference list.

    “The data reference in the article’s reference list should contain the minimal components recommended in the DataCite Metadata Kernel v2.0 specification. In DataCite terms: Creator PublicationYear Title Publisher Identifier; alternatively (but meaning the same thing): Author PublicationYear Title DataRepositoryName DOI. These components should be presented in whatever format and punctuation style the journal specifies for its references.
    The following example demonstrates in general terms what is required.

    “In-text citation:
    “This paper uses data from the [name] data repository at***** (Jones et al. 2008a), first described in Jones et al. 2008b. “

    “Data reference and article reference in reference list:
    “Jones A, Bloggs B, Smith C (2008a). Title of data package. Repository name. doi:*****.
    Jones A, Saul D, Smith C (2008b). Title of journal article. Journal Volume: Pages. doi:#####.” ”

    Pensoft also recommends that the in-text data citation statement should, in Pensoft journals, be included in the body of the paper, in a separate section named Data Resources, situated after the Material and Methods section.
    More details are given in the paper [1].
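
As a rough illustration of the recommendation quoted above, here is a minimal Python sketch that assembles a data reference from the five components named in the quote (Creator, PublicationYear, Title, Publisher, Identifier) and builds the matching in-text pointer. The class and function names, the `year_suffix` field, and the assumption that creators are written as "Surname Initials" are all hypothetical choices made for this example, and the punctuation simply copies the Jones et al. example rather than any particular journal's style. The `*****` identifier is the placeholder from the quoted example, not a real DOI.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataReference:
    """The five components named in the quoted recommendation (hypothetical class)."""
    creators: Tuple[str, ...]   # assumed to be written as "Surname Initials"
    publication_year: int
    title: str
    publisher: str              # i.e. the data repository name
    identifier: str             # the DOI; "*****" below is a placeholder
    year_suffix: str = ""       # "a"/"b" disambiguator, an assumption of this sketch


def format_data_reference(ref: DataReference) -> str:
    """Render a reference-list entry in the style of the quoted example."""
    if not ref.creators:
        raise ValueError("a data reference needs at least one creator")
    authors = ", ".join(ref.creators)
    return (
        f"{authors} ({ref.publication_year}{ref.year_suffix}). "
        f"{ref.title}. {ref.publisher}. doi:{ref.identifier}."
    )


def in_text_pointer(ref: DataReference) -> str:
    """Render the in-text pointer that directs the reader to the reference list."""
    if not ref.creators:
        raise ValueError("a data reference needs at least one creator")
    # Take the first word of each creator string as the surname.
    surnames = [creator.split()[0] for creator in ref.creators]
    if len(surnames) == 1:
        names = surnames[0]
    elif len(surnames) == 2:
        names = f"{surnames[0]} and {surnames[1]}"
    else:
        names = f"{surnames[0]} et al."
    return f"{names} {ref.publication_year}{ref.year_suffix}"


example = DataReference(
    creators=("Jones A", "Bloggs B", "Smith C"),
    publication_year=2008,
    title="Title of data package",
    publisher="Repository name",
    identifier="*****",
    year_suffix="a",
)
print(format_data_reference(example))
print(in_text_pointer(example))
```

Keeping the components as separate fields, rather than storing a finished string, is what would let each journal apply its own reference format and punctuation to the same underlying record.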

    Furthermore, Pensoft has reached an agreement for cooperation in data hosting and the development of data publishing workflows with GBIF, the Dryad Data Repository and the Consortium for the Barcode of Life.

    Clearly, these Pensoft data citation recommendations, which work fine for on-line journals without a numerical limit on the number of citations, would not be feasible in journals with a strict limit on the number of citations, which is why your emphasis on exploring alternative ways for data citation in such cases is important.

    [1] Citation: Penev L, Mietchen D, Chavan V, Hagedorn G, Remsen D, Smith V, Shotton D (2011). Pensoft Data Publishing Policies and Guidelines for Biodiversity Data. Pensoft Publishers,

    Comment by davidshotton — June 25, 2011 @ 12:14 am

  9. I absolutely agree that data citations belong in the bibliography. Data centers have learned to their dismay that lack of citation in the bibliography leads to lack of respect/appreciation for the data availability by funding agencies and others (e.g., see example of DSDP – Deep Sea Drilling Project). Trying to compile a listing of research based upon data usage retrospectively is a real challenge. It’s time to end the segregation of references based upon perceptions of ‘legitimacy’. Citations to utilized resources should be treated equally.

    Comment by Linda Musser — January 2, 2012 @ 12:27 pm

