In the circles in which I run, there is a general consensus recommendation for data attribution upon data reuse: cite the dataset in the bibliography section of the paper that reuses the data (see Dryad recommendations, for example).
This solution is not perfect, but it is a pretty good recommendation in most cases.
There is less consensus on this question: How should investigators link to their archived datasets in the paper that initially describes the data collection? Inline, or in the bibliography? This is relevant in cases where data is deposited before publication.
Currently, data accession numbers are usually mentioned in full text somewhere in the body of the article. The location varies: sometimes it is in the methods section, sometimes in the results, and sometimes the journal has a specific “Availability” section.
Is this what we would recommend? Or should we recommend that the data collection investigators cite their datasets as first-class entities in the bibliography, mirroring the behaviour we suggest for investigators who later reuse the datasets?
Here are some of the advantages to the cultural norm of citing datasets in the bibliography of the *data collection* article:
- gets investigators (and therefore funders, policy makers, etc) more used to seeing citations to datasets, so they don’t think it is weird
- educates and gives models to readers for how to properly cite datasets they are going to reuse, so they are more likely to do it and do it properly when the time comes
- archives can train depositing investigators on how to do it in their instance with hands-on cut and paste text… the investigators will probably then be more likely to do it in the future themselves upon data reuse
- makes “here’s how to refer to your/any dataset” instructions a lot simpler
- every dataset gets at least one citation ;)
- and most importantly, it creates more explicit, unambiguous, best-practice links between datasets and papers. The links are in the bibliography, which is often in front of paywalls, is certainly indexed more than full text, and is where convention has it that links go. Better and consistent linked data and papers = win for everybody in many dimensions.
And here are some of the disadvantages:
- Different from how people usually reference GenBank, PDB, etc. data, so blazing new ground
- [I first wrote "No"; see Pensoft correction below! So instead I’ll say…] Few journals have standardized on this approach so far
- Adding the data archiving reference comes late in the lifecycle of paper publication. At this point, is it more difficult to add another reference to the draft than a sentence in the full text, especially for papers that have a maximum-number-of-citations rule?
- It makes citation context more ambiguous, since a reference could be for sharing or reuse. This really complicates my research-into-data-reuse-patterns life, but oh well! Bring on CiTO :)