Does everyone else know this already? I didn’t.
Tracking data reuse is important. Tracking data reuse is currently difficult. But people are working on processes and policies to make it easier in the future. Surely we just need repositories to assign dataset DOIs (or similar unique identifiers) and investigators to start citing datasets as first-class entities using these unique identifiers, and then, poof, it will be as easy to track dataset citations as it currently is to track article citations.
The same tools will just work.
Or, naively, so I thought.
It turns out that our tools need to get on board. Right now, tracking dataset citations using common citation tracking tools doesn’t work. ISI Web of Science and Scopus? It looks like they strip DOIs and URLs out of citations as they import references into their databases, meaning that data citation identifiers fall through the cracks. In addition, ISI Web of Science only supports citation searches based on fields like author, journal, and volume. One cannot search across the whole reference field or by a DOI facet specifically. Argh.
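To make "falling through the cracks" concrete, here is a minimal Python sketch of what is lost. It is only an illustration: the DOI pattern is a common approximation rather than an official grammar, and both reference strings (including the `EXAMPLE` DOI suffix and the abbreviated "imported" form) are hypothetical, made up to show the idea rather than copied from any database.

```python
import re
from typing import List

# Approximate DOI pattern (an assumption, not an official grammar):
# "10.", a 4-9 digit registrant prefix, a slash, then a suffix with no whitespace.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def extract_dois(reference: str) -> List[str]:
    """Return the DOIs found in a free-text reference, minus trailing punctuation."""
    return [match.rstrip(".,;)") for match in DOI_PATTERN.findall(reference)]


# Hypothetical reference as it might appear in a paper's reference list.
as_published = "Doe, J. 2009. Example dataset. Data set. doi:10.3334/ORNLDAAC/EXAMPLE."

# Hypothetical version of the same reference after identifiers are stripped on import.
as_imported = "Doe J, 2009, EXAMPLE DATASET"

print(extract_dois(as_published))  # the DOI is recoverable
print(extract_dois(as_imported))   # nothing left to search on
```

Once the identifier is gone from the stored reference, no amount of clever searching on the database side can bring it back.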
In my experience, the best way to track dataset DOIs right now is through Google Scholar. This is useful, but Google Scholar lacks well-defined coverage and clean data export: it is not the robust solution we need.
What can we do? Does anyone know the right people at Scopus and ISI Web of Science to contact? Or is it that the publishers aren’t sending them the correct info? How does this process work? Is DataCite already addressing this, so that I should just be patient?
How can we make our voice heard by these tool makers?
Tracking unique identifiers in reference sections is important, and it needs to be supported ASAP.
To illustrate the issue, I attempted to track the datasets that have been assigned DOIs by the Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) repository (more info on ORNL DAAC DOIs).
In Google Scholar
To begin, I searched Google Scholar for 10.3334/ORNLDAAC/*. At the time of my initial data collection, this query returned 757 hits.
Most of these hits were to documents describing the creation of datasets. To focus on data reuse and attribution, I excluded hits from the ORNL DAAC website itself with the query 10.3334/ORNLDAAC/* -site:ornl.gov. This returned 42 hits.
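For anyone who wants to repeat this for another repository, the two queries and the site exclusion can be sketched in a few lines of Python. This does not call Google Scholar (I am not assuming any API exists for that); it only builds the query strings and applies the same exclusion locally to a list of hit URLs gathered by hand. The function names and the two sample URLs are hypothetical.

```python
from urllib.parse import urlparse

DOI_PREFIX_QUERY = "10.3334/ORNLDAAC/*"  # the query used in this post
EXCLUDED_SITE = "ornl.gov"               # the site excluded in this post


def build_query(base, excluded_site=None):
    """Build the search string, optionally adding a -site: exclusion."""
    if excluded_site:
        return f"{base} -site:{excluded_site}"
    return base


def is_from_site(url, site):
    """True if the URL's host is the site itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == site or host.endswith("." + site)


# Hypothetical hit URLs, standing in for results collected by hand.
hits = [
    "http://daac.ornl.gov/some/dataset/page",
    "http://journal.example.org/article/123",
]
external_hits = [url for url in hits if not is_from_site(url, EXCLUDED_SITE)]

print(build_query(DOI_PREFIX_QUERY))                 # first query
print(build_query(DOI_PREFIX_QUERY, EXCLUDED_SITE))  # second query
print(external_hits)                                 # hits outside ornl.gov
```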
Bravo Google Scholar!
Unfortunately, this doesn’t constitute a robust solution for tracking data reuse. The scope of Google Scholar is not clear, so we can’t tell which reuses exist but are missed by this search. It searches the full text rather than just the references section: sometimes useful, but not what we intended. Furthermore, there is no easy way to download a complete set of citation information from the results page.
Well-defined coverage and clean export are strengths of the formal citation databases, ISI Web of Science and Scopus.
In ISI Web of Science
In trying to find the citations to DAAC datasets from ISI Web of Science, we get stuck before we even get started. How to search? The ISI reference documents that I’ve found don’t mention any way to search based on a DOI, or indeed any way to search the whole reference text in an unfaceted way.
I tried a few tags in case they were synonyms for dataset fields, but I didn’t find anything. For fun, I went digging to see if the DOIs are in the WoS database, even if I can’t search for them. Nope: the DOI and other identifying information are stripped out. For example, a citation to a dataset by Randerson shows up in the references section of a recent paper with that identifying information missing.
In Scopus
Tracking dataset citations in Scopus seemed more promising, since Scopus does indeed support searching by DOI as well as by “All Fields”.
One hit! But only one.
What happened to all of the other hits???
As you may remember, Google Scholar found 42 hits. So I systematically went through each of the 42 hits to see whether it was a DOI mention within the references section of a paper indexed by Scopus and, if so, why the DOI was not found by the query. My quick investigation suggested that of the 42 Google Scholar hits, at least 14 were formal papers, indexed by Scopus, with a dataset DOI in the references section.
I investigated the “reference lists” of each of these 14 papers in Scopus (raw findings). The DOI did make it into the “references” Scopus view in at least 7 cases, but in all 14 cases the DOI was omitted from the formal citation view in the Scopus database.
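Put as a quick tally, using only the counts reported in this post (the 14 and the 7 are lower bounds, so the percentages are rough; the variable names are just mine):

```python
# Counts reported in this post. The 14 and 7 are "at least" figures.
scholar_hits_external = 42             # Google Scholar hits outside ornl.gov
scopus_papers_with_doi_reference = 14  # formal papers in Scopus citing a dataset DOI
doi_in_references_view = 7             # DOI visible in the Scopus "references" view
doi_in_citation_view = 0               # DOI present in the formal citation view


def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


print("Scholar hits that are Scopus-indexed papers citing a DOI (%):",
      share(scopus_papers_with_doi_reference, scholar_hits_external))
print("Of those, DOI survives in the references view (%):",
      share(doi_in_references_view, scopus_papers_with_doi_reference))
print("Of those, DOI survives in the citation view (%):",
      share(doi_in_citation_view, scopus_papers_with_doi_reference))
```

So roughly a third of the Google Scholar hits should have been findable in Scopus, about half of those kept the DOI in one view, and none kept it in the view that matters for citation searching.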
Interestingly, there is the one case shown above where a DOI search managed to recover the dataset. I don’t know why or how that citation is special. More research is needed. I’d love to add some screenshots to this section to show you what the reference list looks like and how it differs from the citation view in the Scopus database (or whatever those two views are called). Unfortunately, my institutional affiliation with a Scopus subscription has lapsed, so alas I can’t pursue it at the moment.
Anyway. That is more detail than anyone really wanted, and a bit quick and dirty at that. But there you have it. My conclusion: tracking dataset citations using the common citation tracking tools doesn’t work.
The tools need to support data citation tracking.
Thanks to DataONE summer intern Valerie Enriquez for her work this summer on these broad issues!