Research Remix

April 19, 2012

Care about data citation? Then you care about text-mining access.

Filed under: Uncategorized — Heather Piwowar @ 8:14 am

Text-mining access sounds like a pretty niche need.  It isn’t.  One of the use cases is near and dear to readers of this blog.

You know all of our excitement about research data?  Making it available, making it citable, rewarding the data-producing investigators?  It only works if we can indeed identify the research that is built on a dataset and then include the reuse stats in CVs and reports and webpages.  What tools can we use for this today?

Google Scholar: nope.  They’ve said that if they support tracking datasets today it is by accident and they plan to remove such support in the future.  Furthermore, Google Scholar offers no API access to its data (and has said it will not offer such access for years, if ever), so any numbers you calculate with it today can only be viewed on the Google website or through manual copy-and-paste into reports.

Web of Science and Scopus:  nope.  They still don’t support tracking dataset identifiers.   I believe they’ve been working on it — for at least the last two years.  Apparently not much of a priority.  Even when it does work, the terms of use on their API access forbid open redistribution of the statistics, so no great reports and no custom stats on your data repository pages.

total-impact and a few altmetrics tools are actively building support for tracking dataset reuse.

You know what these altmetrics tools, or any new tool that we hope will solve this problem, need to be successful?  Programmatic access to the literature.  They need to be able to search across all papers for dataset identifiers and find them in full text and reference lists.  Furthermore, ideally we need to be able to do more advanced text analysis to tell whether an ID is mentioned because the paper is talking about having *gotten* the data from somewhere as opposed to having *put* it somewhere.
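To make that concrete, here is a rough sketch of what such a tool might do once it has the full text of a paper in hand.  Everything in it is an assumption for illustration: the two identifier patterns (a Dryad-style DOI and a GEO-style accession), the cue phrases, the `find_dataset_mentions` name, and the 80-character context window are all made up for this example, and a real tool would need far more careful text analysis than keyword matching.

```python
import re

# Hypothetical identifier patterns, for illustration only.
# Real repositories use many more identifier formats than these two.
ID_PATTERNS = {
    "dryad_doi": re.compile(r"\b10\.5061/dryad\.[a-z0-9]+", re.IGNORECASE),
    "geo_accession": re.compile(r"\bGSE\d+\b"),
}

# Crude cue phrases (assumed): does the nearby text sound like the authors
# *put* the data somewhere, or *got* it from somewhere?
DEPOSIT_CUES = ("deposited", "submitted to", "archived", "made available")
REUSE_CUES = ("obtained from", "downloaded from", "retrieved from", "reanalyz")


def find_dataset_mentions(text, window=80):
    """Return (kind, identifier, label) for each dataset ID found in text.

    label is "deposit", "reuse", or "unknown", based only on cue phrases
    within `window` characters on either side of the identifier.
    Deposit cues win if both kinds of cue appear in the same window.
    """
    mentions = []
    for kind, pattern in ID_PATTERNS.items():
        for match in pattern.finditer(text):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            context = text[start:end].lower()
            if any(cue in context for cue in DEPOSIT_CUES):
                label = "deposit"
            elif any(cue in context for cue in REUSE_CUES):
                label = "reuse"
            else:
                label = "unknown"
            mentions.append((kind, match.group(0), label))
    return mentions


if __name__ == "__main__":
    sample = "Expression data were downloaded from GEO (accession GSE12345)."
    for mention in find_dataset_mentions(sample):
        print(mention)
```

The code is the easy part.  It only works on text you are allowed to fetch and process in bulk, which is exactly the access most of the literature does not give us.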

Anyway.  This?  This is text mining.  This is just one of the reasons we all need text-mining access, we need it to everything, and we need it now.

Get the ball rolling at your institution too.

This article is translated to Serbo-Croatian by Jovana Milutinovich from


  1. About a month ago, Google Scholar removed all references to Dryad data. Pangaea data is also gone. I’m not aware of any data that is still in their index.

    Comment by Ryan Scherle — April 19, 2012 @ 9:13 am

  2. It’s probably linked from this blog already, but for readers interested in this area, may I recommend JISC’s value and benefits of text mining report

    Comment by ambrouk — May 7, 2012 @ 10:05 am

  3. […] tracked and admired and used.  Who is indexing data citations right now?  As far as I can tell: absolutely no one.  Worse yet: who CAN start this innovation, and index data citations in a big way?  No one, […]

    Pingback by Dear research data advocate, please sign the petition #OAMonday « Research Remix — May 29, 2012 @ 10:52 am

  4. […] This announcement good news, because at the moment none of the major citation-tracking tools track datasets. […]

    Pingback by Thomson Reuters announces subscription-based data citation tracking tool « Research Remix — June 28, 2012 @ 7:33 am
