Research Remix

December 7, 2011

a future where data attribution Counts #idcc11

Filed under: Uncategorized — Heather Piwowar @ 1:40 am

Below is the rough text of my #idcc11 (International Data Curation Conference) talk.  Slides are at slideshare, updated now to include (very tiny) text in speaker notes on each slide.  [Anyone know how to increase the font size of presenter notes and/or extract them into a text document from Keynote?]

A future where data attribution Counts.

Sharing data makes our shoulders broader.

This is a great story, right? And why we are all here.

But it is also a great illustration of the problem.

What exactly do broad shoulders get the individual researcher?


Nobody looks at the supporting structure of an impressive tower.  We are all busy ogling the top.  That means these people? These ones with the shoulders? They’ve got nothing.

Ok, maybe they have some citations.  But do we think the promise of citation is enough?


Don’t get me wrong, I’m a fan of studies that show a citation benefit for sharing data :) . But it won’t be enough.

If it were, we’d have researchers knocking down the doors of our IRs for the 10-minute job of sending in their preprints. They aren’t doing that. Because a few citations, as much as we’d like to think otherwise, aren’t enough to offset the Fear, Uncertainty, and Doubt that accompany the costs of uploading a dataset in the current culture.


What to do about it? How to change the culture?

We need to facilitate deep recognition of the labour of dataset creation.

Ok, let me say that again, because it is so important:

We need to facilitate deep recognition of the labour of dataset creation.

And while we are at it, we need to value the contributions of funders, the people who pay for all the gym equipment that helps us build those shoulders, and data repositories, whom we might view as personal trainers.

Let’s dig into how these groups do impact tracking now, and how they’d like to do it in the future.

Investigators, today, can list research products on a CV.

A CV is sort of bland, don’t you think? It has no context of use.

One version of a more useful future comes from a tool called total-impact.  Continuing a project that started as a hackathon at Beyond Impact, an Open Society Foundation workshop organized by Cameron Neylon here in the UK last spring, Jason Priem, I, and a few other people have been working on total-impact.

It aggregates metrics for papers and also for non-traditional research products, like datasets. The metrics include citations, but also altmetrics, or article-level metrics: various indications that others have found your research worth bookmarking, or blogging, or referencing on Wikipedia. It doesn’t currently look for dataset identifiers in public R packages, but it could, for example, as an indication of use.
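The aggregation idea can be sketched in a few lines of Python. To be clear, this is a hypothetical illustration, not total-impact’s actual code or API: `fetch_metrics` is a stand-in for real calls to metrics providers, and the identifiers and counts are made up.

```python
# Minimal sketch of metric aggregation for a "live CV":
# take a list of research-output identifiers (DOIs here) and collect
# per-source counts into one record per artifact.
from collections import defaultdict

def fetch_metrics(identifier, source):
    """Stand-in for a call to a real metrics provider
    (e.g. a bookmarking site or Wikipedia). Returns a count."""
    fake_data = {
        ("10.5061/dryad.example", "bookmarks"): 12,
        ("10.5061/dryad.example", "wikipedia_mentions"): 2,
        ("10.1234/paper.example", "bookmarks"): 40,
    }
    return fake_data.get((identifier, source), 0)

def aggregate(identifiers, sources):
    """Build a report of the form {identifier: {source: count}}."""
    report = defaultdict(dict)
    for ident in identifiers:
        for source in sources:
            report[ident][source] = fetch_metrics(ident, source)
    return dict(report)

report = aggregate(
    ["10.5061/dryad.example", "10.1234/paper.example"],
    ["bookmarks", "wikipedia_mentions"],
)
print(report)
```

The point is only that the output is one structured record per research product, mixing sources, which is what gives a CV its post-publication context.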

This makes a “live CV,” if you will, giving post-publication context to research output.  (It could also work as a CV for a department, a grant, or a grant portfolio….)

To do this really well we need to be able to list all metrics. Right now, many are unavailable for this sort of mashup due to licensing terms, including citations identified by Google Scholar, Thomson, and Scopus.

Repositories, today, can look at graphs of their deposit counts.

Many know their own download statistics, and some share them with their authors or the public.

As a result of intensive manual digging, some have metrics on how many times their datasets have been mentioned in the literature. I’ll splash by a few graphs of preliminary research findings…. come find me or my blog if you want more info. We are starting to be able to estimate third-party reuse. Tools that support data citation will help this.
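One part of that manual digging can be automated: scanning full text for dataset identifiers and tallying mentions per dataset. The sketch below is an assumption-laden toy, using a GEO-series-style accession pattern and a two-paper corpus I made up; real reuse tracking needs much more care (identifier synonyms, citation context, deduplication).

```python
# Toy reuse-mention counter: scan full-text strings for dataset
# accessions (GEO-series-style "GSE" numbers here) and count how
# many papers mention each dataset.
import re
from collections import Counter

ACCESSION = re.compile(r"\bGSE\d{3,6}\b")  # e.g. GSE1133

def count_dataset_mentions(papers):
    """papers: iterable of full-text strings.
    Returns a Counter of accession -> number of papers mentioning it."""
    mentions = Counter()
    for text in papers:
        # count each dataset at most once per paper, like a citation
        for accession in set(ACCESSION.findall(text)):
            mentions[accession] += 1
    return mentions

corpus = [
    "We reanalyzed expression data from GSE1133 and GSE2109.",
    "Samples from GSE1133 were used to validate the method.",
]
print(count_dataset_mentions(corpus))  # GSE1133 in 2 papers, GSE2109 in 1
```

Crude as it is, this per-paper counting is the shape of the metric repositories dig out by hand today.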

This is all a nice start.

What repositories really want, though (correct me if I’m wrong) is to show that they are indispensable. That they generate new, profound science not otherwise possible. That they are a great financial investment in scientific progress.  This requires knowing more than just a citation count: it requires knowing the context of reuse. This means we need access to the full text of the papers that cite the data.

What about funders?

They want to know the impact the data had on society.  Did it facilitate innovation, reduce discrimination, create jobs, save the rainforest, increase our GDP?

That kind of tracking is beyond what I know how to do :)

We’re going to need digital tracking technology that, as far as I know, isn’t available yet, though I’m sure people are working on it. Google Analytics meets digital RFID tags…. I dunno… but I do know we need it.  Furthermore, we need these digital tracking mechanisms to be affordable and open, to facilitate mashups.

Ok, so with that sort of future vision for tracking, what do we as a scholarly ecosystem need to power this future world?

We need innovation and experimentation.

We need 1000 flowers blooming.

We need solutions that are open and generative.

I don’t have all the answers, but here is part of it:

  • open access to citation data. We can’t just rely on Scopus, Thomson, and Google Scholar. Those are only three players. They are good at what they do and have been invaluable, but they can’t possibly be as nimble as a whole bunch of startups. It is taking them a long time to come out with a data tracking tool. Why? Probably because they have an ambitious vision and need time to fit it into their other product offerings. Some of the rest of us would be happy iterating on a quick-and-dirty solution. We need more competition in this space. The barrier to entry is extraordinarily high because of course reference lists are almost all behind copyright and paywalls…. but open access gives us a toehold.
  • open access to full text. Open access also gives us a toehold into citation context information. A citation to a dataset tells us that the dataset played some role in that new research paper. What role? Was it used to validate a new method? Detect errors? Was it combined with other datasets to solve a problem that was otherwise intractable? The answers to these questions are fundamental to what funders and others need to know about impact. It won’t be easy to derive them from the text of the paper, but I strongly believe it is possible.  We need this to be true open access (usable by anyone, for any purpose): none of this Non-Commercial nonsense… we must allow startups to use this information if we are going to get the innovation we need.
  • open access to other metrics of use. We need broad-based metrics: not just citations, but blog posts about data, slides, tutorials that include R data, bookmarks to data on bookmarking sites. Altmetrics.  If you run a data repository, make your download stats publicly available. We frankly don’t know what all of this info means yet, but we didn’t know what citations to papers meant 50 years ago either. We’ll all figure it out; the more data the better.

Here’s what each of us needs to do:

The future where data attribution Counts.

The future is about what kind of impact a dataset makes,
not just a citation number.

The future is open.

Open data.

Open data about our data.
