Research Remix

December 7, 2011

a future where data attribution Counts #idcc11

Filed under: Uncategorized — Heather Piwowar @ 1:40 am

Below is the rough text of my #idcc11 (International Digital Curation Conference) talk.  Slides are at slideshare, now updated to include (very tiny) text in the speaker notes on each slide.  [Anyone know how to increase the font size of presenter notes and/or extract them into a text document from Keynote?]

A future where data attribution Counts.

Sharing data makes our shoulders broader.

This is a great story, right? And why we are all here.

But it is also a great illustration of the problem.

What exactly do broad shoulders get the individual researcher?


Nobody looks at the supporting structure of an impressive tower.  We are all busy ogling the top.  That means these people? These ones with the shoulders? They’ve got nothing.

Ok, maybe they have some citations.  But do we think the promise of citation is enough?


Don’t get me wrong, I’m a fan of studies that show a citation benefit for sharing data :) . But it won’t be enough.

If it were, we’d have researchers knocking down the doors of our IR for the 10 minute job of sending in their preprints. They aren’t doing that. Because a few citations, as much as we’d like to think otherwise, aren’t enough to offset the Fear Uncertainty and Doubt that accompanies the costs of uploading a dataset in the current culture.


What to do about it? How to change the culture?

We need to facilitate deep recognition of the labour of dataset creation.

Ok, let me say that again because it is so important.

We need to facilitate deep recognition of the labour of dataset creation.

And while we are at it, we need to value the contributions of funders, the people who pay for all the gym equipment to help us build the shoulders, and data repositories, who we might like to view as perhaps personal trainers.

Let’s dig into how these groups do impact tracking now, and how they’d like to do it in the future.

Investigators, today, can list research products on a CV.

A CV is sort of bland, don’t you think? It has no context of use.

One version of a more useful future comes from a tool called total-Impact.  It continues a project that started as a hackathon at Beyond Impact, the Open Society Foundations workshop organized by Cameron Neylon here in the UK last spring; Jason Priem, a few other people, and I have been working on it since.

It aggregates metrics for papers and also for non-traditional research products, like datasets. The metrics are citations, but also altmetrics, or article-level citations…. various indications that others have found your research worth bookmarking, or blogging, or referencing on Wikipedia. It doesn’t currently look for dataset identifiers in public R packages, but it could, for example, as indication of use.

This makes a “live CV” if you will, giving post-publication context to research output.  (Could also be applicable as a CV for a department, or a grant, or a grant portfolio….)

To do this really well we need to be able to list all metrics. Right now, many are unavailable for this sort of mashup due to licensing terms. This includes citations identified by Google Scholar, Thomson, and Scopus.

Repositories, today, can look at graphs of their deposit counts.

Many know their own download statistics; some share these with their authors or the public.

As a result of intensive manual digging, some have metrics about how many times their datasets have been mentioned in the literature. I’ll splash by a few graphs of preliminary research findings…. come find me or my blog if you want more info. We are starting to be able to estimate third party reuse. Tools that support data citation will help this.

This is all a nice start.

What repositories really want, though (correct me if I’m wrong), is to show that they are indispensable. That they generate new, profound science not otherwise possible. That they are a great financial investment in scientific progress.  This requires knowing more than just a citation count; it requires knowing the context of reuse. This means we need access to the full text of the paper that cites the data.

What about funders?

They want to know the impact the data had on society.  Did it facilitate innovation, reduce discrimination, create jobs, save the rainforest, increase our GDP?

That kind of tracking is beyond what I know how to do :)

We’re going to need digital tracking technology that as far as I know isn’t available yet but I’m sure people are working on. Google analytics meets digital RF-ID tags…. I dunno… but I do know we need it.  Furthermore, we need these digital tracking mechanisms to be affordable and open, to facilitate mashups.

Ok, so with that sort of future vision for tracking, what do we as a scholarly ecosystem need to power this future world?

We need innovation and experimentation.

We need 1000 flowers blooming.

We need solutions that are open and generative.

I don’t have all the answers, but here is part of it:

  • open access to citation data. We can’t just rely on Scopus, Thomson, and Google Scholar. Those are only three players. They are good at what they do and have been invaluable, but they can’t possibly be as nimble as a whole bunch of startups. It is taking them a long time to come out with a data tracking tool. Why? Probably because they have an ambitious vision and need time to fit it into their other product offerings. Some of the rest of us would be happy with iterating on a quick and dirty solution. We need more competition in this space. The barrier to entry is extraordinarily high because of course reference lists are almost all behind copyright and paywalls…. but open access gives us a toehold.
  • open access to full text. Open access also gives us a toehold into citation context information. A citation to a dataset tells us that the dataset played some role in that new research paper. What role? Was it used to validate a new method? Detect errors? Was it combined with other datasets to solve a problem that was otherwise intractable? The answers to these questions are fundamental to what funders and others need to know about impact. It won’t be easy to derive them from the text of the paper, but I strongly believe it is possible.  We need this to be true open access — can be used by anyone for any purpose — none of this Non Commercial nonsense… we must allow startups to use this information if we are going to get the innovation we need.
  • open access to other metrics of use. We need broad-based metrics… not just citations, but blog posts about data, slides, tutorials that include R data, bookmarks to data on bookmarking sites. Altmetrics.  If you run a data repository, make your download stats publicly available. We frankly don’t know what all of this info means yet, but we didn’t know what citations to papers meant 50 years ago either. We’ll all figure it out, the more data the better.

Here’s what each of us needs to do:

The future where data attribution Counts.

The future is about what kind of impact a dataset makes,
not just a citation number.

The future is open.

Open data.

Open data about our data.

December 3, 2011

thoughts on where journals are now, what to do next

Filed under: Uncategorized — Heather Piwowar @ 2:59 pm

With my Dryad hat on I was recently invited to participate in a “Future of Research Dissemination” day at BMJ.  Invitees were asked to give brief introductory remarks on what journals are doing now to enhance the experience of research and readers, what researchers and readers want, and then at the end what publishers ought to be doing now to future-proof themselves.

My take, fwiw, with some links to previous blog posts with more detail:

What are journals doing now wrt data?

Journals are increasingly recognizing that the datasets which support the findings in their articles are a crucial resource. They are working to make them more available and more useful.

There is increasing recognition that datasets are different than articles:

  • how they are peer-reviewed (or not)
  • how they are licensed
  • how they are discovered
  • how they are preserved
  • how they are financed

For all of these reasons, supplementary information is probably not the right place for datasets. Journals, along with others in the scholarly ecosystem (spurred on by recent requirements by funders for increased data availability, and evidence that researchers often don’t make data available, including for example in cancer) are trying to figure out how best to move forward.

Data repositories are gaining traction as a best practice solution.

Some journals, like BMJ Open, integrate data submission with data repositories like Dryad to make things as easy as possible for authors and in some cases also peer-reviewers.

Journals are also reconsidering their own policies with respect to data, becoming more and more explicit about what is expected from authors.

There is repeated evidence (1, 2, 3, others) of a strong correlation between data archiving policies and impact factor; high IF journals are more likely to expect data to be publicly available, and indeed have been measured to have the highest rate of data availability.

Requiring data archiving in the current culture can feel daunting. One approach taken by the top-tier journals in evolutionary biology was to adopt a coordinated Joint Policy for Data Archiving. Starting this past January, all of those journals simultaneously began requiring data archiving as a condition of publication.

Finally, journals are just beginning to find ways to support synergistic discovery. Links between papers and data, full-text search (because of course a paper is the best metadata for a dataset), and open article metrics of use are getting off the ground. It is crucial these are available for both humans and machines (APIs) to enable innovative and meaningful use. It is also key that they are open. Google Scholar etc isn’t: you can’t spider it, can’t reuse it, can’t mash it up.

What do researchers and readers want?

Lots of things.  To pick one:  Recognition for the labour they’ve put into creating data, and meaningful credit for anything built on top of it.

This is primarily a function for funders and institutions, but journals can play a unique role in making the appropriate credit explicit.

Citations to datasets are a start, but we must go further than that, because citations are too minor.  For example, we could ask authors what resources were essential to the research they are reporting and then reveal those debts in structured and open ways for remixing.

What steps should journals take now (in the next year or two) to future-proof themselves?

(I was one of the last ones in the room to chip in.  OA (and in particular proper OA without NC), article level metrics, collaborating with other publishers, extending open peer review, experimenting in general, adopting stronger data policies, etc had already been mentioned.)

  1. open computer programming interfaces to full text search, open impact metrics, and deep metadata to facilitate external innovation
  2. software challenges for innovative applications, because those are the relationships you want to build
  3. signal the way to best practices, by asking reviewers if “all resources have been made appropriately available” and by leaving space on the submission form for “dataset IDs”   (hat tip to John Wilbanks).  When in doubt with these tactics and policies, be brave, not conservative.
  4. experiment with new and more profound forms of acknowledgement for essential scholarly building blocks
  5. start practicing living within lower profit margins.  (!).

November 18, 2011

Doing data archiving well

Filed under: Uncategorized — Heather Piwowar @ 6:30 am

It is easy to think that archiving data is easy: just put the data files up on a website.  To do it well, though, isn’t that easy.  The Dryad digital repository has been thinking hard about these issues for years, working toward a practical, simple, and rewarding solution.  For Dryad’s website and promotional material we’ve articulated some of the issues we feel are important; see Why Should I Choose Dryad for the up-to-date version.

I copy the current text here to inspire a conversation about “selling points” for a data archive, and even more importantly illustrate how involved it is to make a data archive great.


November 9, 2011

designing an awesome total-impact api

Filed under: Uncategorized — Heather Piwowar @ 12:54 pm

APIs are awesome.  They let other people leverage your product and make unexpected things.  Total-impact exists because it could be built quickly on the APIs of others.

So total-impact itself should have an awesome and easy to use api.

We’ve made huge strides in this regard in the last few days.  We now have an api roadmap and have implemented the first part of it.  The total-impact web app will soon be doing all its data accesses through this api.  We really like saying we are built on our own api : )

Play around with the examples below and see what you think (please don’t use the api heavily or in production yet: it isn’t doing good caching, etc).  Suggestions for improvements in the design are very welcome!

Initial implementation includes:

  • GET /items/ID1,ID2,ID3 or GET /items/ID1,ID2,ID3.html
    • returns html for those IDs, as it would appear on the total-impact website.
  • GET /items/ID1,ID2,ID3.json
    • all metrics info in json format
  • GET /items/ID1,ID2,ID3.xml
    • all metrics info in xml format
  • GET /items/ID1,ID2,ID3.json?fields=biblio,aliases,metrics,debug
    • allows subsetting the metrics info returned
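As a rough sketch of how a client might put these request paths together: the helper name `items_path` and the placeholder IDs are my own for illustration and are not part of the total-impact code base. No base URL is assumed here, so only the path portion is built, and whether `fields` works with formats other than json is not something this post states.

```python
def items_path(ids, fmt="html", fields=None):
    """Build a path of the form /items/ID1,ID2,ID3.<fmt>[?fields=...].

    ids    -- list of artifact IDs, already escaped for use in a URL
    fmt    -- "html", "json" or "xml" (the formats listed above)
    fields -- optional list drawn from biblio, aliases, metrics, debug
    """
    if fmt not in ("html", "json", "xml"):
        raise ValueError("unsupported format: " + fmt)
    # IDs are comma-separated, followed by the format as an extension.
    path = "/items/" + ",".join(ids) + "." + fmt
    if fields:
        # Subset the returned info with a comma-separated fields parameter.
        path += "?fields=" + ",".join(fields)
    return path


print(items_path(["ID1", "ID2", "ID3"], "json", ["biblio", "metrics"]))
# prints /items/ID1,ID2,ID3.json?fields=biblio,metrics
```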

Examples:  (to try other IDs replace / in IDs with %252F)
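The slash replacement mentioned above can be sketched in a couple of lines. The function name `escape_id` is hypothetical; the Dryad DOI is the one used as an example elsewhere on this blog. For what it's worth, `%252F` is what you get when `/` is percent-encoded twice (`/` becomes `%2F`, then the `%` becomes `%25`), which the last lines check with the standard library.

```python
from urllib.parse import quote


def escape_id(artifact_id):
    """Replace each "/" in an ID with "%252F", as the note above describes."""
    return artifact_id.replace("/", "%252F")


print(escape_id("10.5061/dryad.1295"))
# prints 10.5061%252Fdryad.1295

# Percent-encoding "/" once gives "%2F"; encoding that again gives "%252F".
once = quote("/", safe="")
twice = quote(once, safe="")
print(once, twice)
```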

Full roadmap in the works (feedback encouraged!)

October 31, 2011

more about total-Impact

Filed under: Uncategorized — Heather Piwowar @ 11:34 am

Want a bit more info on total-Impact?  Here’s the content of the about page text as it exists on October 31, 2011 to provide context for those of you who don’t usually click through blog links :)

It is early days.  See the bottom of the page if you have ideas, suggestions, or want to give us feedback!

  • what is total-Impact?
  • who is it for?
  • how should it be used?
  • how shouldn’t it be used?
  • what do these numbers actually mean?
  • what kind of research artifacts can be tracked?
  • which metrics are measured?
  • where is the journal impact factor?
  • where is my other favourite metric?
  • what are the current limitations of the system?
  • is this data Open?
  • does total-Impact have an api?
  • who developed total-Impact?
  • what have you learned?
  • how can I help?
  • this is so cool.
  • I have a suggestion!

what is total-Impact?

Total-Impact is a website that makes it quick and easy to view the impact of a wide range of research output. It goes beyond traditional measurements of research output — citations to papers — to embrace a much broader evidence of use across a wide range of scholarly output types. The system aggregates impact data from many sources and displays it in a single report, which is given a permaurl for dissemination and can be updated any time.

who is it for?

  • researchers who want to know how many times their work has been downloaded, bookmarked, and blogged
  • research groups who want to look at the broad impact of their work and see what has demonstrated interest
  • funders who want to see what sort of impact they may be missing when only considering citations to papers
  • repositories who want to report on how their research artifacts are being discussed
  • all of us who believe that people should be rewarded when their work (no matter what the format) makes a positive impact (no matter what the venue). Aggregating evidence of impact will facilitate appropriate rewards, thereby encouraging additional openness of useful forms of research output.

how should it be used?

Total-Impact data can be:

  • highlighted as indications of the *minimum* impact a research artifact has made on the community
  • explored more deeply to see who is citing, bookmarking, and otherwise using your work
  • run to collect usage information for mention in biosketches
  • included as a link in CVs
  • analyzed by downloading detailed metric information

how shouldn’t it be used?

Some of these issues relate to the early-development phase of total-Impact, some reflect our early-understanding of altmetrics, and some are just common sense. Total-Impact reports shouldn’t be used:

  • as indication of comprehensive impact: Total-Impact is in early development. See limitations and take it all with a grain of salt.
  • for serious comparison: Total-Impact is currently better at collecting comprehensive metrics for some artifacts than others, in ways that are not clear in the report. Extreme care should be taken in comparisons. Numbers should be considered minimums. Even more care should be taken in comparing collections of artifacts, since total-Impact is currently better at identifying artifacts identified in some ways than others. Finally, some of these metrics can be easily gamed. This is one reason we believe having many metrics is valuable.
  • as if we knew exactly what it all means: The meaning of these metrics is not yet well understood; see section below.
  • as a substitute for personal judgement of quality: Metrics are only one part of the story. Look at the research artifact for yourself and talk about it with informed colleagues.

what do these numbers actually mean?

The short answer is: probably something useful, but we’re not sure what. We believe that dismissing the metrics as “buzz” is short-sighted: surely people bookmark and download things for a reason. The long answer, as well as a lot more speculation on the long-term significance of tools like total-Impact, can be found in the nascent scholarly literature on “altmetrics.”

The Altmetrics Manifesto is a good, easily-readable introduction to this literature, while the proceedings of the recent altmetrics11 workshop go into more detail. You can check out the shared altmetrics library on Mendeley for even more relevant research. Finally, the poster Uncovering impacts: CitedIn and total-Impact, two new tools for gathering altmetrics, recently submitted to the 2012 iConference, describes a case study using total-Impact to evaluate a set of research papers funded by NESCent; it has some brief statistical analysis and some visualisations of the results.

what kind of research artifacts can be tracked?

Total-Impact currently tracks a wide range of research artifacts, including papers, datasets, software, preprints, and slides.

Because the software is in early development it has limited robustness for input variations: please pay close attention to the expected format and follow it exactly. For example, inadvertently including a “doi:” prefix, or omitting “http” from a url may render the IDs unrecognizable by the system. Add each ID on a separate line in the input box.

artifact type     | host                                       | supported ID format        | example
------------------|--------------------------------------------|----------------------------|-------------------------------------
a published paper | any journal that issues DOIs               | DOI (simply the DOI alone) | 10.1371/journal.pcbi.1000361
a published paper | PubMed                                     | PubMed ID (no prefix)      | 17808382
a published paper | Mendeley                                   | Mendeley UUID              | ef35f440-957f-11df-96dc-0024e8453de8
dataset           | Genbank                                    | accession number           | AF313620
dataset           | PDB                                        | accession number           | 2BAK
dataset           | Gene Expression Omnibus                    | accession number           | GSE2109
dataset           | ArrayExpress                               | accession number           | E-MEXP-88
dataset           | Dryad                                      | DOI                        | 10.5061/dryad.1295
software          | GitHub                                     | URL (starting with http)   |
software          | SourceForge                                | URL                        |
slides            | SlideShare                                 | URL                        | ttp://
generic url       | A conference paper, website resource, etc. | URL                        |
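To illustrate the formatting pitfalls described above, here is a minimal pre-check sketch. It is not how total-Impact itself recognizes IDs; the function name `check_id`, the category labels, and the rules (DOIs start with "10.", PubMed IDs are all digits, URLs start with "http") are assumptions I've drawn from the table for illustration.

```python
import re

# Shape of a UUID such as the Mendeley example in the table above.
UUID_RE = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$")


def check_id(line):
    """Return (kind, problem) for one line of the input box.

    problem is None when the line looks fine, otherwise a short hint.
    """
    text = line.strip()
    if text.lower().startswith("doi:"):
        return ("doi", 'remove the "doi:" prefix')
    if text.startswith("10.") and "/" in text:
        return ("doi", None)
    if text.startswith("http"):
        return ("url", None)
    if text.startswith("www."):
        return ("url", 'add "http://" to the start of the url')
    if text.isdigit():
        return ("pubmed", None)
    if UUID_RE.match(text.lower()):
        return ("mendeley_uuid", None)
    # Accession numbers vary by repository, so they are not checked here.
    return ("accession_or_unknown", None)


for example in ["10.1371/journal.pcbi.1000361",
                "doi:10.1371/journal.pcbi.1000361",
                "17808382",
                "GSE2109"]:
    print(example, check_id(example))
```

A check like this only catches the two mistakes named above (a stray "doi:" prefix, a url without "http"); anything it cannot classify is passed through untouched.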

Identifiers are automatically exploded to include synonyms when possible (PubMed IDs to DOIs, DOIs to URLs, etc).

Stay tuned, we expect to support more artifact sources soon! Want to see something included that isn’t here? See the How can I help section below.

which metrics are measured?

Metrics are computed based on the following data sources:

[the about page lists them but the list is too long for here.  See]

where is the journal impact factor?

We do not include the Journal Impact Factor (or any similar proxy) on purpose. As has been repeatedly shown, the Impact Factor is not appropriate for judging the quality of individual research artifacts. Individual article citations reflect much more about how useful papers actually were. Better yet are article-level metrics, as initiated by PLoS, in which we examine traces of impact beyond citation. Total-Impact broadens this approach to reflect artifact-level metrics, by inclusion of preprints, datasets, presentation slides, and other research output formats.

where is my other favourite metric?

We only include open metrics here, and so far only a selection of those. We welcome contributions of plugins. Your plugin need not reside on our server: you can host it if we can call it with our REST interface. Write your own and tell us about it.

You can also check out these similar tools:

what are the current limitations of the system?

Total-Impact is in early development and has many limitations. Some of the ones we know about:

Gathering IDs and quick reports sometimes miss artifacts

  • misses papers in Mendeley profiles when the paper doesn’t have an ID in the “rft_id” attribute of the html source.
  • seeds only first page of the Mendeley profile
  • Mendeley groups detail page only shows public groups
  • seeds only first 100 artifacts from Mendeley groups
  • doesn’t handle dois for books properly

Artifacts are sometimes missing metrics

  • doesn’t display metrics with a zero value, though this information is included in raw data for download
  • sometimes the artifacts were received without sufficient information to use all metrics. For example, the system sometimes can’t figure out the DOI from a Mendeley UUID or URL.

Metrics sometimes have values that are too low

  • some sources have multiple records for a given artifact. Total-Impact only identifies one copy and so only reports the impact metrics for that record. It makes no current attempt to aggregate across duplications within a source.


  • max of 250 artifacts in a report; artifact lists that are too long are truncated and a note is displayed on the report.

Tell us about bugs! @totalImpactdev (or via email to

is this data Open?

We’d like to make all of the data displayed by total-Impact available under CC0. Unfortunately, the terms-of-use of most of the data sources don’t allow that. We’re trying to figure out how to handle this.

An option to restrict the displayed reports to Fully Open metrics — those suitable for commercial use — is on the To Do list.

The total-Impact software itself is fully open source under an MIT license. GitHub

does total-Impact have an api?

[Edited Nov 10/2011 to add: total-Impact now has an awesome API! More info.]

yes, kinda. Our plugins do, and you can query the update.php with a series of GET requests. Please don’t overload our server, and do add an &email=YOUREMAIL tag on so we can contact you if necessary based on your usage patterns. This is still very new: don’t hesitate to get in touch to figure it out with us.

who developed total-Impact?

Concept originally hacked at the Beyond Impact Workshop (see Contributors). Continued development effort on this skunkworks project was done on personal time, plus some discretionary time while funded through DataONE (Heather Piwowar) and a UNC Royster Fellowship (Jason Priem).

what have you learned?

  • the multitude of IDs for a given artifact is a bigger problem than we guessed. Even articles that have DOIs often also have urls, PubMed IDs, PubMed Central IDs, Mendeley IDs, etc. There is no one place to find all synonyms, yet the various APIs often only work with a specific one or two ID types. This makes comprehensive impact-gathering time consuming and error-prone.
  • some data is harder to get than we thought (wordpress stats without requesting consumer key information)
  • some data is easier to get than we thought (vendors willing to work out special agreements, permit web scraping for particular purposes, etc)
  • lack of an author-identifier makes us reliant on user-populated systems like Mendeley for tracking author-based work (we need ORCID and we need it now!)
  • API limits like those on PubMed Central (3 requests per second) make their data difficult to incorporate in this sort of application
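
To make the last point concrete, here is a minimal sketch of client-side throttling that stays under a limit like 3 requests per second by spacing calls evenly. It is not taken from the total-Impact code base; the class name `RateLimiter` and the injectable `clock` and `sleep` parameters are assumptions for illustration.

```python
import time


class RateLimiter:
    """Space calls at least 1/per_second seconds apart."""

    def __init__(self, per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / per_second
        self.clock = clock
        self.sleep = sleep
        # No call has been made yet, so the first one is never delayed.
        self.next_allowed = float("-inf")

    def wait(self):
        """Block until the next call is allowed, then reserve the slot."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.clock()
        self.next_allowed = now + self.min_interval


# Usage sketch: call limiter.wait() before each request to the source.
limiter = RateLimiter(per_second=3)
limiter.wait()  # the first call returns immediately
```

With a few hundred artifacts per report and several IDs to resolve per artifact, that spacing adds up quickly, which is why limits like this are hard to live with in an interactive app.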

how can I help?

  • can you write code? Dive in! github url:
  • do you have data? If it is already available in some public format, let us know so we can add it. If it isn’t, either please open it up or contact us to work out some mutually beneficial way we can work together.
  • do you have money? We need money :) We need to fund future development of the system and are actively looking for appropriate opportunities.
  • do you have ideas? Maybe enhancements to total-Impact would fit in with a grant you are writing, or maybe you want to make it work extra-well for your institution’s research outputs. We’re interested: please get in touch (see bottom).
  • do you have energy? We need better “see what it does” documentation, better lists of collections, etc. Make some and tell us, please!
  • do you have anger that your favourite data source is missing? After you confirm that its data isn’t available for open purposes like this, write to them and ask them to open it up… it might work. If the data is open but isn’t included here, let us know to help us prioritize.
  • can you email, blog, post, tweet, or walk down the hall to tell a friend? See the this is so cool section for your vital role….

this is so cool.

Thanks! We agree :)

You can help us. We are currently trying to a) win the PLoS/Mendeley Binary Battle because that sounds fun, b) raise funding for future total-Impact development, and c) justify spending more time on this ourselves.

Buzz and testimonials will help. Tweet your reports. Sign up for Mendeley, add public publications to your profile, and make some public groups. Tweet, blog, send email, and show off total-Impact at your next group meeting to help spread the word.

Tell us how cool it is at @totalImpactdev (or via email to so we can consolidate the feedback.

I have a suggestion!

We want to hear it. Send it to us at @totalImpactdev (or via email to Total-Impact development will slow for a bit while we get back to our research-paper-writing day jobs, so we aren’t sure when we’ll have another spurt of time for implementation…. but we want to hear your idea now so we can work on it as soon as we can.

What total-Impact brings to the party

Filed under: Uncategorized — Heather Piwowar @ 11:31 am

As I mentioned in a previous post, I’ve been one of several people working on an app called total-Impact.  Total-Impact is in early alpha release: you can play around with it or check out my report, for fun:

Several similar applications are emerging at the same time, all with the goal of making the impact of scholarly work more easily accessible and actionable.  I think lots of the other apps are great and excel in specific areas.  Here’s my take on what total-Impact brings to the party that most of the others don’t, yet:

nontraditional research objects

All types of scholarly products (datasets, code, slides, preprints, videos, etc) can be tracked into the blogosphere, the bookmarksphere, and the popular press.  Total-Impact includes them in our tools Right Now.  

Sure, policy change has to come from the top down to recognize the importance of these other forms of scholarly products.  But this top-down change can be prompted by change from the bottom up: tools that help early-adopting scholars include these objects on our CVs (with context!) will demonstrate support for valuing these products and begin mainstream adoption.

For example, I include my datasets in my CV.  Dryad displays usage information on its data and this is collected by total-Impact.  Furthermore, total-Impact looks for dataset dois in ResearchBlogging posts and PLoS code, so those types of impact are included too!  (Note in this case the blog post and PLoS mention were in artifacts written by me: self-citations.  If I did this too often, it would clearly look bad, but doing it once or twice is responsible research dissemination, as is true for traditional article citations.)

Total-Impact facilitates drilling down into the source of the metrics whenever supported by the data providers.  For example, want to know who watches the total-Impact code base?  Look at the “software” section of my report, click on watchers, and it will bring up the GitHub page:

A useful way for me to find developers with similar interests!  Ditto which groups have bookmarked my papers on Mendeley, etc.

diverse collection types

ok, another thing that total-Impact brings to the party is that its “collection” focus is very general.  Collections can be about people (like my report above), but they can also be based on a generic Mendeley Group about a topic or the output of a research group (see report for the Mendeley group Future of Science), aggregated based on a Grant Number in PubMed (see report for papers tagged with the grant U54-CA121852 in PubMed), or built from any other collection of IDs an individual or organization wants to assemble.  Powerful.

There are other things that total-Impact brings too, but tracking Diverse Objects and Diverse Collection Types (for lack of better terms) are two perspectives that I hope are soon ubiquitous.

The promise of another open: Open impact tracking

Filed under: Uncategorized — Heather Piwowar @ 11:28 am

I’ve been thinking a lot recently about a scholarly Open that hasn’t gotten much attention yet:  Open impact tracking.

Impact tracking:  Usage data from publishers + websites + apps on the objects they host.  Downloads and views, but also bookmarks, discussions, posts… indications that people have interacted with the objects in some way.

We all know that companies value this information when the digital objects are pointers to consumer products: who is talking about the product?  How many people are talking about it?  What are they saying?  What does it mean?

Now imagine that the digital objects are scholarly products.  Papers, preprints, datasets, slidedecks, software.  Don’t we still want to know who is interested?  How many people are interested?  What they think, what they are doing with it, whether it is making a difference in their own related work?

Yup, as scholars and people who fund and reward scholars, we certainly do want to know those things.

We want to know the numbers, and we want to know the context of the numbers.  Not so we can overinterpret them as the end-all-and-be-all of an assessment scheme, but as insight into dimensions of impact that are totally hidden when we focus on pre-publication metrics (particularly the totally-inappropriate-for-article-level-assessment Journal Impact Factor) or even just the single dimension of citation tracking.

PLoS has led the way: since 2009 PLoS has been collecting and displaying Article-Level Metrics for its articles.  Jason Priem and others have articulated the promise of altmetrics and begun digging into what these metrics mean.

Over the last few months I’ve been having a great time hacking on an app that reveals open altmetrics stats (and their context) for diverse research products.  total-Impact started in a 24-hour hackathon at the Beyond Impact workshop funded by the Open Society Foundations.  Since then a few of us have been unable to put it down.  I’ll talk about it a bit more in a future blog post [added link, also see here], but you are welcome to read more and play around with the alpha release now!

The time is clearly right for this sort of app… several similar ones are emerging now too.

In this post I want to highlight one thing about this space:

Impact information should be Open

The source data for scholarly research impact metrics should be Open.  Open facilitates mashups.  Open enables unexpected use, from unexpected places.  Open lets the little players in and brings the innovation.  Open permits transparency to detect problems.

Total-Impact got going in large part because PLoS and Mendeley have APIs which make their impact data freely and openly available.   Some publishers and websites do the same (or at least display their usage data on webpages and permit scraping) — but most don’t.  Why?

  1. It costs money, a rep from a Very Big Publisher told me last week.  Yup.  But these days not that much money.  This isn’t the beginning of Citation Counting when it was all manual and the only choice was to charge money.  This is routine web stuff.  Consider it one of your publishing costs, as PLoS does.
  2. It can be gamed, we don’t know what it means, it might send the wrong message.  Ok, yes.  But we are using it right now anyway, with all of those “Highly accessed” badges and monthly emails to authors.  The difference?  The data isn’t openly available for analysis and critique and deep understanding and improvement.  I say: open up your data, say what it means and what its limitations are, and work toward standards.
  3. Privacy.  For sure, don’t do things that would make your service users mad.  But that leaves a lot of room for sharing some useful data.  Aggregate download stats, maybe some breakdowns by date or geography or return visitors.  Drill-down to reviews or publicly-available details.  Here are a few of the sources doing it… you can do it too.
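
To make the "aggregate download stats" idea concrete, here is a minimal Python sketch.  It is only an illustration: the event format (date, country, visitor id), the function name `aggregate_downloads`, and the definition of a return visitor as "more than one download" are all assumptions for the example, not any publisher's real log schema.  The point is that visitor ids go in but only counts come out.

```python
from collections import Counter

def aggregate_downloads(events):
    """Collapse per-visitor download events into shareable aggregate counts.

    `events` is an iterable of (iso_date, country, visitor_id) tuples; this
    shape is an assumption made for the example, not a real log schema.
    Visitor ids are used only to count return visitors and never appear in
    the output.
    """
    by_date = Counter()
    by_country = Counter()
    downloads_per_visitor = Counter()
    for iso_date, country, visitor_id in events:
        by_date[iso_date] += 1
        by_country[country] += 1
        downloads_per_visitor[visitor_id] += 1
    # Hypothetical definition: a return visitor downloaded more than once.
    return_visitors = sum(1 for n in downloads_per_visitor.values() if n > 1)
    return {
        "total": sum(by_date.values()),
        "by_date": dict(by_date),
        "by_country": dict(by_country),
        "return_visitors": return_visitors,
    }

# Made-up sample events, purely for illustration.
sample_events = [
    ("2011-09-01", "CA", "v1"),
    ("2011-09-01", "US", "v2"),
    ("2011-09-02", "CA", "v1"),
]
stats = aggregate_downloads(sample_events)
print(stats)
```

A host could publish something like `stats` through an API or a plain webpage without exposing anything about individual users.
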

Note that I’m not advocating that all *uses* of impact information should be Open.  That has advantages, sure, but so does making money.  Making money is important: people who add value through interpretation should be able to be rewarded for that.  But the raw data that backs them up?  Open.

This means:

  • open usage stats.  Views and downloads of scholarly research products over time, number of bookmarkers, etc.  This means publishers and institutional repositories and data hosts and blogging platforms and value-add services.
  • open full text queries.  This doesn’t require OA: Google Scholar allows full text queries into research articles.  Unfortunately Google Scholar doesn’t allow using its information in an automated fashion.  Publisher websites could allow this, ideally through an API.  PubMed Central is a leader here, with eUtils (though its 3 queries/second limit prohibits a lot of useful applications).
  • open reference lists.  You know how abstracts are “open”… or at least free?  If reference lists were also in front of the pay wall and available for aggregation we could have a lot more players in the citation aggregation space, and more agile innovation than Web of Science + Scopus + Google Scholar alone can provide.  Again PubMed Central is a leader here in making citation information Open through its eUtils API.
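
For a sense of what a 3 queries/second limit means for an app, here is a rough Python sketch of a client-side throttle.  The names `Throttle` and `run_queries` are made up for this example, and `fetch` is a stand-in for whatever function actually calls the service; nothing here is the real eUtils client or its URLs.

```python
import time

class Throttle:
    """Space out calls so that at most `max_per_second` happen per second."""

    def __init__(self, max_per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock      # injectable so the timing can be tested
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def run_queries(queries, fetch, throttle):
    """Call `fetch(query)` for each query, pausing between calls as needed."""
    results = []
    for query in queries:
        throttle.wait()
        results.append(fetch(query))
    return results


# Placeholder fetch: a real one would send the query to the service.
throttle = Throttle(max_per_second=3)
results = run_queries(["data sharing", "altmetrics"], lambda q: q.upper(), throttle)
print(results)
```

At three calls a second, a run over 100,000 articles takes more than nine hours, which is the kind of ceiling that rules out a lot of useful applications.
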

Let’s make it clear that we expect Open access to data demonstrating our impact.

Toll-access to articles limits what we scholars can do with aggregated scholarly work. So too, hidden and toll-access to impact information has implications for how we as scholars can filter, navigate, understand, and interpret scholarly work.  It matters.

ETA: link to related blog post

August 18, 2011

my #scifoo

Filed under: Uncategorized — Heather Piwowar @ 2:12 pm

I was lucky to be one of 300 scientists, science educators, science publishers, and science writers who descended on Google last weekend.  #scifoo is an annual event hosted by Google, Nature, and O’Reilly: a hothouse for ideas and collaborations across science.

It sure worked for me.  Buzzing, buzzing with ideas and new people I’ve met and opportunities and future conversations.  I’m still black and blue from pinching myself to prove the conversation I just had was real.

I do hope that #scifoo is a conference of the future.  I gather that a lot of work goes into pulling the right group together, so I’m not sure how scalable it is.  But it sure makes great use of face-to-face opportunities, something desperately lacking in our current conference-paper culture.

You may have noticed that there wasn’t much #scifoo tweeting or blogging while it was going on.  That isn’t because it was discouraged per se.  If anyone wanted something off-record they could just make that clear at any point, and we agreed to respect it under a “FrieNDA”.  No, the reason there wasn’t much external commentary during the event is that we were all too busy participating.  The groups are small, the conversations intense, the hours long…. no time to tweet.

(Also, I think it was a group that was pretty light on blogging and tweeting behaviour in general.  18/270 (7%) listed blogs on the blog page wiki before the conference and 32/270 (12%) listed twitter handles in their brief bios, as suggested.  oh!  just found 35 on lanyrd )

So now a summary, eh?  Not sure it is possible but I’ll highlight a few things.  I’m keeping names out of it because I don’t know who wants to be blog-google-findable about what and I tend to be conservative about that…if something is interesting, ping me and I’ll make the connections.

Some of my highlights:

Sessions I attended:

  • Fascinating session about science revolutions, when they happen easily vs with difficulty.  Wish this one had been longer, it was just starting to get meaty.
  • Really interesting session about tradeoffs between resiliency and efficiency.  Lots of use cases, esp economy and environment, but I was mapping it in my head to research progress.
  • Talking impact beyond the impact factor with someone who has every reason to love the impact factor but wants something better
  • Open Science, Open Data (co-led), Open Protocol Database, lots more….  we had good sessions around these areas.  So help me I’m not able to summarize them right now.  Discussions about differences in disciplines, whether publishing blog posts in journals is problematic, what attributes of the Ingelfinger rule are worthwhile, how data can improve discoverability and links between our research outputs….

Because my knitting, open data, isn’t really a field per se I spent a bit of extra time hanging out with people talking about this… the chances for me to do this in person are few and far between so it was fantastic.  This did come at the expense of going to many of the wild sessions on time, space junk, oxytocin, DIY nurse kits.  Drat!

Personal conversations:

  • Bus ride with someone from the National Research Council Canada about how to work toward widespread policies for open research data
  • Bus ride with someone from Microsoft Research who came at issues familiar to me from an entirely different perspective.
  • Met one of the authors of an early report on Data Sharing, published before publishing on Data Sharing was cool :)
  • Science education!  Wow there are some great movers and shakers in education.  Thank goodness.
  • A strong desire to find someone willing to model the impact of the submission+publishing delay on medical progress.  [Any takers?  Run with it and cc @sommerjo]
  • Yeah…. I’m not going to try to list them all.  There were a whole bunch of cool people doing cool stuff, working to improve the world as they saw it.  Evidence charts, rare diseases, publication bias and data withholding, reproducibility, discoverability, software collaboration, postdoc forums, indebtedness cultures, science exchange, citation policy, citizen science.  I have academic crushes on their work, and look forward to watching where it goes.  [if this is your stuff and you want your name on it here, add to comments or let me know and I’ll link to you]

Last but not least, one of the great things was to meet again — or in person for the first time — a few acquaintances who are now one step closer to friends.

Immediate impact for me:

  • We’re pitching a conference to O’Reilly: Research for More Effective Research.  Your ideas wanted!
  • I now know some science journalists and they know me.  This is big, because I want my research to make a difference.
  • Gave a demo of our total-impact system and got some feedback and good leads.  It needs developers/new ringleaders, ping me if interested!
  • Was reminded that the Citizen Science Alliance is awesome and that I want to write up a proposal to leverage that power
  • Lots more twitter buddies, and more faces to go with the twitter buddies I already have

Long term impact:

Check back with me later :)

Conference pitch: Research For More Effective Research

Filed under: Uncategorized — Heather Piwowar @ 9:30 am

A bunch of us at #scifoo this weekend realized we were reinventing wheels.  We were each doing grass-roots research into ways to make science more efficient and effective.  We were studying related topics in similar ways without knowing about each other… and we had lots of notes and lessons-learned to share.

Furthermore, we were gathering evidence about efficient and effective research methods and tools and practices but weren’t very hooked in to people and organizations who could run with these results and tools to help make a difference.  For that matter, people in decision-making roles often don’t know we exist and that as investigators we want to hear from them and work with them and solve the same problems.

We decided we needed a community and a conference, tentatively dubbed Research for More Effective Research.  Coincidentally, #scifoo is co-hosted by O’Reilly, and they actively want to hear about new conference ideas.  We gave them a quick informal pitch.  They were interested, and asked for a more detailed summary, including possible sponsors and sources of possible participants.

Here’s a first draft, pulled together quickly.  Please help!  I’ve certainly omitted groups, expressed things poorly, etc.  It needs input from the community, so I’ve put the text up on this Open Google Doc.  Please go here and add/edit/revise.  I’ll repost the revised version here in a few days and send it to O’Reilly.

If you want to go on record as supporting this pitch please add your name to the Google doc (not necessary, just if you want).

The goal here isn’t to step on toes.  If you know some other group with this mission, add it and raise it and let’s solve the problem together.

fwiw I’ve got energy to run with this draft and send the initial email to O’Reilly, but after that I need to bow out and work on previous commitments.  This idea will need several additional champions to get off the ground.  Speak up if that sounds like you!

Hmm, let’s start with a hashtag to build community.  I’m bad at picking.  Suggestions?


Research matters.  Research fuels innovation, cures, understanding, and inspiration.

Unfortunately, our research systems haven’t kept up with changing culture and technology.  Our structures, infrastructure, norms, and reward systems are not well aligned with efficient and effective research progress.  There is a lot of room for improvement.

As people who believe in the power of research, we can fix this.  We can study what would work better, and drive towards evidence-based policies and implementation.

This sort of “research for more effective research” is already being done in several scattered areas, but it suffers from a lack of broad community and infrastructure for action.  Bringing together investigators, domain researchers, funders, publishers, educators, tool-builders, and experts in cultural change would allow exchange of methods, better understanding of which problems are most pressing, and support for making a difference.

We suggest an annual conference on “Research for more Effective Research” with tracks for discussion of Methods, Findings, and Implementation.

Topics could include, for example:

  • time from discovery to publication
  • publication bias
  • peer review burden
  • journal subscription cost
  • availability of text, data, material for reuse
  • reproducibility
  • diverse and broad participation
  • next generation research output format

Many conferences, journals, and organizations have related goals and would likely be interested in the opportunity to partner with others for deep analysis and *implementation* of important policies, processes, and tools to make research more efficient and effective:

A number of organizations might be interested in sponsoring such a conference.  Many research publishers have taken an active interest in cutting edge improvements, including Nature, PLoS, and Elsevier.  Funders, both government and charities, want the most research bang for their buck.  Other companies have demonstrated they want to help meet research needs, including BMC, Mendeley, Thomson Reuters, Digital Science, F1000, Springer, and many others.

In summary, we think this is an opportunity for a conference that matters, in an area where a scattered community is already at work, with support from stakeholders and sponsors.  We hope that O’Reilly (or someone else!) will help make this a reality.

August 17, 2011

OA doesn’t cost jobs: it creates them. And saves lives.

Filed under: Uncategorized — Heather Piwowar @ 4:44 pm

I was contacted today by someone evaluating the NIH Public Access policy.  In his introductory email he discussed the context for the analysis, including the “claim by proprietary scientific and technical publishers that they have been damaged by the policy and that extending it to other government funded research would cause a considerable loss of jobs—a powerful claim in the current economic and political environment.”

Thinking about this claim has made me so mad I have to blog it out.  (To be 100% clear:  I’m not angry at the person who contacted me, I’m mad at the claim).


Posting papers in PubMed Central is going to cost jobs, eh?  I guess subscriptions could go down if people are willing to wait 12 months to read cutting edge research, and then publishers might lose money and lay people off.  Not of course the authors or the peer reviewers, since they aren’t paid, but the editorial staff and copyeditors and news writers and graphic designers and system admins….  Ok, I don’t want those people let go, it is true.  How many jobs do we think will be lost this way?  Are we REALLY SURE their contribution to the direct advancement of science can’t be saved some other way, through some other publishing model?

Let me tell you about all the jobs we lose, or fail to have created, since we don’t have Open Access (a step beyond Public Access) to our research articles.

  1. the jobs of people who might do text mining of research studies to IMPROVE SCIENTIFIC PROGRESS
  2. the jobs of people who might build tools on top of all scientific research articles to improve discovery and thereby IMPROVE SCIENTIFIC PROGRESS
  3. the jobs of the people who might mine datasets TO IMPROVE SCIENTIFIC PROGRESS but they can’t, since the data aren’t deposited anywhere, because our way of rewarding authors who archive data is broken, in part because we don’t have access to article text and reference lists and systems that record usage statistics

I will admit that in the current system we gain job openings that we might lose under OA — jobs opened up when people die because our scientific progress isn’t fast enough to save their lives.  Just in case we want to include that in the calculation.

Publishers complaining about jobs?  Find a different publishing system.  One that helps not hinders efficient and effective research progress.


You know that whole bioinformatics industry?  It would still be in the dark ages if people hadn’t made their data open.  There is every reason to believe that opening up the scientific literature will spark the same revolution.  In jobs, and in lives.  Count what matters.

Edited to acknowledge greater respect for jobs in the publication industry.
