Research Remix

April 19, 2012

Do we need a text-mining manifesto?

Filed under: Uncategorized — Heather Piwowar @ 9:16 am

Manifestos are all the rage.  They have advantages: manifestos briefly summarize what people expect and thereby facilitate focused action.  Recent manifestos, boycotts, and statements of principle in scholarly communication have articulated various aspects of what researchers expect researchers to do (Panton Principles, The Cost of Knowledge, Science Code Manifesto).

The time has come for researchers to clearly state how we expect to be able to use the already-published literature.  Here’s a start:


We, researchers of the world, expect to be able to *use* the research literature to which we have access.

  • We expect to access and process the full text of the research literature with our computer programs
  • We expect to disseminate aggregate statistical results as facts and context text as fair use excerpts, openly with no restrictions other than attribution
  • We expect these rights without further cost when papers are accessed through researcher-provided tools, and with (at most) a transparent per-api-call fee when accessed through publisher-supplied programmatic interfaces

Publishers who facilitate these terms now in all subscription agreements (or for everyone, where appropriate): PLoS, BMC, other pure OA publishers.

Publishers who facilitate these terms now in subscription agreements with some institutions: Elsevier.


Note these expectations are restricted to literature to which scholars have access, through subscription agreements or OA.  Dissemination expectations are also limited to what is already allowed for facts and fair use.  This is not a trojan horse for OA.

In case you weren’t aware: amazingly, most standard subscription agreements currently prohibit the actions described above.

Thoughts?  Suggested improvements?


  1. This is a terrific idea, Heather. However, I would not suggest volunteering to pay a per-api-call fee. Large commercial publishers are getting more than enough revenue to cover the costs already. Plus, if hits on the publishers’ server are an issue, they can move to business models permitting full local loading to avoid taking on these costs themselves. For example, in Ontario and elsewhere publisher content is housed on the Scholar’s Portal and accessed there, not the publishers’ site.

    Comment by hgmorrison — April 19, 2012 @ 3:53 pm

  2. […] more restrictive than Elsevier right now???) do not hold the cards.  We do.  Go out there and get thee some text-mining rights too.  Make all negotiations public.  Let’s do this. Share […]

    Pingback by text-mining is the new front, ready to escalate issues triggered by RWA debacle « Research Remix — April 20, 2012 @ 6:17 am

  3. Nice work, Heather P.!

    I agree with Heather M.’s suggestion.

    I also suggest that you make this important bit more prominent and say it earlier: “Note these expectations are restricted to literature to which scholars have access, through subscription agreements or OA”. It’s very important. We’re not asking for anything we’ve not already paid for. We’re just saying we want to know that we can exercise our fair-use rights without being hassled.

    Comment by Mike Taylor — April 20, 2012 @ 7:08 am

  4. Thanks for comments. Keep them coming.

    Another point worth clarifying: stating the rights we expect is its own task, and separate from choosing a path to establish/embrace those rights. This post is about the former, simply stating the rights we expect.

    I try to make that more clear in this post:

    “Elsevier and other publishers (who wants to look more restrictive than Elsevier right now???) do not hold the cards. We do. Go out there and a) assert you have these rights and begin exercising them, or b) negotiate for them to be explicitly included in contracts, depending on your views about the best way forward. Make all decisions (and negotiations, if any) public. Let’s do this.

    ETA: clarification that asserting and negotiating are two different paths. The articulation of what rights we expect is the same regardless of path. Thanks to Peter Murray-Rust and Ross Mounce for championing the “assert your rights and go for it” path.”

    Comment by Heather Piwowar — April 20, 2012 @ 7:17 am

  5. This is excellent, Heather: a really clear and important distinction. Maybe you should redraft the manifesto now?

    Comment by Mike Taylor — April 20, 2012 @ 7:33 am

    • Good suggestion. I need to think harder (and ideally talk to more people) about dropping the per-api-call bit. I don’t necessarily agree that costs for text mining should be covered by subscription fees and not all publishers have astronomical profit margins. Also I don’t know the best way forward with it. Simply redrafting, or something more? I don’t think I have time for more… If someone else wants to run with this in some way then please do!

      Comment by Heather Piwowar — April 20, 2012 @ 7:42 am

      • Well, you certainly don’t need to talk about a *per-API-call* cost — that is nailing down an implementation detail that we just don’t know or care about. If you want to raise the idea of supplemental charges at all (I wouldn’t, but your call) you just say something about a service fee to cover the costs of obtaining the files.

        Redrafting shouldn’t take you long, if you keep it short (which you should *definitely* do anyway, if you want anyone to read it.) Maybe make a WordPress “page” for the manifesto rather than posting the new version in a new post, and update it progressively?

        Comment by Mike Taylor — April 20, 2012 @ 7:49 am

      • My day job is negotiating electronic resource licenses for libraries, and I highly recommend not saying anything about being willing to pay extra. Companies like Elsevier (I don’t work with them directly) are not shy about asking for more money. When the people we represent suggest paying more, this puts us in a weak negotiating position.

        Unless you’d like to donate your future academic salary to the cause of ensuring ongoing high publisher profits? Money for academic salaries and money to feed the likes of Elsevier all comes from the same pot – something we don’t think about as much as I think we might.

        Comment by hgmorrison — April 20, 2012 @ 8:47 am

  6. Heather, this is awesome, but I’m worried that many researchers might feel that as they themselves don’t text mine, this isn’t of interest to them. You’ve eloquently argued why this isn’t the case before (i.e. recent data citation post), but I think the first person plural suggests this. I think we need a statement that says we, all kinds of researchers, want our work “to be accessed and processed in full text by computer programs”. Saying “our computer programs” sounds a bit too much like this is the text-mining community trying to speak for everyone, instead of everyone trying to speak for the text mining community.

    Like you have said before, probably most researchers stand to benefit from others text-mining their work and thus adding value to it, and I think the manifesto is most powerful when it comes from that passive benefit perspective, not the perspective of those doing the mining. I’m not usually an advocate of the passive voice, but here I think “we expect the literature to be used” speaks louder than “we expect to use”, etc. What do you think?

    Comment by Carl Boettiger (@cboettig) — April 20, 2012 @ 9:50 am

    • excellent point.

      Comment by Heather Piwowar — April 20, 2012 @ 10:06 am

      • I agree with this, and wonder if it would be wise to bring in PMR’s point about asserting our rights rather than requesting. My suggestion to write a next draft was probably premature – looks like you’ve got a good conversation going here Heather and it would be good to hear more voices before doing a next draft.

        Comment by hgmorrison — April 20, 2012 @ 10:18 am

  7. […] Another controversy, this time around text mining, is brewing in the background, and could possibly further escalate the issues triggered by the RWA […]

    Pingback by How Elsevier can save itself, part 0: Introduction « Sauropod Vertebra Picture of the Week #AcademicSpring — April 20, 2012 @ 4:10 pm

  8. I’d like to say that not only should the full text of scientific papers be available under a reusable license such as CC, but the format is also important. Since the PMC open access subset is distributed in an XML format, it is much easier for text-miners to process a bunch of full text papers.

    Comment by Yasunori Yamamoto — April 27, 2012 @ 4:21 am

  9. Have you seen the Force11 Manifesto – goes pretty far in language of text mining etc –

    Comment by Robert H. McDonald (@mcdonald) — May 9, 2012 @ 12:48 pm
