Why may Google textmine but Scientists may not?

March 13, 2013

Filed under: Uncategorized — Heather Piwowar @ 1:50 pm

I recently posted about why Google is not a good enough solution for searching the academic literature (because we can’t build on the results! Read the comments on that post for more).

It is sad indeed, then, that PMC and Publishers forbid scientists and others from spidering/indexing/mining their content… while giving Google the privilege to do exactly this.

Check out the robots.txt file for PMC for /pmc/articles/ and notice that GoogleBot is allowed, Bing and a few others are allowed, but User-Agent: * (the rest of us) is not. The same is true for the ScienceDirect robots.txt: Google may textmine everything; experimenting scientists, nothing. (Hat tip to Alf Eaton on Twitter.)
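For anyone who wants to check this kind of rule set themselves, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text in it is a simplified stand-in written for illustration, not a copy of the real PMC or ScienceDirect file, and `ExampleResearchBot` and `check_access` are hypothetical names.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only: a simplified stand-in for the pattern described
# above, NOT a copy of any real site's robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /pmc/articles/

User-agent: *
Disallow: /pmc/articles/
"""

# An article URL of the kind discussed in this post.
ARTICLE_URL = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1688699/"


def check_access(robots_txt, url, agents):
    """Return {agent: allowed?} for each user agent under the given rules."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    # "ExampleResearchBot" is a hypothetical crawler written by a scientist.
    agents = ["Googlebot", "ExampleResearchBot"]
    for agent, allowed in check_access(EXAMPLE_ROBOTS_TXT, ARTICLE_URL, agents).items():
        print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

Under these example rules the named search engine matches its own entry and is let in, while every other crawler falls through to the `*` entry and is turned away. To test a live site, `set_url()` followed by `read()` on the same parser fetches the actual file instead.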

Is this defensible on the grounds that Google knows what it is doing but The Rest Of Us Can Not Be Trusted? I sure hope not. Scientists are routinely trusted with a lot more than writing a script that won’t bring down a server. There are other ways to ensure someone won’t bring down a server than a global robots.txt ban.

Perhaps a ban is the only way to prevent unauthorized redistribution of large numbers of papers gathered via spidering? Nope. Require people to register. Monitor use. Clearly state what may be redistributed, what may not, and what actions will be taken if people behave badly.

Maybe they are just waiting till Scientist-initiated indexing projects get Big and Important and Ask Nicely, and then they will write them in as allowed users. Maybe. But restricting play and experimentation is a pretty poor way to bring about that future, and we should not accept this as the default behaviour of the keepers of our scientific literature.

PMC calls its prohibition against bulk downloading a “copyright” issue. That doesn’t make any sense to me. Sounds much more like a Terms of Use issue than a copyright issue. Am I wrong? If so, educate me in the comments. If I’m right, then I think we should ask PMC to change its wording because calling this a copyright issue just muddies already muddy waters.

It does appear to be, at least in part, a contract issue. In the contract between publishers and PMC (http://t.co/EhZP5SrS1i point 16, ht again to Alf Eaton), PMC volunteers in its terms that PMC will prohibit bulk downloading. Why does PMC include this sentence? Is it part of the NIH Public Access law that PMC has to include this sentence? If not, isn’t it capitulating an awful lot to publishers… basically undermining the ability for scientists to build enhanced searching tools, etc?

(And how, given this, does Google get access? Don’t get me wrong. I think Google is fantastic! I want Google to keep having access! I just want all responsible systems to have the possibility of the same access to our publicly funded and hosted research, so that someone will build infrastructure that properly supports research and research tools.)

Anyway, these spidering policies strike me as unfair, and something that people should be talking about and complaining about and doing something about, especially as we start to craft new policies for how people and computers can access our Public Access research output under the new OSTP policy.

23 Comments

My reading of the PMC copyright notice http://www.ncbi.nlm.nih.gov/pmc/about/copyright/ is that unless journals have an explicitly “open” license (e.g., PLoS and BMC journals) PMC treats them as being under copyright and hence you can’t mine them. Inclusion in PMC isn’t the same as being open (in the same way that “free” isn’t the same as “open”). For example http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1688699/ is freely available as a PDF on PMC, as well as the Royal Society’s site http://dx.doi.org/10.1098/rspb.1997.0199. But the PDF is under copyright (see http://bit.ly/XyPdyY for details on pricing).

Comment by Rod Page — March 14, 2013 @ 6:56 am
- Thanks for the comment. Yup, right on about the difference between free and open, this is important and often missed.
  
  The point I’m trying to make here is different, though. Here’s the sentence I’m questioning: “Bulk downloading of articles from the main PMC web site, in any way, is prohibited because of copyright restrictions.”
  
  “Bulk downloading” doesn’t have to do with copyright restrictions, does it? (Bulk downloading isn’t necessarily about textmining, the “bulk” selection could be across different publishers/copyright holders, etc) I think it is sloppy language that confuses the issues and propagates misunderstanding.
  
  Comment by Heather Piwowar — March 14, 2013 @ 7:02 am
  - One concern publishers may have is what happens to content once it’s downloaded. Google doesn’t redistribute content it indexes, but others might. Once it’s downloaded publishers have little control over what happens to it. And imagine the hassle for PMC of figuring out all the permutations of licenses that publishers may have used if it were to permit bulk downloads. I suspect, in the end, it’s always easier to say “no”.
    
    Comment by Rod Page — March 14, 2013 @ 7:13 am
  - Agreed w your follow-up comment that it is easier to say no. But it is our job to push back on the unnecessary and harmful Nos, and demand accurate explanations of imposed restrictions.
    
    Comment by Heather Piwowar — March 14, 2013 @ 7:16 am
  - If publishers are concerned that content would be redistributed, the terms of use should say “you’re not allowed to redistribute this content” – restricting distribution is not an appropriate way to restrict redistribution.
    
    Comment by alf — March 14, 2013 @ 9:15 am
  - +1 Alf: “restricting distribution is not an appropriate way to restrict redistribution”
    
    Comment by Heather Piwowar — March 14, 2013 @ 11:12 am
Dario Taraborelli (@ReaderMeter), 7:25 a.m., Mar 14, 2013:
@researchremix excellent post, have you tried to obtain directly from PMC an explanation of the crawler restriction waiver for Google?

Heather Piwowar (@researchremix), in reply:
.@ReaderMeter Nope: writing a blog post is quicker than starting conversation, I truthfully+shamedly admit. Someone grab baton from here?

Comment by Heather Piwowar — March 14, 2013 @ 7:30 am
This is a sticky wicket for those of us who manage IRs, too – if a library’s license for AIP journals (for example) prohibits bulk downloading, but AIP as a publisher permits authors to post the final PDF versions of their articles in their IR, does this prohibit the IR manager from downloading all their faculty’s articles for the purposes of deposit?

Comment by hillary (@_hillary) — March 14, 2013 @ 11:38 am
- I’d definitely err on forgiveness rather than permission in that case, esp if bulk isn’t defined.
  
  Comment by Heather Piwowar — March 14, 2013 @ 11:44 am
  - I agree with you, but I know of at least one major university library that does not do bulk deposits of articles where their licenses include a prohibition of that nature, for fear of violating the TOS.
    
    Comment by hillary (@_hillary) — March 14, 2013 @ 11:51 am
  - Major university libraries need to stop living in fear of violating the TOS…. their fear avoidance has hurt and is hurting our research infrastructure. Risk it or get clarity then fight for it, but stop fearing it already.
    
    Comment by Heather Piwowar — March 14, 2013 @ 11:53 am
  - IRs’ tendency to live in fear of publishers is one more reason why I am skeptical of their ability to get the job done on OA. It needs every institution to independently grow a pair. What are the chances of that happening?
    
    Comment by Mike Taylor — March 15, 2013 @ 5:03 am
- It is also a question for IRs from the other side…. does your IR allow everyone to spider/index the papers housed in it, or only Google etc?
  
  Comment by Heather Piwowar — March 14, 2013 @ 11:46 am
  - In our case we do expose our repository data through an OAI-PMH gateway. Does that change the answer, I wonder? Or is it okay since we’re not exposing the entire corpus of a publisher’s articles but rather just those with our affiliated authors?
    
    Comment by hillary (@_hillary) — March 14, 2013 @ 11:50 am
@researchremix Can’t possibly have anything to do with copyright. Copyright has nothing to do with the mechanism or scale of fetching.

— Steve Pettifer (@srp) March 14, 2013

Comment by Heather Piwowar — March 14, 2013 @ 11:41 am
Transparency from PMC would help. Which publishers demanded what? What do the contracts actually say?

Comment by dsalo — March 14, 2013 @ 2:34 pm
BTW, you are exactly right that TOS issues are completely separate from copyright issues. The only thing they have in common is a tendency to stop people from doing useful things. It terrifies me how consistently people who should know better (government agencies, broadsheet newspapers, universities) are deeply confused about IP issues, and routinely conflate copyright, TOS, patents, trademarks and more. The ignorance is astonishing; what hurts is that it’s sometimes the people making the decisions that are ignorant.

Comment by Mike Taylor — March 15, 2013 @ 5:05 am
This blog post has been quoted in a Nature News article about text mining by Richard Van Noorden:

http://www.nature.com/news/text-mining-spat-heats-up-1.12636

Comment by Heather Piwowar — March 20, 2013 @ 11:32 am
- I asked PMC about this. From Ed Sequiera: ‘The questions you raise are best left to the lawyers for publishers and interested users of the content. We don’t try to make those interpretations of the copyright laws. When we get requests for text mining or other automated uses of articles outside the PMC OA subset we will give users the data if they get permission from the respective publishers.’ He added, on follow-up: ‘To be more specific, the publishers that have opted out of the PMC OA subset won’t allow automated downloading in general – presumably because they can’t control what people will do with the content; we obviously haven’t asked every publisher for its reasons. They’re ok with the content being indexed by the primary public search engines.’
  
  Comment by Richard Van Noorden (@Richvn) — March 20, 2013 @ 3:37 pm
How much do you trust Google? Millions of people trust Google enough to let them store their emails. Would you trust a random academic research group as much? Maybe some, but not all.

At the moment, publishers don’t trust most researchers enough to let them store their (paywalled) articles. But this might change. Maybe with some compromise: Elsevier, for instance, gives some researchers access to XML files, which are pretty difficult for a human to read and lack figures and decorations, but contain the bulk of the text.

It could all be solved very easily: publishers could set up a computer where researchers can run their software on the files without ever getting direct access to them, only to the results. But it would require some dedication to build this.

Comment by maximilianh (@maximilianh) — March 20, 2013 @ 1:08 pm
- You’re quite correct, Max, from the point of view of the work you’ve been doing: for paywalled articles to be handed over in bulk to someone, they have to be trusted to not redistribute that content. This isn’t about those articles, though – this is about articles that are *freely available online for anyone to read*; the only restriction is the terms of service which says you’re not allowed to download them using any automated process.
  
  Comment by alf — March 20, 2013 @ 4:30 pm
- Max, this is a very different issue from trusting Google with email. Email contains sensitive personal information. These articles are there to be read. As Alf says, this is even an “open” (free) subset. What you’re really saying is that publishers think researchers might take these “open” articles and somehow take publisher income away, perhaps by selling them. Or maybe they think other publishers will take their free/open articles and make money by republishing them! The more you think about this prohibition, the more nonsensical it is.
  
  Comment by Chris Rusbridge — March 21, 2013 @ 12:59 am
Another repository that does exactly this is arxiv.org. Please see their robots.txt http://arxiv.org/robots.txt . I contacted them about this but received absolutely no response. We have also been banned for bulk downloading from PMC :) I suggest that repositories with these policies should not be referred to as Open Repositories. It is very tragic if open access to content means open access to content for Google only. I have drafted a paper about this issue, which discusses the details. It is called “From Open Access Metadata to Open Access Content” http://core-project.kmi.open.ac.uk/files/oa-metadata-to-oa-content.pdf . Section 5 is most relevant.

Comment by Petr Knoth — March 28, 2013 @ 8:35 am

Research Remix