Research Remix

August 15, 2008

Importing PubMed MEDLINE records into a MySQL database

Filed under: Uncategorized — Heather Piwowar @ 12:45 pm

One more post in my blogging spree:

I’m doing some text-mining with PubMed MeSH terms, titles, and abstracts. I’ve written quick-and-dirty scripts to parse and analyze PubMed citations before… but enough already. I need a reusable, stable system. Enter Java, Weka, and MySQL.

I need to pull a few thousand PubMed citations into a database. Diane Oliver, Gaurav Bhalotia, Ariel Schwartz, Russ Altman, and Marti Hearst published a paper that describes how to do just that. Complete with Java and Perl source code, and SQL scripts. Sweet!

Diane E. Oliver, Gaurav Bhalotia, Ariel S. Schwartz, Russ B. Altman, Marti A. Hearst. Tools for loading MEDLINE into a local relational database. BMC Bioinformatics, 7 October 2004.

Available at BioMed Central.

Software here.

I’ll leave the overview to the paper and instead outline the issues I encountered while getting the system up and running with a MySQL database and 2008 data. Many of these will be obvious to people familiar with Java and databases. In random order:

  • use a database. I originally thought I’d just use their system to parse the XML and then mine the SQL… but it wasn’t worth it: the database calls are integrated into their code, and it is easier to install and use MySQL than to work around it. Plus now everything is in a database. Excellent.

In the Java code:

  • add a directory called biotextEngine above the directories extracted from the zip file, so the path matches the Java package structure
  • it looks like the XML format changed after the paper was published. In MedlineParser.java, change the line
    } else if (currentElement.equals("MedlineCitationSet")) {
    to
    } else if (currentElement.equals("PubmedArticle")) {
  • my MySQL didn’t have a schema and there were connection errors. Comment out the two calls to setSchema in BioTextDBConnection (the database is named in the connection URL instead; see the connection sketch just after this list)
  • get the MySQL Connector/J driver jar file and add it to your CLASSPATH environment variable
  • use this config.properties file:

    #Stores the database connection specific parameters
    driverName=com.mysql.jdbc.Driver
    host=www.PutYoursHere.net
    schema=DoesNotMatterNotUsed
    dbname=PutYoursHere
    user=PutYoursHere
    passwd=PutYoursHere
    urlprefix=jdbc:mysql://
    port=3306
  • to compile: javac biotextEngine/xmlparsers/medline/MedlineParser.java
  • to run: java biotextEngine/xmlparsers/medline/MedlineParser efetch.xml
  • Nice: add a System.out.println(pmid) to startElement in MedlineCitation.java (within the if (currentElement != null) section; make sure to add curly braces) to keep track of progress. A sketch follows this list.
  • Nice: change the INSERT to a REPLACE in NodeHandler.java, in case the import fails and you need to restart with some records already added. Also sketched after this list.
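
For the curious, here is a minimal sketch of the JDBC connection that those config.properties values produce. This is my own illustration, not the BioText code itself; the placeholders are the same ones shown above. Since the database is named right in the URL, MySQL has no separate schema to set, which is presumably why commenting out the setSchema calls is safe:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class ConnectionSketch {
        public static void main(String[] args) throws Exception {
            // urlprefix + host + ":" + port + "/" + dbname, from config.properties
            String url = "jdbc:mysql://www.PutYoursHere.net:3306/PutYoursHere";
            // Requires the MySQL driver jar on the classpath
            Class.forName("com.mysql.jdbc.Driver");
            Connection conn = DriverManager.getConnection(url, "PutYoursHere", "PutYoursHere");
            System.out.println("connected: " + !conn.isClosed());
            conn.close();
        }
    }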
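
The progress println is a small change, but the braceless if is easy to trip over, so here is a hypothetical sketch of its shape. The surrounding code in MedlineCitation.java will differ, and handleElement is just a stand-in for whatever statement the original if controls:

    // Inside startElement: give the braceless if a body with curly braces,
    // then print the pmid so the console shows import progress.
    if (currentElement != null) {
        System.out.println(pmid);       // progress marker
        handleElement(currentElement);  // stand-in for the original statement
    }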
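
The INSERT-to-REPLACE trick works because MySQL’s REPLACE uses the same syntax as INSERT but deletes and re-inserts any row whose primary key already exists, instead of failing with a duplicate-key error. A sketch with illustrative table and column names (NodeHandler.java assembles its SQL differently):

    // Before: a restarted import dies on the first already-loaded pmid.
    String insertSql = "INSERT INTO medline_citation (pmid, article_title) VALUES (?, ?)";
    // After: already-loaded rows are overwritten, so re-running is safe.
    String replaceSql = "REPLACE INTO medline_citation (pmid, article_title) VALUES (?, ?)";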

In the .sql file for creating the database:

  • don’t run the DROP TABLE lines unless you’ve already created the tables
  • change the VARCHAR lengths from 500 to 250 for columns that are used in primary keys (MySQL limits the total length of index keys)
  • remove the word CLUSTER
  • change the word CLOB to LONGTEXT (the closest MySQL type) and delete the rest of that line
  • my import file had a null grant_id, so: remove the NOT NULL constraint on grant_id in the grants table, add an auto-increment rowid to that table to use as the primary key instead of pmid+grant_id, and add pmid+grant_id as an index (see the sketch after this list)
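
Here is a sketch of the reworked grants table as I understand it, created through JDBC so it matches the rest of the setup. The table and column names are illustrative; adapt them to the column list in the distributed .sql script:

    import java.sql.Connection;
    import java.sql.Statement;

    public class GrantsTableSketch {
        // Surrogate primary key, nullable grant_id, and a pmid+grant_id index
        // so the old lookup path still works.
        static void createGrantsTable(Connection conn) throws Exception {
            String ddl = "CREATE TABLE medline_grant ("
                    + " rowid INT NOT NULL AUTO_INCREMENT,"
                    + " pmid VARCHAR(250) NOT NULL,"
                    + " grant_id VARCHAR(250) NULL,"  // NOT NULL constraint dropped
                    + " PRIMARY KEY (rowid),"         // replaces pmid+grant_id
                    + " INDEX pmid_grant_idx (pmid, grant_id))";
            Statement stmt = conn.createStatement();
            stmt.executeUpdate(ddl);
            stmt.close();
        }
    }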

And what’s up with the spacemission table???

It took me a day from beginning to end, with rusty Java and SQL. Huge thanks to the authors for this resource, and to Jon Lustgarten for convincing me that it was worth the tangent to start using Eclipse.

Quosa usage can violate PubMed Central terms of service

Filed under: Uncategorized — Heather Piwowar @ 11:20 am

Has anyone else had a problem with Quosa and PubMed Central?

Quosa sounds great.  “Full-text journal workflow solutions.”  Exactly what I need, poof, no custom code required.

I downloaded the free demo.  I wanted some full-text for text mining: a large number of articles from years ago… they are available for free on the publisher’s website.  Quosa was simple to get going, and the first articles were retrieved with no problem.  But soon, lo and behold, rather than the articles, I was getting a webpage with a message from PubMed Central: IP blocked due to Bulk downloading of content.  Aaaaah!

I’m aware of PMC’s policy and have designed custom download methods to respect it (while working to get it changed).  I didn’t realize that Quosa was downloading the articles from PMC.  In retrospect it is clear, because the early articles it retrieved were in PMC format, though that was not obvious from the download interface.  Most people aren’t going to be aware of PMC’s restrictions, and so won’t know to look for the problem.

Quosa doesn’t seem to have any mechanism by which one can specify the source of the articles (so that I could have requested they be downloaded from the publisher’s website).   I must admit that in my late-evening haste I didn’t read the licensing agreements when I installed the demo… perhaps there were some warnings there?  The issue doesn’t seem to be covered in their FAQ.

Hmmmmm.  No more Quosa for me.  I’ll email them and ask for clarification, for the record.  I’ve also emailed PMC to ask for forgiveness and an unblocking.

April 7, 2008

Non-OA Full-text for text mining

Filed under: openaccess — Tags: , — Heather Piwowar @ 9:28 am

Interesting discussion on Peter Murray-Rust’s blog about whether PubMed Central articles can be crawled and used for text mining. The answer is no, not now, not unless they are open access (as opposed to closed-access articles that have merely been deposited in PMC).  Really unfortunate.  Incremental progress, we’ll get there.

Anticipating my thesis work, I’ve been wondering about similar text mining questions. I think my needs are a bit different from those of PMR: I’m interested in papers that meet a targeted search, rather than all articles or all articles in a relevant journal (which is what I gather he’d be interested in?). I’m willing to limit myself to the articles that I have access to through my University’s subscriptions. I don’t need figures. I think once I have the papers I’m allowed to text mine them as fair use, since I have them under permission. So the question is: what can I automatically download?

I learned I can’t spider PMC, but what about normal PubMed? Try as I might, I couldn’t find any language on the PubMed website allowing or disallowing spidering through to full-text links on publisher websites (the links that are populated and visible when I’m logged in through the University’s connection). Is this allowed? It still seems like it might not be. And then you end up at the publisher sites anyway, with all of their differing rules. Unfortunately, the publishers’ rules are often hard to find, confusing, and vague (as often noted by PMR and others). Aaaaah.

So last month I asked our librarians….

As you know, PMC has OA and non-OA full-text. They make their OA text available via FTP etc, and they stipulate that those mechanisms are the only way that people are allowed to access the full text “because of copyright restrictions” [http://www.pubmedcentral.nih.gov/about/copyright.html]. I’d also like to access non-OA text for which Pitt has subscriptions, but it sounds like I can’t do this by “crawling” PMC based on their rules [explicitly stated in the link above]. I guess I’m wondering if I can do it by “crawling” the normal, full PubMed. Basically write a script to find the “HSLS” links on the article citation pages, follow them (usually into the publisher’s websites), and automatically save the html or pdf articles that are returned from a PubMed query.

There is no difference in the end result from me manually clicking through and saving the papers… but there is sure a difference in the manual time required! I wouldn’t have thought this sort of automated downloading would be a problem… but the “Restrictions on Systematic Downloading of Articles” section of the PMC copyright notice referenced above makes me want to double-check. I can’t find any reference to “crawling” or “systematic downloads” for PubMed itself.

I do understand there are user requirements when using the Entrez programming utilities (run automated queries during off hours, leave 3 seconds between queries, etc.) and I would be sure to honor those, both in the parts of my scripts that use the E-utilities and in those that crawl the web pages directly.

Does that make sense? Are you aware of any restrictions on crawling PubMed to automatically access and save content to which I do indeed have access through Pitt? I guess since I’m going into the publishers’ websites, they might also have restrictions? Is there another way to consolidate a large set of electronic full-text articles (ideally a few thousand)?

Thanks very much for any pointers you may have.

The librarian responded that automatically following PubMed links should be fine, and that there shouldn’t be problems from publisher sites because we have subscriptions and my text mining falls under fair use. I’ll add that I think it helps that I’m not aiming to download full editions, because I do know that some publisher websites disallow that.
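
For concreteness, the throttled downloading I described in the letter looks something like this. It is a minimal sketch: the PMIDs are placeholders, a real run would get them from an esearch query, and the per-article handling is elided:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class PoliteFetch {
        public static void main(String[] args) throws Exception {
            String[] pmids = {"12345678", "23456789"};  // placeholders
            for (String pmid : pmids) {
                URL url = new URL("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
                        + "?db=pubmed&retmode=xml&id=" + pmid);
                BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
                String line;
                while ((line = in.readLine()) != null) {
                    // ... save or parse the returned XML here ...
                }
                in.close();
                Thread.sleep(3000);  // at least 3 seconds between queries
            }
        }
    }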

Maybe I shouldn’t be bringing it up again here, since it feels like I’ve been given an institutional “All Clear.” But there’s no sense burying my head in the sand in case there really are issues: I want to know. Web downloading policies and full-text reuse policies are so complicated. I’ve spent time looking into them, but it sure seems that unless it is your full-time job, it is impossible to understand and keep on top of how it all works. I don’t think our librarians deal with these issues every day. Who else would I go to for clarification?

Does anyone have differing interpretations, warnings, reassurances, alternatives, and general paths through this crazy mess? How do other people do this???
