Research Remix

August 15, 2008

Importing PubMed MEDLINE details into a MySQL database

Filed under: Uncategorized — Tags: — Heather Piwowar @ 12:45 pm

One more post in my blogging-spree:

I’m doing some text-mining with PubMed MeSH terms, titles, and abstracts. I’ve written quick-and-dirty scripts to parse and analyze PubMed citations before… but enough already. I need a reusable, stable system. Enter Java, Weka, and MySQL.

I need to pull a few thousand PubMed citations into a database. Diane Oliver, Gaurav Bhalotia, Ariel Schwartz, Russ Altman, and Marti Hearst published a paper that describes how to do just that. Complete with Java and Perl source code, and SQL scripts. Sweet!

Tools for loading MEDLINE into a local relational database. Diane E. Oliver, Gaurav Bhalotia, Ariel S. Schwartz, Russ B. Altman, Marti A. Hearst. BMC Bioinformatics 2004 (7 Oct 2004).

Available at BioMed Central.

Software here.

I’ll leave the overview to the paper and instead outline the issues that I encountered when trying to get the system up and running with a MySQL database and 2008 data. Many of these will be obvious to people familiar with Java and databases. In random order:

  • use a database. I originally thought I’d just use their system to parse the XML and then mine the SQL… but it wasn’t worth it. The database calls are integrated into their code. It is easier to install and use MySQL than to work around it. Plus now it is in a database. Excellent.

In the Java code:

  • add a directory called biotextEngine above the zip extraction directories
  • it looks like maybe the XML spec changed? In the parser source, change the line
    } else if (currentElement.equals("MedlineCitationSet")) {
    to
    } else if (currentElement.equals("PubmedArticle")) {
  • my MySQL didn’t have a schema and there were connection errors; comment out the two calls to setSchema in BioTextDBConnection
  • get the MySQL JDBC driver jar file and add it to your CLASSPATH environment variable
  • put the connection settings in a properties file; mine starts with the line #Stores the database connection specific parameters

  • to compile: javac biotextEngine/xmlparsers/medline/*.java
  • to run: java biotextEngine.xmlparsers.medline.MedlineParser efetch.xml
  • Nice: add a System.out.println(pmid) to startElement (within the if (currentElement != null) section; make sure to add curly braces) to keep track of progress
  • Nice: change the INSERT to a REPLACE in case the import fails and you need to restart with some records already added
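
Taken together, the parser tweaks above amount to something like this sketch of a SAX handler. This is illustrative only: the class name, element handling, and PMID bookkeeping here are my guesses, and the real BioText code is organized differently.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Sketch of the changes described above; names are illustrative,
// not the actual BioText source.
public class MedlineParserSketch extends DefaultHandler {
    private String currentElement;
    private StringBuilder text = new StringBuilder();
    public String lastPmid;

    @Override
    public void startElement(String uri, String local, String qName, Attributes a) {
        currentElement = qName;
        if (currentElement != null) {
            // 2008 DTD: the top-level element is PubmedArticle,
            // not MedlineCitationSet
            if (currentElement.equals("PubmedArticle")) {
                text.setLength(0); // reset per-article state
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int len) {
        text.append(ch, start, len);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (qName.equals("PMID")) {
            lastPmid = text.toString().trim();
            System.out.println(lastPmid); // progress report, one line per citation
        }
        text.setLength(0);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<PubmedArticleSet><PubmedArticle><MedlineCitation>"
                   + "<PMID>12345678</PMID></MedlineCitation>"
                   + "</PubmedArticle></PubmedArticleSet>";
        MedlineParserSketch h = new MedlineParserSketch();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), h);
        // prints 12345678
    }
}
```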

In the .sql file for creating the database:

  • don’t run the DROP TABLE lines unless you’ve already created the tables
  • change the VARCHAR lengths from 500 to 250 for vars that are included as primary keys
  • remove the word CLUSTER
  • change the word CLOB to LONGTEXT and delete other stuff on that line
  • my import file had a null grant_id, so: remove the NOT NULL constraint on grant_id in the grants table, add an auto-increment rowid to that table to use as the primary key instead of pmid+grant_id, and add pmid+grant_id as an index
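
In MySQL terms, the grants change above would look roughly like this. The table and column names are my guesses, so match them to the names in the distributed .sql script:

```sql
-- Illustrative MySQL version of the reworked grants table;
-- names are guesses, use the ones from the original script.
CREATE TABLE medline_grant (
    rowid     INT NOT NULL AUTO_INCREMENT,  -- surrogate primary key
    pmid      VARCHAR(250) NOT NULL,        -- 250, not 500, since it is indexed
    grant_id  VARCHAR(250) NULL,            -- nullable: 2008 data has null grant ids
    PRIMARY KEY (rowid),
    KEY pmid_grant_idx (pmid, grant_id)     -- pmid+grant_id kept as a plain index
);
```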

And what’s up with the spacemission table???

It took me a day from beginning to end, with rusty Java and SQL. Huge thanks to the authors for this resource, and to Jon Lustgarten for convincing me that it was worth the tangent to start using Eclipse.


  1. How do you get the efetch.xml file? And how big is this file? I have to do something similar for a class project, so your answer will help me understand how to load the data locally.

    Pratik Parikh

    Comment by Pratik Parikh — September 20, 2008 @ 8:31 pm

  2. Hi Pratik. The size of the file depends on how many papers are found by your search (well, presumably you can download all of MEDLINE into a database, but I have no personal experience with that).

    One easy way to get an XML file is to go to PubMed, do a search, then pick XML in the Display dropdown and Text in the Send to dropdown. Copy and paste into a file. Note it looks like the outer wrapping XML tags aren’t included, so edit the file to wrap the whole thing in an outer tag (I believe PubmedArticleSet is what the Java code expects) so the parser will work.

    Another, more complex but scriptable, way is to use Entrez’s eutils. For example, do a search in PubMed, then go to Details and click URL. The search terms will be encoded in the URL at the top of your browser. Cut and paste the stuff after “term=” into a URL that accesses the PubMed details through esearch, then get the XML through efetch. Example:

    Pubmed Query for cancer:

    Now cut and paste the stuff after Term into the term field in this url to run esearch:

    Now look at the source xml text that is returned, and cut and paste the content from QueryKey and WebEnv into this url (the query_key and WebEnv are from my run, so the url below probably won’t work for you until you replace the values with those from your own esearch results):
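
    The URL shapes in question look roughly like this sketch. The base URL and parameter names follow the standard eutils conventions; the WebEnv value is a placeholder you must replace with the one from your own esearch results:

```java
// Sketch: building esearch/efetch URLs for the workflow described above.
// The query_key/WebEnv values must come from your own esearch results.
public class EutilsUrls {
    static final String BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/";

    // usehistory=y makes esearch stash the result set server-side and
    // return a QueryKey/WebEnv pair that efetch can refer back to
    static String esearch(String term) {
        return BASE + "esearch.fcgi?db=pubmed&usehistory=y&term=" + term;
    }

    static String efetch(String queryKey, String webEnv) {
        return BASE + "efetch.fcgi?db=pubmed&retmode=xml"
             + "&query_key=" + queryKey + "&WebEnv=" + webEnv;
    }

    public static void main(String[] args) {
        System.out.println(esearch("cancer"));
        System.out.println(efetch("1", "REPLACE_WITH_YOUR_WEBENV"));
    }
}
```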

    Hope that helps. More info on esearch and efetch, including some sample code and best practices for use, here.

    It is a bit confusing, though. Maybe I’ll try to do a more detailed post some day.

    Good luck!

    Comment by Heather Piwowar — September 22, 2008 @ 9:58 am

  3. You could use Zotero, a free plugin for Firefox, to get PubMed records and then export them as XML.

    Comment by Lori Gawdyda — October 8, 2008 @ 8:06 am

  4. Since you use Eclipse, did you set up an Ant build.xml file? If so, you could post your completed package with a more detailed report explaining the changes and the benefit of using Eclipse.

    Also, a lot of info was left out about the required classpath settings and which version of the JDK/JRE this works with.

    Comment by Ben — October 10, 2008 @ 7:20 am


  6. I would not use any Java program; a simple XSLT stylesheet would be enough. See this old post on


    Comment by Pierre Lindenbaum — November 2, 2008 @ 5:52 am

  7. Pierre, thanks for the pointer. That does sound simpler! I’ll give it a try next time.

    Comment by Heather Piwowar — December 29, 2008 @ 12:11 pm

  8. Hey Ben,
    Sorry for the delayed response to your comment. Yup, I did leave some things out of my post… alas I didn’t have time to do a fully detailed recounting. I tend to err on “it is better to share some info than no info” when I don’t have time to share all the info.

    I didn’t use ant, didn’t make a build.xml file. Sorry not to have other info handy… I do appreciate your detailed questions, as they remind me what to try to report next time.

    That said, the xslt stylesheet approach mentioned by Pierre in the above comment sounds simple… hope it works for you!


    Comment by Heather Piwowar — December 29, 2008 @ 12:18 pm

  9. For parsing MEDLINE in Java, LingPipe has a complete object model and parser (just updated for the 2009 NLM DTDs, which were current as of mid-December). Here’s a tutorial:

    Here’s a blog post on storing text and other data in search engines or databases based on the kind of queries (e.g. ranked text search vs. SQL joins):

    We also need to keep MEDLINE up to date with daily releases posted on NLM’s FTP site. This requires adding new citations and updating information in old citations. We have a documented sandbox project (called “lingmed”) with networking, MySQL, and Lucene linked from here:

    The setup allows remote access to indexes, using Lucene/RMI and MySQL/JDBC, with a flexible query-based data access layer using the same LingPipe MEDLINE object model. That way, a whole lab can share the resulting up-to-date indexes as if they were local.

    Comment by lingpipe — December 30, 2008 @ 11:14 am

  10. Great info, thank you for the pointers on LingPipe.

    I’ve got some more NLP fiddling in my future (=thesis) and I’m woefully underinformed about Lucene, so I appreciate these jumping-in points.

    The NLP bits will be on open access full text, then I’ll be using a subset of MEDLINE to calculate various statistics (h-indexes, article counts within MeSH terms, etc) to correlate with the NLP classifications. I’m sure there are frameworks that I can use to do these things more efficiently/scalably/reusably than rolling them myself… so again, thanks for the pointers.

    Comment by Heather Piwowar — December 31, 2008 @ 9:45 am
