Research Remix

December 29, 2008

Papers authentication for University of Pittsburgh

Filed under: Uncategorized — Heather Piwowar @ 12:08 pm

I’ve been using Papers to keep track of research article PDFs on my Mac.  It is fantastic… revolutionizes the way I’m doing mini lit reviews.   Reduces the need to refind papers (a pain even for open access ones), and allows me to search across full-text in a way I hadn’t been doing until now.

(a bit of a pain to keep synched with citeulike and biblio software, but that’s life)

It is especially fantastic now that I’ve found the magic words to let it automatically connect to University of Pittsburgh online resources, making it easier to locate and attach PDFs.  Thanks to the anonymous person who posted this on a random Papers forum thread…. posting it here in case it makes it easier for someone to find via Google.

In Preferences, under authentication url, enter https://sslvpn.pitt.edu and check the “go to this page when Papers starts” box.

Under library proxy, enter https://sslvpn.pitt.edu/dana/home/launch.cgi?url=%@

When Papers starts, enter your Pitt ID and password and follow your nose.
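
For the curious, here is roughly what the library proxy setting does.  This is only a sketch, under my assumption that Papers simply drops the article's address in place of the %@ placeholder; the article URL below is made up for illustration.

```r
# Library proxy template, exactly as entered in the Papers preferences
proxy.template <- "https://sslvpn.pitt.edu/dana/home/launch.cgi?url=%@"

# Hypothetical article address, for illustration only
article.url <- "http://www.example.org/article.pdf"

# Assumption: plain literal substitution of the %@ placeholder
proxied.url <- sub("%@", article.url, proxy.template, fixed = TRUE)
proxied.url
```

If the assumption holds, every PDF request is routed through the Pitt SSL VPN page, which is why you only have to log in once when Papers starts.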

Generalizability coefficient for Mechanical Turk annotations

Filed under: Uncategorized — Heather Piwowar @ 11:14 am

Hello dear neglected blog.  I’ve been busy working on various things and posting about none of them, unfort.  Just saw a post by Jean-Claude Bradley that has spurred me to action, and reminded me about why open science and open halfway-done-results are useful… you never know who might be thinking about similar things.

To wit:  I spent a few weeks earlier this month experimenting with Mechanical Turk.  I was planning to use it to get annotations for my thesis… then lo and behold, a ready made gold standard arrived on my doorstep, so unfort I’m putting further experimentation on the back burner.

As part of the early fiddling, though, I did dig out a bunch of lovely references about Mechanical Turk, draft a few paragraphs that were going to go in my proposal, run a pilot study, and write some R code to estimate generalizability coefficient.  Maybe this will be useful to someone?

In case you aren’t familiar, a generalizability coefficient is a measure of reliability for annotations.  A rule of thumb is that a gen coef of 0.7 is good if you are going to use the annotations to evaluate a system, but >0.9 is necessary if you are going to use the annotations to refine the system on a case-by-case basis.  Hripcsak et al. wrote a nice paper about this.  I’ve refined their approach a bit to account for the fact that when using Mech Turk you don’t have a fixed set of annotators across all the questions.  The refinement is based on the excellent tutorial by Michael Brannick here on the Shrout and Fleiss approach.

Warning:  I’m not a stats expert and this code is rough, so no guarantees… but I think it is a valid approach.  Critiques and thoughts?  Leave a comment, I’d love to hear them.

Ok, with that preamble, here’s a raw cut and paste of my work to date.  Let me know if it isn’t understandable… apologies for lack of comments, etc.   I’ll maybe try to pretty up the descriptions and code a bit and submit to Nature Precedings so that it is citable, at some point when I have time.  In the meantime, better shared in a dirty state than hidden on my harddrive, right?

For what it is worth… Mechanical Turk looked like a really promising approach for my application :)

Generalizability Coefficient for Mechanical Turk Annotations
Heather Piwowar
Rough Draft, Dec 2008

Amazon’s Mechanical Turk for annotation

It is challenging to assemble an unbiased reference standard quickly and inexpensively.  Several recent studies have explored using Amazon’s Mechanical Turk, a distributed online micro-market, to annotate text[1-5]. Their experiences suggest that this approach to collecting and consolidating non-expert labels is a feasible method for constructing a useful annotation corpus.

While some data is very noisy, the invalid responses are due to a small minority of users[6] and do not constitute a major problem.  Snow et al. estimate that four non-expert labelers per item are sufficient to emulate expert-level label quality for an affect recognition task[2].  Some research has demonstrated methods for reweighting[2] and developed recommendations for how to develop a good task[6, 7].

Since a typical job (or HIT, for Human Intelligence Task) pays only a few cents, collecting annotations this way can be very cost effective.  However, there are a few reasons not to pay too little:  low payments may lead to sloppier annotations[7], they raise ethical concerns about offering a fair wage[8], and tasks with very low bids simply won’t be picked up by workers (Turkers).

Pilot annotation study with Amazon’s Mechanical Turk
We conducted a pilot annotation study with Amazon’s Mechanical Turk to estimate the accuracy with which annotation tasks can be performed by this group of non-experts, the number of independent annotations necessary to get sufficient generalizability, and the cost of annotation.

We selected the task of identifying which papers about microarray data generated microarray data, since it is more nebulous, holistic, and challenging than identifying statements of data sharing.  We selected 10 random articles from PubMed Central that contained the words gene, expression, and microarray.  We provided background information about the nature of gene expression experiments and our criteria for deciding whether a paper includes a gene expression microarray experiment.  We then provided a link to PMC and asked Turkers a) whether the paper included the running of a gene expression microarray experiment, and b) to cut and paste an excerpt from the paper that supports their choice.

The results agreed closely with my “expert” annotations.  The majority vote of the five annotations was the same as my annotation in 9 of 10 cases, indicating 90% accuracy, albeit with a very wide confidence interval going from 0.54 to 0.99.  A chi-squared test between the majority vote and my annotations was not significant (p=0.07).  As another validation, the majority vote identified all 4 of the 10 articles with links from GEO as “dataset creating” and the 2 of 10 articles in bioinformatics journals (a proxy for non-dataset producing) as not dataset creating.  This preliminary evidence supports the validity of the Turker annotations.
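
The interval above can be reproduced from the two counts alone.  A minimal sketch, assuming the interval came from R's prop.test with its default continuity correction (the appendix code calls prop.test(correct, total)); the variable names here are illustrative:

```r
# 9 of the 10 majority votes matched the expert annotation
correct <- 9
total <- 10

# prop.test() reports a 95% score interval with continuity correction by default
ci <- prop.test(correct, total)$conf.int
round(as.numeric(ci), 2)
# 0.54 0.99
```

With only 10 articles the interval spans almost half of the possible range, which is why the 90% figure should be read as preliminary.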

To establish the reliability of the annotations, I calculated the generalizability coefficient for random annotators.  The calculation approach parallels that outlined by Hripcsak et al. [9], but assumes that each question is answered by a different, randomly-selected set of annotators.  Using the formulas of [10] (helpfully interpreted at [11]), the pilot generalizability coefficient is 0.87, and the estimated numbers of similar annotations needed to achieve various target generalizability coefficients are given in the table below.

Table 1:  Generalizability coefficients

Target generalizability coefficient  |  Estimated number of annotations per task required to achieve target, for the pilot task
0.5  |  2
0.6  |  3
0.7  |  4
0.8  |  6
0.9  |  13
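
The table follows from the Spearman-Brown prophecy formula, which is what the appendix code applies.  A worked sketch: the single-annotator coefficient \rho_1 = 0.42 is an assumed value picked for illustration (the draft does not report it), though it happens to reproduce every row of the table.

```latex
% Reliability of the mean of n annotators, given single-annotator reliability \rho_1,
% and the same relation solved for n at a target reliability \rho^{*}
\[
\rho_n = \frac{n\,\rho_1}{1 + (n-1)\,\rho_1}
\qquad\Longleftrightarrow\qquad
n = \frac{\rho^{*}\,(1-\rho_1)}{\rho_1\,(1-\rho^{*})}
\]

% Worked example: assumed \rho_1 = 0.42, target \rho^{*} = 0.7
\[
n = \frac{0.7 \times 0.58}{0.42 \times 0.3} \approx 3.22
\;\Rightarrow\; \lceil n \rceil = 4
\]
```

Under the same assumed \rho_1, three annotators would give \rho_3 of about 0.68 and four would give about 0.74, so four is the smallest number that clears the 0.7 target.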

Hripcsak et al. [9] propose that a generalizability coefficient of 0.7 is sufficient for system evaluation.  Our results suggest that 4 annotators would be sufficient for this task.  This number is consistent with the findings of Snow et al.[2].  We will likely choose 5 annotators, to facilitate a majority vote without ties.

We ran the experiment in two phases.  In the first, we recruited 2 opinions per paper at $0.50 per answer + $0.50 bonus for each answer that matched the “expert” opinion.  In the second phase, we recruited an additional 3 opinions at $0.15 per answer + $0.10 bonus.  Accuracy and accrual rates did not vary with payment.

In summary, we believe that using Amazon’s Mechanical Turk will be an efficient, accurate way to establish gold standards for this study.  We anticipate gathering 5 opinions per article at a cost of $0.25 each.

Risks and Contingency Plans
I have contingency plans in case the annotation through Mechanical Turk is very noisy.  I would refine the annotation task to  provide gold standard feedback[7], use a qualification test[12], or weight labelers based on accuracy[2].

1.    Sheng, V., F. Provost, and P. Ipeirotis, Get Another Label? Improving Data Quality and Data Mining. archive.nyu.edu, 2008.
2.    Snow, R., et al., Cheap and Fast—But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. Proceedings of EMNLP-08, 2008.
3.    Nakov, P., Paraphrasing Verbs for Noun Compound Interpretation. Proc. of the Workshop on Multiword Expressions, 2008.
4.    Li, B., et al., Exploring question subjectivity prediction in community QA. Proceedings of the 31st annual international ACM SIGIR …, 2008.
5.    Yakhnenko, O. and B. Rosario, Mining the Web for Relations between Digital Devices using a Probabilistic Maximum Margin Model. Proceedings of the Third International Joint Conference on Natural Language Processing, 2008.
6.    Kittur, A., E.H. Chi, and B. Suh, Crowdsourcing user studies with Mechanical Turk. CHI ‘08: Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, 2008.
7.    Sorokin, A. and D. Forsyth, Utility data annotation with Amazon Mechanical Turk. Computer Vision and Pattern Recognition Workshops, 2008.
8.    Pontin, J., Artificial Intelligence, With Help From the Humans. The New York Times, March 24, 2007.
9.    Hripcsak, G., et al., A Reliability Study for Evaluating Information Extraction from Radiology Reports. Journal of the American Medical Informatics Association, 1999.
10.    Shrout, P.E. and J.L. Fleiss, Intraclass correlations: uses in assessing rater reliability. Psychol Bull, 1979.
11.    Brannick, MT.  Shrout and Fleiss Computations for Intraclass correlations for interjudge reliability. URL:http://luna.cas.usf.edu/~mbrannic/files/pmet/shrout1.htm. Accessed: 2008-12-29. (Archived by WebCite® at http://www.webcitation.org/5dQr5D5ta)
12.    Dakka, W. and P. Ipeirotis, Automatic Extraction of Useful Facet Hierarchies from Text Databases. Data Engineering, 2008.

Appendix:  R code for generalizability coefficient

## Code by Heather Piwowar,  December 2008

## Approach inspired by Hripcsak G, Kuperman GJ, Friedman C, Heitjan DF. A reliability study for evaluating information extraction from radiology reports. J Am Med Inform Assoc 1999;6:143–50.
## Equations from the helpful tutorial at the webpage Shrout and Fleiss Computations for Intraclass correlations for interjudge reliability. URL:http://luna.cas.usf.edu/~mbrannic/files/pmet/shrout1.htm. Accessed: 2008-12-29. (Archived by WebCite® at http://www.webcitation.org/5dQr5D5ta)
## Seminal paper:  Shrout, P.E. and J.L. Fleiss, Intraclass correlations: uses in assessing rater reliability. Psychol Bull, 1979.

# straight quotes are required; the curly quotes from the blog paste will not parse in R
turk = read.csv("Documents/Thesis/Mechanical Turk/ForGeneralizability.csv")
turk$PMCID = as.factor(turk$PMCID)

model.turk = lm(answer~PMCID+WorkerId,data=turk)
aov.turk = anova(model.turk)
aov.turk

# rows of the ANOVA table: 1 = PMCID (questions), 2 = WorkerId (annotators), 3 = residuals
Annotators.MS = aov.turk$"Mean Sq"[2]
error.MS = aov.turk$"Mean Sq"[3]
questions.MS = aov.turk$"Mean Sq"[1]
Annotators.df = aov.turk$"Df"[2]
questions.df = aov.turk$"Df"[1]

# ICC(2,1): just 1 annotator, for random annotators
ICC.21 = (questions.MS - error.MS)/(questions.MS + (Annotators.df*error.MS)+ (Annotators.df+1)*(Annotators.MS - error.MS)/(questions.df + 1))

# fyi, ICC(2,k): mean of all k annotators, for random annotators
#(questions.MS - error.MS)/(questions.MS + (Annotators.MS - error.MS)/(questions.df + 1))

# number of judges required for a target generalizability
target.gen = seq(0.5, 0.9, by=0.1)
num.required = (target.gen*(1-ICC.21)) / (ICC.21*(1-target.gen))
cbind(target.gen, ceiling(num.required))

# now compare to gold standard
tapply(turk$answer, turk$PMCID, mean)
answers.voted = round(tapply(turk$answer, turk$PMCID, mean))
answers.goldstandard = round(tapply(turk$goldstandard, turk$PMCID, mean))
chisq.test(answers.voted, answers.goldstandard)
twobytwo = table(answers.voted, answers.goldstandard)
correct = twobytwo[1,1]+twobytwo[2,2]
total = sum(twobytwo)
accuracy = (correct)/total
prop.test(correct, total)

# example from webpage, for testing
#Annotators.MS = 32.48
#error.MS = 1.02
#questions.MS = 11.24
#Annotators.df = 3
#questions.df = 5

#### other approaches

# library(irr)  # only handles the case where each rater rates all of the data

### code following Hripcsak approach

Case.estvar = (questions.MS-error.MS)/(Annotators.df+1.0)

#Raters.estvar =
#=(D6-H7)/(C5+1.0)

Resid.estvar = error.MS

ReliabRel.1 = Case.estvar/(Case.estvar+Resid.estvar)
ReliabRel.1

N = 1:10
ReliabRel.n = Case.estvar/(Case.estvar+(Resid.estvar/N))
cbind(N, round(ReliabRel.n, 2))

#### code for fixed raters rather than random

# ICC(3,1): just 1 annotator, if fixed
(questions.MS - error.MS)/(questions.MS + (Annotators.df)*error.MS)

# ICC(3,k): mean of all k annotators, if fixed
(questions.MS - error.MS)/questions.MS
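
A quick way to sanity-check the four formulas is to plug in the worked example from the tutorial page, using the commented-out test values above.  A sketch only: the names prefixed ex. are my own, chosen so that the pilot's variables are not overwritten.

```r
# Mean squares from the tutorial's worked example (the commented-out test values)
ex.questions.MS  <- 11.24
ex.annotators.MS <- 32.48
ex.error.MS      <- 1.02
ex.k <- 3 + 1   # number of annotators = Annotators.df + 1
ex.n <- 5 + 1   # number of questions  = questions.df + 1

ex.icc <- c(
  # random annotators, single annotator: ICC(2,1)
  random.single = (ex.questions.MS - ex.error.MS) /
    (ex.questions.MS + (ex.k - 1) * ex.error.MS +
       ex.k * (ex.annotators.MS - ex.error.MS) / ex.n),
  # random annotators, mean of k annotators: ICC(2,k)
  random.mean = (ex.questions.MS - ex.error.MS) /
    (ex.questions.MS + (ex.annotators.MS - ex.error.MS) / ex.n),
  # fixed annotators, single annotator: ICC(3,1)
  fixed.single = (ex.questions.MS - ex.error.MS) /
    (ex.questions.MS + (ex.k - 1) * ex.error.MS),
  # fixed annotators, mean of k annotators: ICC(3,k)
  fixed.mean = (ex.questions.MS - ex.error.MS) / ex.questions.MS
)
round(ex.icc, 2)
```

If the formulas are transcribed correctly, the result should be 0.29, 0.62, 0.71 and 0.91, matching the ICC(2,1), ICC(2,4), ICC(3,1) and ICC(3,4) values that Shrout and Fleiss report for this example.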

Appendix:  ForGeneralizability.csv

HITId,PMCID,WorkerId,answer,goldstandard
0TDPSQ08X3AMQ4ZW6ZA0,2098796,A3MSJZLFHWJX6O,1,1
0TDPSQ08X3AMQ4ZW6ZA0,2098796,ACR51T5SRMDSJ,1,1
0TDPSQ08X3AMQ4ZW6ZA0,2098796,A2LV383XB7YIDZ,1,1
5YNCKRD84TQ6RYYTPZA0,1941755,A3NG92KYTPURW,0,0
5YNCKRD84TQ6RYYTPZA0,1941755,A3OWHW7XYQU52K,0,0
5Z4024YEMWDWKZ4XGAH0,2242853,A3MSJZLFHWJX6O,0,0
5Z4024YEMWDWKZ4XGAH0,2242853,AS51L372B5DDV,0,0
5Z4024YEMWDWKZ4XGAH0,2242853,A2LV383XB7YIDZ,0,0
6KPRS6ZJVSE0RNWGWZ1Z,2082469,A306GHG0P8TIEJ,1,1
6KPRS6ZJVSE0RNWGWZ1Z,2082469,A3NG92KYTPURW,1,1
823Z2DY2JVCYN1ZSAX3Z,2323238,A3OWHW7XYQU52K,0,0
823Z2DY2JVCYN1ZSAX3Z,2323238,A3NG92KYTPURW,1,0
8XVAZHYKE3WMM0YB4WCZ,1894823,AS51L372B5DDV,1,1
8XVAZHYKE3WMM0YB4WCZ,1894823,A2LV383XB7YIDZ,1,1
8XVAZHYKE3WMM0YB4WCZ,1894823,A3MSJZLFHWJX6O,1,1
AWBZVDSHZ0XERFSAZ2TZ,2082469,A3MSJZLFHWJX6O,1,1
AWBZVDSHZ0XERFSAZ2TZ,2082469,A2LV383XB7YIDZ,1,1
AWBZVDSHZ0XERFSAZ2TZ,2082469,A306GHG0P8TIEJ,1,1
BRBZ57S7VGDZ0KY9V3R0,2323238,A2LV383XB7YIDZ,1,0
BRBZ57S7VGDZ0KY9V3R0,2323238,A3MSJZLFHWJX6O,1,0
BRBZ57S7VGDZ0KY9V3R0,2323238,A2HNP1YL1IBFMU,1,0
DSPZXX8X6X2ZTK11VZK0,2175511,A3MSJZLFHWJX6O,0,1
DSPZXX8X6X2ZTK11VZK0,2175511,A25ZA4IBIEZGNC,1,1
DSPZXX8X6X2ZTK11VZK0,2175511,A2LV383XB7YIDZ,1,1
DWAZWERHTZEZK4M9ZYK0,2222648,A2LV383XB7YIDZ,1,1
DWAZWERHTZEZK4M9ZYK0,2222648,A306GHG0P8TIEJ,1,1
DWAZWERHTZEZK4M9ZYK0,2222648,A3MSJZLFHWJX6O,1,1
EZNZV0ZVJWEZK1ZGQYCZ,2217565,A25ZA4IBIEZGNC,1,0
EZNZV0ZVJWEZK1ZGQYCZ,2217565,A2LV383XB7YIDZ,0,0
EZNZV0ZVJWEZK1ZGQYCZ,2217565,A3MSJZLFHWJX6O,0,0
FXRMTWGRTW3GR9ZM2XQZ,2098796,A3OWHW7XYQU52K,0,1
FXRMTWGRTW3GR9ZM2XQZ,2098796,A3NG92KYTPURW,1,1
KTBT37CQYYFZHN07XJBZ,2175511,A3OWHW7XYQU52K,1,1
KTBT37CQYYFZHN07XJBZ,2175511,A3NG92KYTPURW,1,1
PW8ZP61NX0W4NY5D2ZJ0,2048729,A3NG92KYTPURW,1,1
PW8ZP61NX0W4NY5D2ZJ0,2048729,A3OWHW7XYQU52K,1,1
PXKZTQ5V839ZY6N7CXMZ,2217565,A3NG92KYTPURW,0,0
PXKZTQ5V839ZY6N7CXMZ,2217565,A3OWHW7XYQU52K,0,0
QXRZMCRJGZ7ZYXS4ASY0,2242853,A3NG92KYTPURW,0,0
QXRZMCRJGZ7ZYXS4ASY0,2242853,A3OWHW7XYQU52K,0,0
QXRZZXYSTXEZXTZBFXP0,1894823,A3OWHW7XYQU52K,1,1
QXRZZXYSTXEZXTZBFXP0,1894823,A3NG92KYTPURW,1,1
QZN0SDMVMZGZME5SZWSZ,2222648,A1AJRJD85MWA62,1,1
QZN0SDMVMZGZME5SZWSZ,2222648,A3NG92KYTPURW,1,1
R25TJZW9QAARPJZPKB20,2048729,A2LV383XB7YIDZ,1,1
R25TJZW9QAARPJZPKB20,2048729,A3MSJZLFHWJX6O,1,1
R25TJZW9QAARPJZPKB20,2048729,AS51L372B5DDV,1,1
ZBVJ1JZZDY1Z4CC5Q2ZZ,1941755,A2LV383XB7YIDZ,0,0
ZBVJ1JZZDY1Z4CC5Q2ZZ,1941755,A306GHG0P8TIEJ,0,0
ZBVJ1JZZDY1Z4CC5Q2ZZ,1941755,A3MSJZLFHWJX6O,1,0

October 8, 2008

Requesting feedback on open process for content standard dev

Filed under: Uncategorized — Heather Piwowar @ 2:22 pm

Our innovative and active center for dental informatics is doing some great work.  Anyone have feedback on the request below?  Contact Titus directly.

Subject: Development of an open-source standard for the content of
patient records in general dentistry

Hi everybody,

As some of you may know, we are working on an information model for
patient records in general dentistry. Amit Acharya, a visiting scholar
in the Department of Biomedical Informatics, has been focusing most of
his energy on that project. We have now accumulated a relatively large
list (approximately 1,300 entries) of data elements that could become
part of the information model. This project is part of the portfolio of
the American Dental Association’s Standards Committee for Dental
Informatics (SCDI).

We now face the challenge of validating the list and, later on,
structuring it into an information model. One of the approaches we have
discussed is to do this through an open-source approach with broad input
from many constituencies. Clearly, that is what the SCDI process
entails. However, the old-fashioned approach to developing standards
(production of a draft in a working group over a certain length of time,
posting for All-Interested-Parties comment, reconciliation of input
received, etc.) is somewhat clumsy and cumbersome.

We would like to harness the power of the Internet for this instead
(broad dissemination of early drafts, open annotation of data elements,
some editorial process to keep the whole project in bounds, etc.). We
were wondering whether anyone has seen this kind of approach for
development of a content standard. As is obvious from my comments, we
are looking for a lightweight, agile process that produces results
faster than traditional bureaucratic processes.

If you have any ideas about this, please let us know.

Thanks

Titus

Workshop on Finding and Re-using Public Information

Filed under: Uncategorized — Heather Piwowar @ 2:16 pm

From Jonathan Gray of OKFN.  It sounds fun, I need a teleporter!

We are pleased to announce a workshop on ‘Finding and Re-using Public
Information’, co-organised with the Office of Public Sector Information
(OPSI), the Power of Information (POI) Taskforce and mySociety:

http://blog.okfn.org/2008/10/08/workshop-on-finding-and-re-using-public-information/

Details are as follows:

* When: Saturday 1st November 2008, 1030-1600
* Where: London Knowledge Lab, 23-29 Emerald Street, London, WC1N 3QS.
(see map http://ur1.ca/87q)
* Wiki: http://okfn.org/wiki/PublicInformation
* Participation: Attendance is free. If you are planning to come along
please add your name to the participants list, or email us (info at the
okfn domain).

The UK Government produces and distributes a vast amount of documents
and datasets – from national statistics to environmental information,
from socio-economic data to legal material. Recent technologies allow
this information to be explored, built upon and made accessible in new
ways – whether through visual representation, semantic interlinking, or
through social media applications.

This informal, hands-on workshop will bring government information
experts together with those who are interested in finding and re-using
government information. In addition to focused discussions about legal
and technological aspects of re-use, government information assets will
be documented and tagged on CKAN (http://ckan.net), a registry of
knowledge resources.

September 18, 2008

NCRR strategic plan: strategy to “Facilitate information sharing among biomedical researchers”

Filed under: Uncategorized — Heather Piwowar @ 10:22 am

The US NIH National Center for Research Resources (NCRR) has released its strategic plan for 2009-2013.   Available here in various formats.

Of special note to those interested in data sharing, Strategic Initiative IV, Informatics Approaches to Support Research, includes:

Strategy 1:  Facilitate information sharing among biomedical researchers

…Sharing of raw data has become commonplace in some fields, including the human genome sequence, as well as those of many other species. Other genetic and phenotype data are being collected and made available through the National Library of Medicine. Additionally, many NIH ICs, other federal agencies, and private organizations make data available for research. Many tools for data analysis also are broadly and freely available from NIH, investigators, and organizations. However, there are areas for progress in data availability and, in particular, in sharing of metadata associated with the data (i.e., data that increase the usability and quality of the data)….

Sharing of de-identified raw clinical data and clinical research data is also common, including from the Centers for Medicare and Medicaid Services, NIH studies, and foundations. Careful attention is required to assure privacy and confidentiality in sharing and use of human data for research. Differing and conflicting regulations and approaches have made sharing of clinical data more difficult…

Many challenges still exist to facilitate information sharing for biomedical research:

  • Issues related to accessing and querying text data are well advanced; however, approaches and tools for querying other types of data are much less well developed. This includes image, gene, structural, clinical, and digital data.
  • Collection of metadata associated with the data from all sources is critical for semantic and syntactic interoperability….
  • Data models, structures, and formats also are critical for the sharing of data. Lack of industry standards for machine data and lack of agreement, particularly in emerging research areas, make sharing difficult. There are ongoing efforts in many communities related to these issues.
  • Clear and common agreements on policy and technology requirements for sharing of human data would facilitate sharing of these data.
  • Access to tools for management and analysis of the data is necessary. Many tools and computer cycles can be made widely available using a grid-based computational structure.

Action Items: NCRR will:

  • Work to implement policies that encourage or require investigators to share data collected with NIH support and to describe their data-sharing plans in detail in their applications.
  • Work with academic institutions, patient advocacy groups, NIH ICs, and other agencies to develop procedures that facilitate the sharing of human data for research by its centers and programs while protecting confidentiality and privacy.
  • Continue to support and modernize the BIRN and RTRN data-sharing infrastructure and attempt to facilitate the use of that infrastructure in a variety of research communities.
  • Continue to support COBRE and INBRE bioinformatics core facilities, computational resources, and network connectivity upgrades at IDeA-eligible institutions.
  • Explore ways to work with NSF, DOE, other agencies, and industry to develop tools to analyze large amounts of data and to develop tools to query heterogeneous datasets.

Strategy 5:  Develop an online resource knowledge community for biomedical researchers

To establish effective collaborations and partnerships and use the most effective tools, researchers must have access to and knowledge of state-of-the-art resources, technologies, and people in relevant areas. Many online resource and collaboration networks are arising, driven by this need. However, information about many NIH-supported resources is fragmented and difficult to locate, even via Internet searches. NCRR provides many resources in its multiple programs that could be further utilized.

Action Item: NCRR will pursue the development of a Web-based knowledge community of NCRR resources that encourages access by all biomedical researchers. NCRR will explore tools that allow users to interactively query the resources and community, analyze spatial information, and explore relationships.

Great to see these issues highlighted.  I do think it would be valuable to have additional grant opportunities to evaluate data sharing progress, measure cost/benefits, and focus on reuse.  That said, I bet everybody thinks it would be valuable to have more grant opportunities in their research niche ;)

September 16, 2008

BHAG for Openness

Filed under: Uncategorized — Heather Piwowar @ 1:48 pm

Our department holds annual “state of the union” addresses, to bring everyone up to date on accomplishments, challenges, and goals.  Our department is young (though it has rich, deep, old roots within the university in other incarnations) and as such is still forging an identity.

Last year, our leadership stated that we want to aim to become the “best department of informatics.”   This inspired me to consider what it meant and how we might do it.  I wrote and sent the letter below.

Although our leadership has decided on a different path, I think the idea has merit.  Here you go, internet!  Have you been dabbling in openness and want to take on a challenge?  Ask your department, team, or lab if they are up for a Big Hairy Audacious Goal for Openness :)


Hi Dr B,

I’ve been thinking about the State of the Department presentation you gave last June, specifically the mention of your/our goal of becoming the very best department of biomedical informatics.  I’d like to share a brainstorm with you, if I may, in case it is of interest?

Becoming the Best Department feels like a worthy and important goal to me, but nebulous and distant.   I know that for myself and teams I’ve worked on in the past, it has always helped to have a specific BHAG.  As you likely know, a Big Hairy Audacious Goal — as described in “Built to Last” and “Good to Great” by J. Collins — is something like “a man on the moon in ten years” or “a computer on every desk.”  It would give us a shared vision about what kind of Great Department we would be, how we’d know when we got there, and motivation along the way.

I’d like to propose an idea for a BHAG.  I suggest that we “become the most open department of biomedical informatics.”  By that, I mean we embrace open access, open source, open notebook/process science, open teaching….  the whole shebang.  I think that we already have some core competencies in this area, and are well situated to become a leader.  I believe that this will become the way of the future, though it is still early days and frankly it wouldn’t take much to become a leader at the moment.

By striving for and eventually achieving this BHAG we will accomplish several things:

  • a reputation for leadership in an area which spans all of our diverse projects
  • awareness and citations for our research, as our work is made more freely and widely available
  • attractiveness to elite talent; I believe that some of the best and brightest programmers and academics are attracted to open ideals
  • synergy with our focus on increased publishing output
  • a sense of team

Admittedly, we will encounter obstacles with IRBs, university legal and IP departments, protective researchers, AMIA establishment, and the like.  There are indeed real concerns about trying to do open biomedical research, but I believe that all the issues can be addressed appropriately while striving to be as open as possible, given the real constraints.

To make the idea concrete, here are a few steps which I think would get us a long way down the leadership path:

  • strongly encourage our researchers to self-archive all of their non-open access papers in a global or institutional repository
  • strongly encourage our researchers to make their posters and preprints available on Nature Precedings, or similar
  • provide department funding for publishing in author-pays open access journals
  • strongly encourage our researchers to publish in open access journals
  • strongly encourage our researchers to make their software and statistical scripts open source
  • strongly encourage our researchers to make all data (as appropriate given privacy concerns) publicly available when they publish their papers
  • ensure that all faculty and students have webpages articulating what they are working on, with links to available papers and data
  • take a leadership role within JAMIA to encourage an author-pays open-access option
  • take a leadership role within AMIA to ensure that proceedings are available open access ASAP
  • encourage students and faculty to experiment with Open Notebook Science
  • encourage students and faculty to participate in professional e-communities like Nature Network, Scintilla, and Linked-in
  • work within the NIH and NLM to help as they increase the openness of their projects and the projects they fund
  • make our course documents available on the open web
  • make all of our theses available on the open web

Needless to say, if you are interested in more ideas, just let me know :)  As you can tell, this is an area I deeply care about.  I believe it advances science and engineering as they should be, and will ultimately help advance biomedicine particularly in an age of limited funding.  Furthermore, I believe that working towards this goal would be a boon to our department.

In any event, I certainly appreciate your open door, and the consideration you give these thoughts as part of your vision.

Sincerely,
Heather Piwowar

Pedersen on software availability: full-text available

Filed under: Uncategorized — Heather Piwowar @ 9:13 am

Here is the full text of Ted Pedersen’s article I mentioned yesterday.  Sorry to have missed the self-archive initially.

Enjoy, disseminate, and consider when prioritizing your work….

Empiricism is Not a Matter of Faith (Pedersen), Computational Linguistics, Volume 34, Number 3, pp. 465-470, September 2008.

September 2, 2008

Towards a Data Sharing Culture: Recommendations for Leadership from Academic Health Centers

Filed under: Uncategorized — Heather Piwowar @ 8:04 am

Our paper encouraging data sharing leadership from medical schools and academic-affiliated hospitals has been published today in the Policy Forum at PLoS Medicine:

Citation: Piwowar HA, Becich MJ, Bilofsky H, Crowley RS, on behalf of the caBIG Data Sharing and Intellectual Capital Workspace (2008) Towards a Data Sharing Culture: Recommendations for Leadership from Academic Health Centers. PLoS Med 5(9): e183 doi:10.1371/journal.pmed.0050183

The Policy Forum allows health policy makers around the world to discuss challenges and opportunities for improving health care in their societies.

Sharing biomedical research and health care data is important but difficult. Recognizing this, many initiatives facilitate, fund, request, or require researchers to share their data. These initiatives address the technical aspects of data sharing, but rarely focus on incentives for key stakeholders. Academic health centers (AHCs) have a critical role in enabling, encouraging, and rewarding data sharing. The leaders of medical schools and academic-affiliated hospitals can play a unique role in supporting this transformation of the research enterprise. We propose that AHCs can and should lead the transition towards a culture of biomedical data sharing.

The benefits of data sharing and reuse have been widely reported. We summarize them here, from the perspective of an AHC.

Thanks to Mike Becich for getting the ball rolling, Howard Bilofsky for his emphasis on metrics, caBIG participants and PLoS reviewers and editors for important comments, Randen Pederson for making his thumbnail cairn photo available in Flickr under CC-BY, and especially to Rebecca Crowley for inspired discussions, text, edits, enthusiasm, and mentoring throughout the process.

PLoS wrote up a press release.  Cool, eh?  :)

The paper isn’t perfect, but we do hope it will continue to raise awareness (and action!) about benefits and practical steps for increased scientific openness.

If you have comments, please Write a Response to the paper at the PLoS Medicine site to encourage a broad conversation.  Thanks!

August 15, 2008

Importing PubMed MEDLINE details into mySQL database

Filed under: Uncategorized — Tags: — Heather Piwowar @ 12:45 pm

One more post in my blogging-spree:

I’m doing some text-mining with PubMed MeSH terms, titles, and abstracts. I’ve written quick-and-dirty scripts to parse and analyze PubMed citations before… but enough already. I need a reusable, stable system. Enter Java, Weka, and mySQL.

I need to pull a few thousand PubMed citations into a database. Diane Oliver, Gaurav Bhalotia, Ariel Schwartz, Russ Altman, and Marti Hearst published a paper that describes how to do just that. Complete with Java and Perl source code, and SQL scripts. Sweet!

Tools for loading Medline into a local relational database. Diane E. Oliver, Gaurav Bhalotia, Ariel S. Schwartz, Russ B. Altman, Marti A. Hearst. BMC Bioinformatics 2004 (7 Oct 2004).

Available at BioMedCentral.

Software here.

I’ll leave the overview to the paper and instead outline the issues that I encountered when trying to get the system up and running with a mySQL database and 2008 data. Many of these will be obvious to people familiar with Java and databases. In random order:

  • use a database. I originally thought I’d just use their system to parse the XML and then mine the SQL… but it wasn’t worth it. The database calls are integrated into their code. It is easier to install and use mySQL than to work around it. Plus now it is in a database. Excellent.

In the Java code:

  • add a directory called biotextEngine above the zip extraction directories
  • it looks like maybe the XML spec changed? In MedlineParser.java change the line
    } else if (currentElement.equals("MedlineCitationSet")) {
    to
    } else if (currentElement.equals("PubmedArticle")) {
  • my mySQL didn’t have a schema and there were connection errors. Comment out the two calls to setSchema in BioTextDBConnection
  • get the mysql driver jar file and add it to your classpath environment variable
  • Use this config.properties file:
    #Stores the database connection specific parameters
    driverName=com.mysql.jdbc.Driver
    host=www.PutYoursHere.net
    schema=DoesNotMatterNotUsed
    dbname=PutYoursHere
    user=PutYoursHere
    passwd=PutYoursHere
    urlprefix=jdbc:mysql://
    port=3306
  • to compile: javac biotextEngine/xmlparsers/medline/MedlineParser.java
  • to run: java biotextEngine/xmlparsers/medline/MedlineParser efetch.xm
  • Nice: add a System.out.println(pmid) to startElement in MedlineCitation.java (within the if (currentElement != null) section, make sure to add curly braces) to keep track of progress
  • Nice: change the INSERT to a REPLACE in NodeHandler.java in case the import fails and you need to restart with some records already added
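For anyone who wants to see the shape of the PubmedArticle change and the progress print outside the BioText code, here is a minimal standalone sketch. It is not the authors' parser: the class name PubmedArticleSketch, the PmidHandler helper, and the two made-up PMIDs in the sample XML are all assumptions for illustration. It only shows a SAX handler that treats each PubmedArticle element as one record and prints the record's first PMID as it goes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical standalone sketch; not part of the BioText source code.
public class PubmedArticleSketch {

    /** Collects the first PMID found inside each PubmedArticle element. */
    static class PmidHandler extends DefaultHandler {
        final List<String> pmids = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private boolean inPmid = false;
        private boolean pmidSeen = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("PubmedArticle")) {
                pmidSeen = false;              // a new record begins
            } else if (qName.equals("PMID") && !pmidSeen) {
                inPmid = true;                 // start buffering the PMID text
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inPmid) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (inPmid && qName.equals("PMID")) {
                String pmid = text.toString().trim();
                pmids.add(pmid);
                System.out.println(pmid);      // progress indicator, one line per record
                inPmid = false;
                pmidSeen = true;               // ignore later PMID elements in this record
            }
        }
    }

    /** Parses an XML string and returns the PMIDs in document order. */
    static List<String> parsePmids(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            PmidHandler handler = new PmidHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.pmids;
        } catch (Exception e) {
            throw new RuntimeException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Tiny made-up sample in the PubmedArticleSet/PubmedArticle layout.
        String xml = "<PubmedArticleSet>"
                + "<PubmedArticle><MedlineCitation><PMID>111</PMID></MedlineCitation></PubmedArticle>"
                + "<PubmedArticle><MedlineCitation><PMID>222</PMID></MedlineCitation></PubmedArticle>"
                + "</PubmedArticleSet>";
        System.out.println(parsePmids(xml).size() + " records");
    }
}
```

Printing one PMID per record is crude, but on a multi-thousand-citation import it is enough to tell whether the run is still moving and roughly where it stopped if it dies.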

In the .sql file for creating the database:

  • don’t run the DROP TABLE lines unless you’ve already created the tables
  • change the VARCHAR lengths from 500 to 250 for vars that are included as primary keys
  • remove the word CLUSTER
  • change the word CLOB to LONGTEXT and delete other stuff on that line
  • my import file had a null grant_id, so: remove the NOT NULL constraint on grant_id in the grants table. Add an auto-increment rowid to that table to use as the primary key instead of pmid+grant_id, and add pmid+grant_id as an index
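To make the grants-table change concrete, here is a rough sketch of what the reworked MySQL definition might look like, held in a small Java class so it can be printed or passed to a JDBC Statement. The table name pubmed_grant, the column types, and the index name are assumptions and not copied from the authors' .sql file. Only the idea comes from the list above: a surrogate auto-increment rowid as primary key, a nullable grant_id, and an index on pmid+grant_id.

```java
// Hypothetical sketch; names and column types are illustrative only.
public class GrantsTableSketch {

    /** Returns MySQL DDL for a grants table keyed by a surrogate rowid. */
    static String grantsDdl() {
        return String.join("\n",
            "CREATE TABLE pubmed_grant (",
            "  rowid    INT NOT NULL AUTO_INCREMENT,",   // surrogate key replaces pmid+grant_id
            "  pmid     INT NOT NULL,",
            "  grant_id VARCHAR(250) NULL,",             // nullable: some records have no grant id
            "  PRIMARY KEY (rowid),",
            "  INDEX idx_pmid_grant (pmid, grant_id)",   // keeps lookups by pmid+grant_id fast
            ");");
    }

    public static void main(String[] args) {
        // Print the statement; with a live connection it could instead be
        // passed to java.sql.Statement.execute.
        System.out.println(grantsDdl());
    }
}
```

The point of the surrogate key is that MySQL does not allow NULL in primary key columns, so a record with a missing grant_id cannot be stored while grant_id is part of the key.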

And what’s up with the spacemission table???

It took me a day from beginning to end, with rusty Java and SQL. Huge thanks to the authors for this resource, and to Jon Lustgarten for convincing me that it was worth the tangent to start using Eclipse.

Quosa usage can violate PubMed Central terms of service

Filed under: Uncategorized — Tags: , , — Heather Piwowar @ 11:20 am

Has anyone else had a problem with Quosa and PubMed Central?

Quosa sounds great.  “Full-text journal workflow solutions.”  Exactly what I need, poof, no custom code required.

I downloaded the free demo.  I wanted some full-text for text mining: a large number of articles from years ago… they are available for free on the publisher’s website.  Quosa was simple to get going, and the first articles were retrieved with no problem.  But soon, lo and behold, rather than the articles I was getting a webpage with a message from PubMed Central:  IP blocked due to Bulk downloading of content.  Aaaaah!

I’m aware of PMC’s policy and have designed custom download methods to respect it (while working to get it changed).  I didn’t realize that Quosa was downloading the articles from PMC.  In retrospect it is clear, because the early articles it retrieved were in PMC format, though that was not obvious from the download interface.  Most people aren’t going to be aware of PMC’s restrictions, and so won’t know to look for the problem.

Quosa doesn’t seem to have any mechanism for specifying the source of the articles (so that I could have requested they be downloaded from the publisher’s website).  I must admit that in my late-evening haste I didn’t read the licensing agreements when I installed the demo… perhaps there were some warnings there?  The issue doesn’t seem to be covered in their FAQ.

Hmmmmm.  No more Quosa for me.  I’ll email them and ask for clarification, for the record.  I’ve also emailed PMC to ask for forgiveness and an unblocking.
