Hello dear neglected blog. I’ve been busy working on various things and posting about none of them, unfortunately. Just saw a post by Jean-Claude Bradley that has spurred me to action, and reminded me about why open science and open halfway-done-results are useful… you never know who might be thinking about similar things.
To wit: I spent a few weeks earlier this month experimenting with Mechanical Turk. I was planning to use it to get annotations for my thesis… then lo and behold, a ready-made gold standard arrived on my doorstep, so unfortunately I’m putting further experimentation on the back burner.
As part of the early fiddling, though, I did dig out a bunch of lovely references about Mechanical Turk, draft a few paragraphs that were going to go in my proposal, run a pilot study, and write some R code to estimate a generalizability coefficient. Maybe this will be useful to someone?
In case you aren’t familiar, a generalizability coefficient is a measure of reliability for annotations. A rule of thumb is that a gen coef of 0.7 is good if you are going to use the annotations to evaluate a system, but >0.9 is necessary if you are going to use the annotations to refine the system on a case-by-case basis. Hripcsak et al. wrote a nice paper about this. I’ve refined their approach a bit to account for the fact that when using Mech Turk you don’t have a fixed set of annotators across all the questions, based on the excellent tutorial by Michael Brannick here on the Shrout and Fleiss approach.
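To make those thresholds concrete, here is a tiny R sketch of the Spearman-Brown prophecy formula, which relates single-annotator reliability to the reliability of a mean over k annotators. The single-annotator reliability of 0.5 is just an assumed value for illustration, not a result from my pilot:

```r
# Spearman-Brown prophecy: reliability of the mean of k annotators,
# given the reliability r1 of a single annotator (r1 = 0.5 is an assumed value)
r1 = 0.5
k = 1:10
rel.k = (k * r1) / (1 + (k - 1) * r1)
round(rel.k, 2)

# annotators needed to reach a target generalizability coefficient
target = c(0.7, 0.9)
ceiling((target * (1 - r1)) / (r1 * (1 - target)))   # 3 and 9 annotators
```

So under this assumed starting reliability, the jump from the 0.7 evaluation threshold to the 0.9 case-by-case threshold triples the annotators you need.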
Warning: I’m not a stats expert and this code is rough, so no guarantees… but I think it is a valid approach. Critiques and thoughts? Leave a comment, I’d love to hear them.
Ok, with that preamble, here’s a raw cut and paste of my work to date. Let me know if it isn’t understandable… apologies for lack of comments, etc. I’ll maybe try to pretty up the descriptions and code a bit and submit to Nature Precedings so that it is citable, at some point when I have time. In the meantime, better shared in a dirty state than hidden on my hard drive, right?
For what it is worth… Mechanical Turk looked like a really promising approach for my application :)
Generalizability Coefficient for Mechanical Turk Annotations
Rough Draft, Dec 2008
Amazon’s Mechanical Turk for annotation
It is challenging to assemble an unbiased reference standard quickly and inexpensively. Several recent studies have explored using Amazon’s Mechanical Turk, a distributed online micro-market, to annotate text[1-5]. Their experiences suggest that this approach to collecting and consolidating non-expert labels is a feasible method for constructing a useful annotation corpus.
While some of the data is very noisy, the invalid responses come from a small minority of users and do not constitute a major problem. Snow et al. estimate that four non-expert labelers per item are sufficient to emulate expert-level label quality for an affect recognition task. Other research has demonstrated methods for reweighting labels and offered recommendations for designing good tasks.[6, 7]
Since a typical job (or HIT, for Human Intelligence Task) pays only a few cents, collecting annotations this way can be very cost effective. However, there are a few reasons not to pay too little: low payments may lead to sloppier annotations, there are ethical concerns about offering a fair wage, and tasks with very low bids simply won’t be picked up by workers (Turkers).
Pilot annotation study with Amazon’s Mechanical Turk
We conducted a pilot annotation study with Amazon’s Mechanical Turk to estimate the accuracy with which annotation tasks can be performed by this group of non-experts, the number of independent annotations necessary to get sufficient generalizability, and the cost of annotation.
We selected the task of identifying which papers about microarray data actually generated microarray data, since it is more nebulous, holistic, and challenging than identifying statements of data sharing. We selected 10 random articles from PubMed Central that contained the words gene, expression, and microarray. We provided background information about the nature of gene expression experiments and criteria for what we consider a paper that includes a gene expression microarray experiment. We then provided a link to PMC and asked annotators a) whether the paper included the running of a gene expression microarray experiment, and b) to cut and paste an excerpt from the paper that supports their choice.
The results agreed closely with my “expert” annotations. The majority vote of the five annotations matched my annotations in 9 of 10 cases, indicating 90% accuracy, albeit with a very wide confidence interval of 0.54 to 0.99. A chi-squared test between the majority vote and my annotations was not significant (p=0.07). As another validation, the majority vote identified all 4 of the 10 articles with links from GEO as “dataset creating”, and the 2 of 10 articles in bioinformatics journals (a proxy for non-dataset-producing) as not dataset creating. This preliminary evidence supports the validity of the Turker annotations.
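The agreement calculation is easy to reproduce in R. The label vectors below are made-up stand-ins for the real pilot data, arranged so that the majority vote matches the expert in 9 of 10 cases; binom.test gives an exact binomial confidence interval, which is very wide at n=10:

```r
# Made-up illustration vectors, one label per article (not the real pilot data)
votes.majority = c(1, 1, 0, 1, 0, 0, 1, 0, 1, 1)
votes.expert   = c(1, 1, 0, 1, 0, 0, 1, 0, 1, 0)

n.agree  = sum(votes.majority == votes.expert)
accuracy = n.agree / length(votes.expert)            # 0.9
binom.test(n.agree, length(votes.expert))$conf.int   # wide exact binomial CI
```

The width of that interval is the real lesson of a 10-article pilot: the point estimate looks great, but the data are compatible with much lower accuracy.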
To establish the reliability of the annotations, I calculated the generalizability coefficient for random annotators. The calculation approach parallels that outlined by Hripcsak et al. [9], but assumes that each question is answered by a different, randomly-selected set of annotators. Using the formulas of Shrout and Fleiss [10] (helpfully interpreted in Brannick’s tutorial [11]), the pilot generalizability coefficient is 0.87, and the estimated numbers of similar annotations needed to achieve various target generalizability coefficients are given in the table below.
Table 1: Generalizability coefficients
Target generalizability coefficient | Estimated number of annotations per task required to achieve target, for the pilot task
Hripcsak et al. [9] propose that a generalizability coefficient of 0.7 is sufficient for system evaluation. Our results suggest that 4 annotators would be sufficient for this task, consistent with the findings of Snow et al. [2]. We will likely choose 5 annotators, to facilitate a majority vote without ties.
We ran the experiment in two phases. In the first, we recruited 2 opinions per paper at $0.50 per answer plus a $0.50 bonus for each answer that matched the “expert” opinion. In the second phase, we recruited an additional 3 opinions at $0.15 per answer plus a $0.10 bonus. Accuracy and accrual rates did not vary with payment.
In summary, we believe that using Amazon’s Mechanical Turk will be an efficient, accurate way to establish gold standards for this study. We anticipate gathering 5 opinions per article at a cost of $0.25 each.
Risks and Contingency Plans
I have contingency plans in case the annotation through Mechanical Turk is very noisy. I would refine the annotation task to provide gold standard feedback, use a qualification test, or weight labelers based on accuracy.
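As a sketch of the last contingency, weighting labelers by accuracy might look like the snippet below. All worker names, accuracy values, and labels here are hypothetical, invented purely to illustrate the weighted vote:

```r
# Hypothetical example: weight each labeler by accuracy on seeded gold-standard items
labels = data.frame(worker = c("A", "A", "B", "B", "C", "C"),
                    item   = c(1, 2, 1, 2, 1, 2),
                    answer = c(1, 0, 1, 1, 0, 0))
worker.accuracy = c(A = 0.9, B = 0.6, C = 0.5)   # assumed accuracies on gold items

# look up each label's weight, then take an accuracy-weighted vote per item
w = worker.accuracy[as.character(labels$worker)]
weighted.vote = round(tapply(w * labels$answer, labels$item, sum) /
                      tapply(w, labels$item, sum))
weighted.vote   # item 1 -> 1, item 2 -> 0
```

In this toy case the high-accuracy worker A outvotes the two weaker workers on item 1, while the unweighted majority would have gone the same way; the weighting matters most when reliable and unreliable workers disagree.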
1. Sheng, V., F. Provost, and P. Ipeirotis, Get Another Label? Improving Data Quality and Data Mining. archive.nyu.edu, 2008.
2. Snow, R., et al., Cheap and Fast—But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. Proceedings of EMNLP-08, 2008.
3. Nakov, P., Paraphrasing Verbs for Noun Compound Interpretation. Proc. of the Workshop on Multiword Expressions, 2008.
4. Li, B., et al., Exploring question subjectivity prediction in community QA. Proceedings of the 31st annual international ACM SIGIR …, 2008.
5. Yakhnenko, O. and B. Rosario, Mining the Web for Relations between Digital Devices using a Probabilistic Maximum Margin Model. Proceedings of the Third International Joint Conference on Natural Language Processing, 2008.
6. Kittur, A., E.H. Chi, and B. Suh, Crowdsourcing user studies with Mechanical Turk. CHI ’08: Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, 2008.
7. Sorokin, A. and D. Forsyth, Utility data annotation with Amazon Mechanical Turk. Computer Vision and Pattern Recognition Workshops, 2008.
8. Pontin, J., Artificial Intelligence, With Help From the Humans. The New York Times, March 25, 2007.
9. Hripcsak, G., et al., A Reliability Study for Evaluating Information Extraction from Radiology Reports. Journal of the American Medical Informatics Association, 1999.
10. Shrout, P.E. and J.L. Fleiss, Intraclass correlations: uses in assessing rater reliability. Psychol Bull, 1979.
11. Brannick, MT. Shrout and Fleiss Computations for Intraclass correlations for interjudge reliability. URL:http://luna.cas.usf.edu/~mbrannic/files/pmet/shrout1.htm. Accessed: 2008-12-29. (Archived by WebCite® at http://www.webcitation.org/5dQr5D5ta)
12. Dakka, W. and P. Ipeirotis, Automatic Extraction of Useful Facet Hierarchies from Text Databases. Data Engineering, 2008.
Appendix: R code for generalizability coefficient
## Code by Heather Piwowar, December 2008
## Approach inspired by Hripcsak G, Kuperman GJ, Friedman C, Heitjan DF. A reliability study for evaluating information extraction from radiology reports. J Am Med Inform Assoc 1999;6:143–50.
## Equations from the helpful tutorial at the webpage Shrout and Fleiss Computations for Intraclass correlations for interjudge reliability. URL:http://luna.cas.usf.edu/~mbrannic/files/pmet/shrout1.htm. Accessed: 2008-12-29. (Archived by WebCite® at http://www.webcitation.org/5dQr5D5ta)
## Seminal paper: Shrout, P.E. and J.L. Fleiss, Intraclass correlations: uses in assessing rater reliability. Psychol Bull, 1979.
turk = read.csv("Documents/Thesis/Mechanical Turk/ForGeneralizability.csv")
turk$PMCID = as.factor(turk$PMCID)
model.turk = lm(answer~PMCID+WorkerId, data=turk)
aov.turk = anova(model.turk)
# anova table rows: 1 = PMCID (questions), 2 = WorkerId (annotators), 3 = Residuals (error)
questions.MS = aov.turk$"Mean Sq"[1]
Annotators.MS = aov.turk$"Mean Sq"[2]
error.MS = aov.turk$"Mean Sq"[3]
questions.df = aov.turk$Df[1]
Annotators.df = aov.turk$Df[2]
# ICC(2,1): reliability of a single randomly-selected annotator
ICC.21 = (questions.MS - error.MS)/(questions.MS + (Annotators.df*error.MS) + (Annotators.df+1)*(Annotators.MS - error.MS)/(questions.df + 1))
# fyi, ICC(2,k): reliability of the mean of all k random annotators
#(questions.MS - error.MS)/(questions.MS + (Annotators.MS - error.MS)/(questions.df + 1))
# number of judges required for a target generalizability
target.gen = seq(0.5, 0.9, by=0.1)
num.required = (target.gen*(1-ICC.21)) / (ICC.21*(1-target.gen))
# now compare to gold standard
tapply(turk$answer, turk$PMCID, mean)
answers.voted = round(tapply(turk$answer, turk$PMCID, mean))
answers.goldstandard = round(tapply(turk$goldstandard, turk$PMCID, mean))
twobytwo = table(answers.voted, answers.goldstandard)
correct = twobytwo[1,1]+twobytwo[2,2]
total = sum(twobytwo)
accuracy = (correct)/total
# example from webpage, for testing
#Annotators.MS = 32.48
#error.MS = 1.02
#questions.MS = 11.24
#Annotators.df = 3
#questions.df = 5
#### other approaches
# library(irr) # irr's icc() assumes every rater rates every item, which doesn't hold for Mechanical Turk
### code following Hripsak approach
Case.estvar = (questions.MS-error.MS)/(Annotators.df+1.0)
Resid.estvar = error.MS
ReliabRel.1 = Case.estvar/(Case.estvar+Resid.estvar)
N = 1:10
ReliabRel.n = Case.estvar/(Case.estvar+(Resid.estvar/N))
cbind(N, round(ReliabRel.n, 2))
#### code for fixed raters rather than random
# just 1 if fixed
(questions.MS - error.MS)/(questions.MS + (Annotators.df)*error.MS)
# all k if fixed
(questions.MS - error.MS)/questions.MS