Research Remix

December 29, 2008

Generalizability coefficient for Mechanical Turk annotations

Filed under: Uncategorized — Heather Piwowar @ 11:14 am

Hello dear neglected blog.  I’ve been busy working on various things and posting about none of them, unfortunately.  I just saw a post by Jean-Claude Bradley that has spurred me to action, and reminded me why open science and open halfway-done results are useful… you never know who might be thinking about similar things.

To wit:  I spent a few weeks earlier this month experimenting with Mechanical Turk.  I was planning to use it to get annotations for my thesis… then lo and behold, a ready-made gold standard arrived on my doorstep, so unfortunately I’m putting further experimentation on the back burner.

As part of the early fiddling, though, I did dig out a bunch of lovely references about Mechanical Turk, draft a few paragraphs that were going to go in my proposal, run a pilot study, and write some R code to estimate a generalizability coefficient.  Maybe this will be useful to someone?

In case you aren’t familiar, a generalizability coefficient is a measure of reliability for annotations.  A rule of thumb is that a coefficient of 0.7 is good enough if you are going to use the annotations to evaluate a system, but >0.9 is necessary if you are going to use the annotations to refine the system on a case-by-case basis.  Hripcsak et al. wrote a nice paper about this.  I’ve refined their approach a bit to account for the fact that when using Mechanical Turk you don’t have a fixed set of annotators across all the questions, based on the excellent tutorial by Michael Brannick here on the Shrout and Fleiss approach.

Warning:  I’m not a stats expert and this code is rough, so no guarantees… but I think it is a valid approach.  Critiques and thoughts?  Leave a comment, I’d love to hear them.

Ok, with that preamble, here’s a raw cut and paste of my work to date.  Let me know if it isn’t understandable… apologies for the lack of comments, etc.   I’ll maybe try to pretty up the descriptions and code a bit and submit to Nature Precedings so that it is citable, at some point when I have time.  In the meantime, better shared in a dirty state than hidden on my hard drive, right?

For what it is worth… Mechanical Turk looked like a really promising approach for my application :)

Generalizability Coefficient for Mechanical Turk Annotations
Heather Piwowar
Rough Draft, Dec 2008

Amazon’s Mechanical Turk for annotation

It is challenging to assemble an unbiased reference standard quickly and inexpensively.  Several recent studies have explored using Amazon’s Mechanical Turk, a distributed online micro-market, to annotate text[1-5]. Their experiences suggest that this approach to collecting and consolidating non-expert labels is a feasible method for constructing a useful annotation corpus.

While some of the data collected this way is noisy, the invalid responses come from a small minority of users[6] and do not constitute a major problem.  Snow et al. estimate that four non-expert labelers per item are sufficient to emulate expert-level label quality for an affect recognition task[2].  Other research has demonstrated methods for reweighting labels[2] and developed recommendations for designing a good task[6, 7].
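A toy sketch of the majority-vote consolidation these studies rely on (the labels below are made up for illustration, not drawn from any of the cited experiments): each of five hypothetical annotators gets one item wrong, yet the vote recovers the true label on every item.

```r
# Hypothetical binary labels: 5 annotators x 4 items.
truth  <- c(1, 0, 1, 0)
labels <- rbind(
  c(1, 0, 1, 1),   # annotator 1 errs on item 4
  c(1, 0, 0, 0),   # annotator 2 errs on item 3
  c(1, 1, 1, 0),   # annotator 3 errs on item 2
  c(0, 0, 1, 0),   # annotator 4 errs on item 1
  c(1, 0, 1, 0)    # annotator 5 is perfect
)

# Majority vote: an item is labeled 1 if more than half the annotators say 1
majority <- as.integer(colMeans(labels) > 0.5)
all(majority == truth)   # TRUE: the vote is correct on every item
```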

Since a typical job (or HIT, for Human Intelligence Task) pays only a few cents, collecting annotations this way can be very cost effective.  However, there are a few reasons not to pay too little:  low payments may lead to sloppier annotations[7], there are ethical concerns about offering a fair wage[8], and tasks with very low bids simply won’t be picked up by workers (Turkers).

Pilot annotation study with Amazon’s Mechanical Turk
We conducted a pilot annotation study with Amazon’s Mechanical Turk to estimate the accuracy with which annotation tasks can be performed by this group of non-experts, the number of independent annotations necessary to get sufficient generalizability, and the cost of annotation.

We selected the task of identifying which papers about microarray data actually generated microarray data, since it is more nebulous, holistic, and challenging than identifying explicit statements of data sharing.  We selected 10 random articles from PubMed Central that contained the words gene, expression, and microarray.  We provided background information about the nature of gene expression experiments and criteria for what we consider a paper that includes a gene expression microarray experiment.  We then provided a link to PubMed Central and asked annotators a) whether the paper included the running of a gene expression microarray experiment, and b) to cut and paste an excerpt from the paper that supported their choice.

The results agreed well with my “expert” annotations.  The majority vote of the five annotations was the same as my annotation in 9 of 10 cases, indicating 90% accuracy, albeit with a very wide confidence interval from 0.54 to 0.99.  A chi-squared test between the majority vote and my annotations was not significant (p=0.07).  As another validation, the majority vote identified all 4 of the 10 articles with links from GEO as “dataset creating” and the 2 of 10 articles in bioinformatics journals (a proxy for non-dataset-producing articles) as not dataset creating.  This preliminary evidence supports the validity of the Turker annotations.
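The wide interval is what prop.test (the same call used in the appendix code) reports for 9 agreements out of 10; a quick check:

```r
# Majority vote agreed with the expert on 9 of 10 articles.
res = prop.test(9, 10)
res$estimate     # point estimate: 0.9
res$   # approximately 0.54 to 0.99
```

With only 10 articles, even perfect-looking agreement leaves the interval very wide, which is why this is described as preliminary evidence only.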

To establish the reliability of the annotations, I calculated the generalizability coefficient for random annotators.  The calculation approach parallels that outlined by Hripcsak et al. [9], but assumes that each question is answered by a different, randomly-selected set of annotators.  Using the formulas of [10] (helpfully interpreted at [11]), the pilot generalizability coefficient is 0.87, and the estimated number of similar annotations needed to achieve various target generalizability coefficients is given in the table below.
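As a sanity check on the formula, plugging in the published mean squares from the Shrout and Fleiss worked example (six targets rated by four judges; these are the same test values included in the appendix code, not my pilot data) reproduces their reported ICC(2,1) of 0.29:

```r
# Mean squares and degrees of freedom from the Shrout & Fleiss (1979)
# worked example: 6 targets (questions) rated by 4 judges (annotators).
questions.MS  = 11.24
Annotators.MS = 32.48
error.MS      = 1.02
Annotators.df = 3    # judges - 1
questions.df  = 5    # targets - 1

# ICC(2,1): reliability of a single randomly-selected annotator
ICC.21 = (questions.MS - error.MS) /
  (questions.MS + Annotators.df*error.MS +
     (Annotators.df + 1)*(Annotators.MS - error.MS)/(questions.df + 1))
round(ICC.21, 2)   # 0.29, matching the published value
```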

Table 1:  Generalizability coefficients

Target generalizability coefficient  |  Estimated number of annotations per task required to achieve target, for the pilot task
0.5                                  |  2
0.6                                  |  3
0.7                                  |  4
0.8                                  |  6
0.9                                  |  13
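The annotator counts in the table come from the Spearman-Brown step in the appendix code.  For illustration, a single-annotator coefficient of roughly 0.41 (an assumed value, since the post reports only the pooled pilot coefficient) reproduces the table exactly:

```r
# Spearman-Brown prophecy: number of annotators needed to raise a
# single-annotator reliability r to each target reliability.
r = 0.41   # assumed single-annotator generalizability (illustrative)
target.gen = seq(0.5, 0.9, by = 0.1)
num.required = (target.gen*(1 - r)) / (r*(1 - target.gen))
ceiling(num.required)   # 2 3 4 6 13 annotators, as in Table 1
```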

Hripcsak et al. [9] propose that for system evaluation a generalizability coefficient of 0.7 is sufficient.  Our results suggest that 4 annotators would be sufficient for this task, consistent with the findings of Snow et al.[2].  We will likely use 5 annotators, to facilitate a majority vote without ties.
We ran the experiment in two phases.  In the first, we recruited 2 opinions per paper at $0.50 per answer + a $0.50 bonus for each answer that matched the “expert” opinion.  In the second phase, we recruited an additional 3 opinions at $0.15 per answer + a $0.10 bonus.  Accuracy and accrual rates did not vary with payment.
In summary, we believe that using Amazon’s Mechanical Turk will be an efficient, accurate way to establish gold standards for this study.  We anticipate gathering 5 opinions per article at a cost of $0.25 each.

Risks and Contingency Plans
I have contingency plans in case annotation through Mechanical Turk proves very noisy.  I would refine the annotation task to provide gold-standard feedback[7], use a qualification test[12], or weight labelers based on accuracy[2].

1.    Sheng, V., F. Provost, and P. Ipeirotis, Get Another Label? Improving Data Quality and Data Mining., 2008.
2.    Snow, R., et al., Cheap and Fast—But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. Proceedings of EMNLP-08, 2008.
3.    Nakov, P., Paraphrasing Verbs for Noun Compound Interpretation. Proc. of the Workshop on Multiword Expressions, 2008.
4.    Li, B., et al., Exploring question subjectivity prediction in community QA. Proceedings of the 31st annual international ACM SIGIR …, 2008.
5.    Yakhnenko, O. and B. Rosario, Mining the Web for Relations between Digital Devices using a Probabilistic Maximum Margin Model. Proceedings of the Third International Joint Conference on Natural Language Processing, 2008.
6.    Kittur, A., E.H. Chi, and B. Suh, Crowdsourcing user studies with Mechanical Turk. CHI ’08: Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, 2008.
7.    Sorokin, A. and D. Forsyth, Utility data annotation with Amazon Mechanical Turk. Computer Vision and Pattern Recognition Workshops, 2008.
8.    Pontin, J., Artificial Intelligence, With Help From the Humans. The New York Times, 2007(March 24, 2007).
9.    Hripcsak, G., et al., A Reliability Study for Evaluating Information Extraction from Radiology Reports. Journal of the American Medical Informatics Association, 1999.
10.    Shrout, P.E. and J.L. Fleiss, Intraclass correlations: uses in assessing rater reliability. Psychol Bull, 1979.
11.    Brannick, M.T. Shrout and Fleiss Computations for Intraclass Correlations for Interjudge Reliability. Accessed 2008-12-29; archived by WebCite®.
12.    Dakka, W. and P. Ipeirotis, Automatic Extraction of Useful Facet Hierarchies from Text Databases. Data Engineering, 2008.

Appendix:  R code for generalizability coefficient

## Code by Heather Piwowar,  December 2008

## Approach inspired by Hripcsak G, Kuperman GJ, Friedman C, Heitjan DF. A reliability study for evaluating information extraction from radiology reports. J Am Med Inform Assoc 1999;6:143–50.
## Equations from the helpful tutorial: Brannick, Shrout and Fleiss Computations for Intraclass Correlations for Interjudge Reliability. Accessed 2008-12-29; archived by WebCite®.
## Seminal paper:  Shrout, P.E. and J.L. Fleiss, Intraclass correlations: uses in assessing rater reliability. Psychol Bull, 1979.

turk = read.csv("Documents/Thesis/Mechanical Turk/ForGeneralizability.csv")
turk$PMCID = as.factor(turk$PMCID)

# two-way ANOVA: questions (PMCID) crossed with annotators (WorkerId)
model.turk = lm(answer ~ PMCID + WorkerId, data = turk)
aov.turk = anova(model.turk)

questions.MS = aov.turk$"Mean Sq"[1]
Annotators.MS = aov.turk$"Mean Sq"[2]
error.MS = aov.turk$"Mean Sq"[3]
questions.df = aov.turk$"Df"[1]
Annotators.df = aov.turk$"Df"[2]

# ICC(2,1): generalizability of just 1 random annotator
ICC.21 = (questions.MS - error.MS)/(questions.MS + (Annotators.df*error.MS) + (Annotators.df+1)*(Annotators.MS - error.MS)/(questions.df + 1))

# fyi, ICC(2,k): generalizability of the mean of all k random annotators
#(questions.MS - error.MS)/(questions.MS + (Annotators.MS - error.MS)/(questions.df + 1))

# number of judges required for a target generalizability
target.gen = seq(0.5, 0.9, by=0.1)
num.required = (target.gen*(1-ICC.21)) / (ICC.21*(1-target.gen))
cbind(target.gen, ceiling(num.required))

# now compare to gold standard
tapply(turk$answer, turk$PMCID, mean)
answers.voted = round(tapply(turk$answer, turk$PMCID, mean))
answers.goldstandard = round(tapply(turk$goldstandard, turk$PMCID, mean))
chisq.test(answers.voted, answers.goldstandard)
twobytwo = table(answers.voted, answers.goldstandard)
correct = twobytwo[1,1]+twobytwo[2,2]
total = sum(twobytwo)
accuracy = (correct)/total
prop.test(correct, total)

# example from webpage, for testing
#Annotators.MS = 32.48
#error.MS = 1.02
#questions.MS = 11.24
#Annotators.df = 3
#questions.df = 5

#### other approaches

# library(irr)  # only handles the case where every rater rates all items

### code following Hripcsak approach

Case.estvar = (questions.MS-error.MS)/(Annotators.df+1.0)

#Raters.estvar =

Resid.estvar = error.MS

ReliabRel.1 = Case.estvar/(Case.estvar+Resid.estvar)

N = 1:10
ReliabRel.n = Case.estvar/(Case.estvar+(Resid.estvar/N))
cbind(N, round(ReliabRel.n, 2))

#### code for fixed raters rather than random

# ICC(3,1): just 1 annotator, if annotators are fixed
(questions.MS - error.MS)/(questions.MS + Annotators.df*error.MS)

# ICC(3,k): the mean of all k annotators, if fixed
(questions.MS - error.MS)/questions.MS

Appendix:  ForGeneralizability.csv



  1. Could you share your annotation data and gold standard?

    I’d like to add your experiment to my inter-annotator data set, because I’m trying to establish the robustness of Bayesian approaches to inferring gold standards, problem difficulties, and coder accuracies. I’m also releasing all my data along with the R and BUGS code on the LingPipe sandbox.

    Here’s a link to the blog entry linking to the paper about the models I’ve been using:

    Comment by lingpipe — December 30, 2008 @ 10:50 am

  2. Absolutely you may include it. Have at it. It sounds like an interesting problem!
    Let me know if you have any questions about the data, such as it is.

    Comment by Heather Piwowar — December 31, 2008 @ 9:02 am

  3. Was the data at the end of the R code all you collected? I’ll probably need a bit more data than that to reliably infer annotator accuracies.

    Is PMCID the article (PubMed?) ID?

    We’re going to be running lots of Mechanical Turk jobs in the next few months (we have an intern coming who’ll spend most of her time on it). We could collect more of this kind of data — the problem’s very interesting because it’s so directly relevant to a researcher.

    Comment by lingpipe — January 2, 2009 @ 5:19 pm

  4. Yes, unfortunately that is all of the data I collected. As I was writing up the pilot study, I ran across a recent survey with 400 datapoints annotated by experts, so it was no longer necessary to derive such a dataset via MTurk.

    PMCID is the PubMed Central ID (related to but not the same as the PubMed ID).

    If you do have resources in search of a similar problem, I certainly have some I could suggest. For example, I’d love to have access to an annotated set of studies that REUSE datasets. A start is the list maintained by GEO… but it doesn’t have any true negatives. A reuse list would be interesting in and of itself as a measure of the prevalence of data reuse (= benefits of data sharing), and could be used to evaluate an NLP engine for identifying reuse (which could then be used for a more systematic analysis of reuse patterns).

    That said, I’m sure everybody on the bioNLP mailing list would have their own ideas about annotation datasets they’d love to see ;)

    Anyway, let me know if I can be of more help.


    Comment by Heather Piwowar — January 4, 2009 @ 3:50 pm

  5. […] We believe three replicates would be sufficient, but a bit of experimentation may be needed to understand how many classifications are needed to achieve sufficient accuracy.  A master’s student with a bachelor’s degree in forestry was able to complete the task accurately with little training.  Five replicates achieved the necessary generalizability when we asked people on Mechanical Turk to complete a more complex task based on the same sort of papers in the biomedical literature (details:…). […]

    Pingback by Proposal inviting Citizen Scientists to enrich the scientific literature « Research Remix — December 16, 2011 @ 6:00 am
