I submitted a proposal to the Citizen Science Alliance yesterday. It is really exciting! I’d briefly discussed the idea of classify-citations-to-find-data-reuse with members of Zooniverse previously: they suggested that I keep the scope small to start and focus on an area of interest to the public, like cancer. Thanks to Todd Vision for great feedback on a last-minute draft.
Here are the main bits of the proposal. Feedback welcome!
Tracking the building blocks of cancer research
Funders and charities spend a lot of money on cancer research. Are we getting as much research progress as we can from this funding? How is cancer research used to make additional discoveries? Are there lessons we can learn from current behaviour to improve research efficiency in the future?
These questions are currently hard to answer because it is difficult to identify which resources have contributed to new research. This project will invite volunteers to enrich the scientific literature by classifying literature citations to previous work. Researchers cite previous work for many reasons: only through examining citation context can we differentiate citations made to reference background information from citations that attribute the reuse of materials, methods, software, and datasets.
We propose to begin attribution tracking in a specific domain: cancer-related gene expression microarray datasets. By categorizing citations to papers that describe such datasets we will begin to understand how often these datasets are used by others, how scientists attribute reuse of these research building blocks, and what contribution the data has made to research progress. We hope these findings will help make similar information more easily discoverable in the future.
This is a great chance for citizen scientists to begin enriching the scientific literature. Subscription and licensing restrictions make most of the scientific literature off-limits to automated markup. Human-aided text mining of freely available full text has the potential to extract a coherent and incredibly useful body of knowledge from literature that is otherwise unavailable for large-scale markup.
Volunteers will be given a link to an article in PubMed Central and text that identifies a specific item in that article’s References section. They will be asked to find all mentions of the Reference item in the article full text, record the section, extract the sentences surrounding the citation, and classify the citation context into one of a few broad categories (cited for background information, to attribute use of method, to attribute use of data, etc). Finally, the volunteers will be asked to make an assessment about whether the cited resource played a significant role in the conduct of the reported research project.
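As a rough illustration of what one completed volunteer task might look like as a record, here is a minimal sketch. The class and field names are hypothetical, and the `OTHER` category is an assumption standing in for the "etc" above; the proposal itself does not fix a schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class CitationPurpose(Enum):
    """Broad categories for why a reference is cited (OTHER is an assumed catch-all)."""
    BACKGROUND = "background"
    METHOD = "method"
    DATA = "data"
    OTHER = "other"


@dataclass
class Mention:
    """One in-text mention of the reference item."""
    section: str              # section of the article where the citation appears
    context: str              # sentences surrounding the citation
    purpose: CitationPurpose  # volunteer's classification of the citation context


@dataclass
class Annotation:
    """Everything one volunteer records for one (article, reference) pair."""
    article_url: str
    reference_text: str
    mentions: List[Mention] = field(default_factory=list)
    significant_role: bool = False  # did the cited resource matter to the research?

    def attributes_data_reuse(self) -> bool:
        """True if any mention was classified as attributing use of data."""
        return any(m.purpose is CitationPurpose.DATA for m in self.mentions)
```

A record with one background mention and one data mention would report `attributes_data_reuse()` as true, which is the signal the analysis is ultimately after.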
We think questions can be phrased so as not to require specialist knowledge.
The minimum requirement for success would be annotation of 1000 citations, since about 10% (roughly 100) are likely to be in the context of data reuse. Annotation of all citations would allow a more thorough analysis of patterns.
Describe the nature of the data that would be used in the proposed project – include (a) its format (filetype, size, number of files) (b) any restrictions (including copyright) on its use (c) its availability (is it archive data or still being collected).
The input data is simply a list of pairs: a URL that points to a paper in PubMed Central paired with a text string that identifies (authors, title, year, journal) a reference item known to be cited within the PubMed Central paper.
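To make the input format concrete, here is a small sketch that reads such pairs from tab-separated text. The tab-separated layout, the function name, and the type name are assumptions for illustration only; the proposal does not specify a file format.

```python
import csv
import io
from typing import List, NamedTuple


class CitationTask(NamedTuple):
    article_url: str     # URL of the citing paper in PubMed Central
    reference_text: str  # authors, title, year, journal of the cited item


def load_tasks(tsv_text: str) -> List[CitationTask]:
    """Parse one (URL, reference text) pair per line, skipping malformed lines."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    tasks = []
    for row in reader:
        if len(row) != 2:
            continue  # blank or malformed line
        url, reference = row
        tasks.append(CitationTask(url.strip(), reference.strip()))
    return tasks
```

Each resulting `CitationTask` corresponds to one unit of work that could be shown to a volunteer.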
The URLs and citation text would be derived from a dataset currently hosted on the Dryad repository:
Piwowar HA (2011) Data from: Who shares? Who doesn’t? Factors associated with openly archiving raw research data. Dryad Digital Repository. doi:10.5061/dryad.mf1sd
Specifically, PubMed IDs for cancer gene expression microarray studies published in 2005 (n=792) would be extracted from the above dataset. An equal number of studies that made their data publicly available (n=129) and studies without publicly available datasets would be retained (total n=258, http://tinyurl.com/7mkbpyq). PubMed Central reports hosting 3540 papers that cite these 258 studies. The 3540 citing papers would be randomized and a subset of 1000 would be extracted to achieve our initial analysis goals.
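The randomize-then-subset step could be done along these lines. This is only a sketch: the function name, the fixed seed, and the placeholder identifiers are assumptions, and the real identifiers would come from the PubMed Central query.

```python
import random
from typing import List, Sequence


def sample_citing_papers(paper_ids: Sequence[str], k: int = 1000, seed: int = 2012) -> List[str]:
    """Shuffle the citing papers reproducibly and keep the first k."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the subset can be regenerated
    shuffled = list(paper_ids)
    rng.shuffle(shuffled)
    return shuffled[:k]


# Placeholder identifiers standing in for the 3540 citing papers.
citing_papers = [f"paper-{i}" for i in range(3540)]
subset = sample_citing_papers(citing_papers)
```

Keeping the full shuffled order, rather than only the first 1000, would also make it easy to extend the annotated set later without re-drawing the sample.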
What automatic processing routines exist which attempt to solve the problem being addressed? Why can’t they be used instead of humans?
Automated processing schemes have been developed to classify citation context (e.g. Teufel et al. Automatic classification of citation function. EMNLP 2006). It is not known how accurate these algorithms are for the specific task of identifying data reuse attribution.
The primary hurdle to automated processing is legal: publishers rarely allow full text to be harvested or used for text mining.
This proposal suggests an approach to work around these access and use limitations: leverage citizen scientists and publicly available research papers to gain large-scale access to the scientific literature.
PubMed lists 804184 publications from 2009 with links to full text. Of these, 247421 (31%) have free full text, available for public view. Only a small subset of these, about 67000 (8%), are open access with full text that can be systematically downloaded and used for text mining. [http://researchremix.wordpress.com/2011/12/15/computing-availability-of-full-text-for-reuse/]
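The percentages above follow directly from the counts. A quick check, using the figures quoted in the text (the open access count is the approximate "about 67000"):

```python
total_with_full_text_links = 804184  # 2009 publications with links to full text
free_full_text = 247421              # free full text, available for public view
open_access_minable = 67000          # approximate count usable for text mining


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


free_pct = percent(free_full_text, total_with_full_text_links)
open_pct = percent(open_access_minable, total_with_full_text_links)
print(f"free full text: {free_pct:.0f}%")
print(f"open access, minable: {open_pct:.0f}%")
```

Both percentages are taken relative to all 804184 publications, not relative to the free full text subset.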
If possible, estimate the minimum number of times a task must be performed on a given element of data to be useful for science (assuming all tasks are performed by competent citizen scientists; once might be enough for exceptionally clear tasks, more times could be required for fuzzier tasks or lots may be necessary if accurate estimates of uncertainties are needed). How many total tasks must be completed before your research goals are achievable?
We believe three replicates would be sufficient, but a bit of experimentation may be needed to understand how many classifications are needed to achieve sufficient accuracy. A master’s student with a bachelor’s degree in forestry was able to complete the task accurately with little training. Five replicates achieved the necessary generalizability when we asked people on Mechanical Turk to complete a more complex task based on the same sort of papers in the biomedical literature (details: http://researchremix.wordpress.com/2008/12/29/generalizability-coefficient-for-mechanical-turk-annotations/).
Assuming three replicates would be sufficient, 1000 citations would require completion of 3000 tasks.
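A small sketch of the task arithmetic, together with one possible way to combine replicate classifications. Majority voting is an assumption used here for illustration; the proposal does not say how replicates would be reconciled, and the function names are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def total_tasks(n_citations: int, replicates: int) -> int:
    """Number of volunteer tasks needed to classify every citation `replicates` times."""
    return n_citations * replicates


def majority_label(labels: Sequence[str]) -> Optional[str]:
    """Return the most common label, or None when the top labels are tied."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    top_label, top_count = counts[0]
    if len(counts) > 1 and counts[1][1] == top_count:
        return None  # no clear majority; the citation may need more replicates
    return top_label


print(total_tasks(1000, 3))  # 3000
```

With three replicates, a citation labelled "data", "data", "background" would resolve to "data", while three different labels would return `None` and could be queued for extra classifications.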
Are there potential extensions to the project that you have in mind?
Yes! We are excited about extensions to this project in at least three dimensions:
1. Data reuse estimates. Investigating instances of data reuse for additional years, domains and datatypes, to understand how patterns differ.
2. Open citations and citation context. The proposed project could be the first step in creating a repository of openly available citation information, ideally with semantic metadata. This would be of broad interest. As one concrete example, citation slices and dices could be included as measures of impact in http://total-impact.org.
3. Toll-access literature. Experiment with applying this approach to subscription-based literature, either by negotiating with publishers or by leveraging the subset of volunteers with university affiliations and subscriptions.
Even more profoundly, this mechanism for enriching the scientific literature will be of deep interest to a wide variety of researchers for all sorts of additional purposes.