Studying data sharing is great, but what we really want to know about is data reuse.
There are benefits to data sharing and archiving even if the data are never reused: for example, sharing detailed datasets likely discourages fraud. Other benefits may accrue but leave no evidence in the published literature: datasets may be used for education and training, or by interested readers to more deeply understand and validate the data-producing study.
The real value-add of data archiving, though, is in the potential for more efficient and effective scientific progress through data reuse. There have been many calls to quantify the extent and impact… to do a cost/benefit analysis. An estimate of the value of reuse would help make a business case for repository funding, support an ethical case for compelling investigators to bear the personal cost of sharing data, and clarify that sharing even esoteric data is useful — as the Nature Neuroscience editorial "Got Data?" puts it, “After all, no one ever knocked on your door asking to buy those figurines collecting dust in your cabinet before you listed them on eBay.”
Little research has been done on the patterns and prevalence of reuse. A few superstar success stories need no analysis: Genbank and the Protein Data Bank are reused, heavily, successfully. They have generated important science that would not have been possible otherwise.
They are so successful, though, that people discount them as special cases.
So what does the reuse behaviour look like for other datasets?
We don’t know. It is difficult to track reuse. There have been a few surveys, but they suffer from limited scope and self-reporting biases. I gather that download stats are poorly correlated with perceived value (I need to learn more about this). So let’s track reuse in the published literature.
Yeah. Well. Let’s. Unfortunately, it won’t be simple, due to the lack of standards for data citations and the ambiguity of citation contexts:
- There is no standard identifier for a given dataset. The accession number? DOI? The citation of the data-producing paper? The author’s name? A search strategy or set of inclusion criteria for a set of datasets?
- Even when there is a standard identifier, or a small set of identifiers, there is no standard practice for referencing the identifier. As a mention in the methods? In the acknowledgements? In a supplementary table? Cited as an official reference?
- Our current tools for finding and extracting these identifiers are poor. If the mentions are in full text, we need to run full-text queries across a wide range of the published literature. Google Scholar, ISI Web of Science, PubMed Central, Scirus, HighWire Press… they all have serious drawbacks. Citations are easier, but data extraction from ISI Web of Science and Scopus is still very suboptimal and not machine-friendly.
- Finally, intelligence is required to disambiguate the dataset mentions. Is the paper discussing a dataset deposit, a dataset reuse, or something else?
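To make the extraction step concrete, here is a minimal sketch of pulling candidate accession numbers out of full text. The regex is an assumption on my part: it approximates common GenBank nucleotide accession shapes (one letter plus five digits, or two letters plus six digits); real identifier formats vary across databases and eras, and this says nothing about whether a mention is a deposit or a reuse.

```python
import re

# Rough pattern for GenBank-style nucleotide accession numbers:
# 1 letter + 5 digits (e.g. U12345) or 2 letters + 6 digits (e.g. AF123456).
# This is an approximation only; other repositories use other formats.
ACCESSION_RE = re.compile(r"\b([A-Z]\d{5}|[A-Z]{2}\d{6})\b")

def extract_accessions(text):
    """Return candidate GenBank-style accession numbers mentioned in text."""
    return ACCESSION_RE.findall(text)

sentence = ("Sequences were deposited in GenBank under accession "
            "numbers AF123456 and U12345.")
print(extract_accessions(sentence))  # ['AF123456', 'U12345']
```

Even this toy version shows why disambiguation needs intelligence: the sentence above is a deposit, not a reuse, and nothing in the matched strings tells you that.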
It isn’t simple to gather reuse patterns, but it is possible, so I’ve started. I’m keeping open notes on Open Wet Ware on my progress. Early days. Stay tuned.