The Tracking 1000 datasets project has generated a lot of interest. Note to readers: blog about your research! Post your proposals! You’ll be glad you did.
So far we’ve been laying the groundwork for identifying reuse: evaluating candidate repositories, choosing 100 random datasets from each repository, and gathering the citation history of the associated publications.
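For anyone who wants to try something similar, here is a minimal sketch of the "100 random datasets per repository" step. It is not the script used for this project: the function name `sample_datasets`, the fixed seed, and the example identifiers are all hypothetical, and it assumes you already have a list of candidate dataset IDs for each repository.

```python
import random


def sample_datasets(ids_by_repo, k=100, seed=2005):
    """Draw up to k dataset IDs per repository, reproducibly.

    ids_by_repo: dict mapping a repository name to a list of candidate IDs.
    The seed value is an arbitrary choice for this illustration.
    """
    rng = random.Random(seed)  # private generator, so the draw can be repeated
    sample = {}
    for repo in sorted(ids_by_repo):  # fixed order keeps the draws stable
        ids = sorted(set(ids_by_repo[repo]))  # drop duplicate IDs first
        # Repositories with fewer than k candidates are kept in full.
        sample[repo] = rng.sample(ids, min(k, len(ids)))
    return sample


if __name__ == "__main__":
    # Hypothetical candidate lists; real IDs would come from repository queries.
    candidates = {
        "repo_a": [f"A{i:04d}" for i in range(500)],
        "repo_b": ["B1", "B2", "B3"],
    }
    for repo, ids in sample_datasets(candidates).items():
        print(repo, len(ids))
```

Fixing the seed and sorting the inputs means the same candidate lists always give the same sample, which makes it easier to revisit the selection later.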
I’m very lucky to have help with the time-consuming data collection steps, thanks to UBC undergrad research-assistant Estephanie Sta Rosa. Thanks Estephanie!
Here are the 10 chosen repositories:
- Geochemistry of Rocks of the Oceans and Continents (GEOROC)
- Gene Expression Omnibus
- Protein Data Bank
- Nucleic Acids Database
- Biological Magnetic Resonance Data Bank
- A social science repository combo: ICPSR publication-related datasets, IQSS Dataverse publication-related datasets, and UK ESDS Qualidata. These are important repositories that have much in common, but each has fewer than one hundred deposits from 2005
- A journal-hosted replication data combo: A subselection from journals that require and host replication data. These are mostly in economics and international studies
I’ve learned a lot from the selection and querying process. In particular:
- Querying for dataset submissions based on publication date is simple in some repositories and not in others. In several instances I’ve contacted the repository’s helpdesk for advice. The helpdesks have been universally helpful, and have often gone far out of their way to provide me with what I was looking for. Thanks, helpdesks and other contacts!
- Google (using the site: operator) is often a more flexible and effective way to search within databases than the repositories’ own web interfaces.
- The database I’ve had the most trouble querying? GenBank. I have a few leads (thanks biostar!), so I might yet figure it out, and GenBank may knock ArrayExpress off the short list. But WOW the GenBank tools are really not designed for this type of query.
- In some repositories it is very difficult to determine the date of deposit. I use date of article publication as an imperfect proxy.
- I debated whether to include data repositories that were only loosely connected with publications, like ORNL DAAC. I’ve finally decided to stick firmly with the requirement that the datasets be publication-related, primarily to provide commonality across the whole sample.
- Biology and biomedicine seem to dominate the publication-related space, as far as I can tell. Repositories in other disciplines often house data collected in government labs, for example… important, but out of scope for this study. Please let me know if I’m missing a great candidate from another discipline.
- I can’t find any institutional data repositories that were accepting a large number of publication-related datasets in 2005, other than Dataverses and ICPSR.
- I severely underestimated the number of citations that each of the dataset articles would receive! I assumed that many of the publications would be uncited, but that is not true for these papers. To address this, I’ll manually curate just a subsample of the citations.
- Traditional citation-tracking tools, ISI Web of Science and Scopus, do not currently support tracking citations to datasets themselves within references sections (details). I’ll use Google Scholar for this instead.
Detailed manual classifications of reuse instances will begin this spring. I’ll keep you updated!