Research Remix

May 30, 2011

Prelim: Finding the holdouts: Who is required to publicly archive data but still doesn't?

Filed under: Uncategorized — Heather Piwowar @ 10:44 am

I just can’t let another ASIS&T go by without submitting a research paper. I really like the ASIS&T annual meetings. They are welcoming, fun, interesting, and right up my research interest alley.

For what it is worth, I don’t actually think that presentations of peer-reviewed papers at conferences are a great use of anyone’s time. Posters, lightning talks, birds-of-a-feather breakouts, speed networking, tweetups, hackathons, and strange icebreakers make much better use of the precious face-to-face opportunity. That said, I only want to change the world one piece at a time and conference identity isn’t at the top of my list. I value being part of the ASIS&T community.  Submission it is! Due tomorrow. Here’s what I’ve got so far.

Idea
Pull together an interesting and useful analysis quickly, based on a subset of my thesis data.

Working title
Finding the holdouts: attributes of investigators who fail to publicly archive data even when required

Plan
Identify the subset of (gene expression microarray) data-producing studies that were published:

  1. in journals that require public data archiving of gene expression microarray data
  2. recently (since 2007)
  3. by first and last authors who haven’t publicly archived gene expression microarray data before, as far as I can determine (first and last authors are the main decision makers in biomedical publications)

Within this subset, see which attributes of the investigators, their study, their funding, and their journal are associated with the proportion of publications with datasets in public data archives.
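
To make the plan concrete, here's a rough R sketch of the subsetting step. The file name and column names below are placeholders I've invented for illustration; the real variable names are in the archived dataset and the analysis code linked below.

# read the study-level data; "rawdata.csv" is a placeholder file name
dat <- read.csv("rawdata.csv", stringsAsFactors = FALSE)

holdouts <- subset(dat,
  journal.policy.requires.microarray.data == 1 &     # 1. journal mandates archiving
  pubmed.year >= 2007 &                               # 2. published since 2007
  first.author.num.prev.microarray.archives == 0 &    # 3. no prior public archiving
  last.author.num.prev.microarray.archives == 0)      #    by first or last author

nrow(holdouts)   # size of the analysis subset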

Who cares?
Knowing who is least likely to publicly archive data, even when required to do so, could help us uncover unrecognized hurdles in data archiving tools, policies, practice, and culture. The results could be used to study these issues and direct educational and policy resources as necessary.

The results of this analysis should be considered hypothesis-generating because the sample size is relatively small for the number of attributes examined. The trends discovered can inform future surveys, interviews, and focus groups that seek to understand these issues in more depth.

Data and code
Data was previously collected and archived, and is available at:
Piwowar HA (2011). Data from: Who shares? Who doesn’t? Factors associated with openly archiving raw research data. Dryad Digital Repository. doi:10.5061/dryad.mf1sd

Prelim statistical analysis code is at https://gist.github.com/999040

Preliminary results

There are 1379 data-producing studies published in 2007-2009, in journals that require public data archiving, and by authors who hadn’t publicly archived this datatype before. Of these, 491 (36%) were found to have associated datasets in the best-practice public data archives. Accounting for datasets missed by the automated tracking methods, the true level of data archiving is probably about 45%.
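
For the curious, here's roughly how those numbers fit together, in R. The detection rate below is an assumed value purely to illustrate where a ~45% adjusted estimate could come from, not the actual correction factor.

archived <- 491
total    <- 1379

archived / total                      # 0.356, the 36% reported above
binom.test(archived, total)$conf.int  # exact 95% CI, roughly 0.33 to 0.38

# the ~45% figure adjusts for deposits the automated tracking misses;
# for example, if tracking only detects about 80% of true deposits:
detection.rate <- 0.80                # assumed value, for illustration only
(archived / detection.rate) / total   # roughly 0.45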

The only attributes found to be associated with archiving status at p < 0.05 (unadjusted for multiple comparisons):

"last.author.num.prev.microarray.creations"           "1.08208619125454e-06"
"country.japan"                                       "0.000189421842358658"
"num.authors"                                         "0.000428626736381222"
"years.ago"                                           "0.000626355242287226"
"first.author.num.prev.microarray.creations"          "0.00520844140735418"
"pubmed.is.bacteria"                                  "0.0141373848532926"
"pubmed.is.cancer"                                    "0.0152110555075599"
"pubmed.is.shared.other"                              "0.0155174619349908"
"journal.policy.general.statement"                    "0.0176644417971739"
"last.author.num.prev.geo.reuse"                      "0.0244709126331953"  
"pubmed.is.core.clinical.journal"                     "0.0319486784601339"

(chi-squared test for binary and categorical variables, rank-sum test for continuous and ordered variables)
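
For reference, a stripped-down R version of that univariate screening; the full code is in the gist above, and the outcome column name (archived.in.repository) and attribute names here are just illustrative placeholders, with holdouts being the study-level subset sketched earlier.

# binary / categorical attribute: chi-squared test of independence
chisq.test(table(holdouts$country.japan, holdouts$archived.in.repository))$p.value

# continuous / ordered attribute: Wilcoxon rank-sum test
wilcox.test(num.authors ~ archived.in.repository, data = holdouts)$p.value

# the p-values are unadjusted; something like p.adjust(pvals, method = "BH")
# would be applied before treating any of them as confirmatory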

To see the direction of the association for these attributes, and all the other attributes, here are some plots. The points represent the mean proportion of studies for which I found publicly available datasets, for the subset of studies with the attribute values displayed on the left. The blue bars are 95% confidence intervals.
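
For anyone who wants to reproduce that kind of plot: the quantity behind each point is just a within-group proportion with a binomial confidence interval. A rough R sketch, with the same illustrative column names as above:

prop.with.ci <- function(n.archived, n.total) {
  ci <- binom.test(n.archived, n.total)$conf.int
  c(proportion = n.archived / n.total, ci.lower = ci[1], ci.upper = ci[2])
}

# one plot row per level of an attribute, e.g. pubmed.is.cancer:
by(holdouts, holdouts$pubmed.is.cancer, function(grp) {
  prop.with.ci(sum(grp$archived.in.repository == 1), nrow(grp))
})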


Discussion

Off to go write it :)  Observations:

So it seems the specific wording of a journal policy that requires data archiving doesn’t matter much, though policies that include a general statement about data sharing and request the sharing of other datatypes are associated with higher rates of data archiving.  The highest-impact journals that require data archiving have slightly higher archiving rates than those with impact factors between 4 and 7.  Mentioning exceptions in a journal policy may be associated with increased rates of archiving.  Core clinical journals tend toward high rates of data archiving (likely overlapping with the high-impact-factor journals).

Disheartening to see again that studies about cancer are least likely to publicly archive data, even when required.  Some disciplinary trends: studies on bacteria are more likely to follow journal mandates.  Perhaps related: studies that archived other types of data were more likely to also archive gene expression microarray data.

The number of PubMed Central citations received by a study wasn’t obviously correlated with whether it adhered to journal data archiving requirements in this univariate analysis.

Funding doesn’t seem very related, either by source (within the NIH) or amount.  This is also true for studies which I estimate to have needed to submit data management plans as part of their NIH proposals (num.post2004.morethan1000k).  There is some evidence that studies with fewer NIH grants were less likely to archive their data.

Studies with a corresponding author address in Japan were much less likely  than others to archive data when required.  Other attributes of the institution did not seem to correlate much with archiving when required.

Studies with fewer authors were less likely to archive data when publishing in journals that require it.  There was no general trend with author age or experience, except that first and last authors who have created gene expression microarray datasets before and not yet archived any (by my reckoning) were LESS likely to share their current dataset than authors who apparently are publishing their first such dataset (first/last.author.num.prev.micro.produce).  Authors who have reused gene expression microarray data before are very much more likely to publicly archive their data when required to do so (though the N is small in this study, so the error bars for this effect are very wide).

Note, this work of course has limitations and caveats.  Many are similar to those of the associated data collection study.  Check it out, or ask.  Here’s a quick draft of some of the big ones:

This study does not attempt to measure public sharing of gene expression data wherever it may be on the Internet, but instead focuses on the two best-practice repositories, GEO and ArrayExpress. It did not include data listed in journal supplementary information, on lab or personal web sites, or in institutional or specialized repositories (including the well-regarded and well-populated Stanford Microarray Database).

Analyzing data sharing through bibliometrics allows investigation at a large scale and avoids survey response self-selection or reporting bias. However, this approach does suffer its own limitations, as discussed in [PLoS ONE]. In particular, automated filters for identifying microarray creation studies do not have perfect precision, so some non-data-creation studies may be erroneously included in the analysis. Furthermore, the methods overlook data deposits in GEO or ArrayExpress if they do not have PubMed identifiers in their dataset records. Due to these limitations, care should be taken in interpreting the estimated levels of absolute data sharing and the data-sharing status of any particular study listed in the raw data.  It is believed that any errors or misclassifications are randomly distributed across attributes, but this is not known for certain.

Associations do not imply causation.

Because of the number of variables examined and the wide confidence intervals on many of the findings, the results of this subset analysis should be considered hypothesis-generating and used to inform additional investigations.

The policies were classified based on written Instructions to Authors statements only.

Will post a link once I have a full preprint!

