This is the third installment in my #draftInProgress series on the Open Data citation advantage. I reread the methods description in my Who Shares paper and decided to excerpt it directly for the method details (related thoughts on self-plagiarism and OA).
Methods: assessment of data availability
The independent variable of interest in this analysis is the availability of gene expression microarray data. Data availability had already been determined for our sample articles in Piwowar 2011, so we directly reused that dataset [Piwowar Dryad 2011]. That earlier study limited its data hunt to the two predominant gene expression microarray databases: NCBI’s Gene Expression Omnibus (GEO) and EBI’s ArrayExpress.
“An earlier evaluation found that querying GEO and ArrayExpress with article PubMed identifiers located a representative 77% of all associated publicly available datasets [Piwowar 2010]. [We] used the same method for finding datasets associated with published articles in this study: [we] queried GEO for links to the PubMed identifiers in the analysis sample using the “pubmed_gds [filter]” and queried ArrayExpress by searching for each PubMed identifier in a downloaded copy of the ArrayExpress database. Articles linked from a dataset in either of these two centralized repositories were considered to have [publicly available data] for the endpoint of this study, and those without such a link were considered not to have [available] data.” [Piwowar 2011]
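For readers who think better in code, here is a minimal Python sketch of the classification rule described in the excerpt. It is an illustration, not the code used in the study: the function names, the `[uid]` field tag in the query string, and the toy PubMed identifiers are all my assumptions, and the sketch makes no network calls. It assumes you already have the set of PubMed identifiers that GEO links to and the set found in a local ArrayExpress copy.

```python
def build_geo_filter_query(pmids):
    """Build a PubMed search term that keeps only the given PMIDs that
    have GEO dataset links. The "pubmed_gds[filter]" term comes from the
    excerpt; the "[uid]" tag and overall query shape are assumptions."""
    id_clause = " OR ".join(f"{pmid}[uid]" for pmid in pmids)
    return f"({id_clause}) AND pubmed_gds[filter]"


def classify_data_availability(sample_pmids, geo_linked, arrayexpress_linked):
    """Mark each sample article as having publicly available data if its
    PMID is linked from a dataset in either repository, else not."""
    linked = set(geo_linked) | set(arrayexpress_linked)
    return {pmid: pmid in linked for pmid in sample_pmids}


# Hypothetical identifiers, for illustration only.
sample = ["111", "222", "333"]
geo_hits = {"111"}            # PMIDs returned by the GEO-filtered query
arrayexpress_hits = {"333"}   # PMIDs found in the ArrayExpress download

query = build_geo_filter_query(sample)
availability = classify_data_availability(sample, geo_hits, arrayexpress_hits)
```

The point of the set union is that a link from either repository is enough. An article with no link in either one is counted as not sharing, even if its data lives somewhere else, which is why the 77% recall estimate quoted above matters.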
Piwowar H, Chapman W (2010). Recall and bias of retrieving gene expression microarray datasets through PubMed identifiers. J Biomed Discov Collab, 5: 7–20.
Piwowar HA (2011). Data from: Who shares? Who doesn’t? Factors associated with openly archiving raw research data. Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.mf1sd
Piwowar HA (2011). Who shares? Who doesn’t? Factors associated with openly archiving raw research data. PLoS ONE, 6 (7): http://dx.doi.org/10.1371/journal.pone.0018657