August 7, 2007

Summary of ONS BoF at ISMB

Quick summary of the Birds of a Feather on Open Notebook Science (ONS) at ISMB:

The session was during a short lunchtime on the last day, thus not ideal for high attendance.  Nonetheless, about 10 young scientists attended (including Frank and Matt, what fun to meet in person), and we had an interesting discussion.

  • many of us, but not all, had heard of ONS previously.  Nobody doing it.
  • challenges specific to bio/biomed/informatics:
    • sharing details about invasive animal experiments on the open internet could lead (and has led) to harassment
    • privacy issues with clinical data (comment:  would first be fired then sued)
      • maybe ways around this, share lots but not everything, model how it is handled in publications, etc
  • general points:
    • more “errors” are bound to be found, will need a new publishing paradigm to deal with this
    • process for assessing research is disjoint from these practices, though changes are underway
    • only valid in areas where no potential commercial benefit, otherwise universities won’t allow?
    • might encourage informal peer review, thus raising the quality of  submissions and helping the investigators
    • young investigators just can’t risk being scooped
      • time-marked stake in the sand a reasonable defense?  Parallels with patent law.
      • will need social change
      • fear of scooping perhaps more pervasive than it occurring
        • yet a first-hand example in the room of being scooped from a rejected grant application
      • flip side:  if someone realizes that other good investigators are already working on something and n months ahead, they may forgo it and do something else
      • flip side:  potential collaborations
      • thoughts that the benefits may start outweighing the risks after the work is already well underway, as opposed to just being started
  • at the end of the session (generalizing) most in the room felt that ONS is interesting to think about, hard to pull off, perhaps possible as small steps, social change required.

Thanks to everyone who attended.  I enjoyed meeting you, and learned from the different perspectives in the conversation.

July 24, 2007

Conversation with BMC on Open Notebook Science

Wow, fantastic. I just had a conversation with Matt Hodgkinson, Senior Editor of the BMC series, which was worth the trip to Vienna all by itself.

While taking a break this morning from the ISMB talks and note-taking, it occurred to me that perhaps the best prep I could do for the ONS BoF was to talk to the journal publishers, all of whom happened to be standing a few feet away from me in their booths. Since I’m wearing my lovely free “I’m open” swag tshirt from BioMed Central (BMC) today, and I figured they’d be friendly to the cause, I started there.

Matt Hodgkinson will be familiar not only as an editor at BMC, but also as the author of the blog Journalology (“Science publishing trends, ethics, peer review, and open access”). I really enjoyed our informative discussion, Matt, thank you! As you read this, please feel free to clarify or add anything I’ve forgotten.

The bottom line: BMC has no hesitation considering research which has been previously posted to personal websites, blogs, wikis, and pre-print servers (as part of Open Notebook Science or otherwise), as long as it has not also been published in some formal way.

The details: Formal publishing is of course slightly difficult to nail down (they used to say “anything with a doi”, but now Nature Precedings has a doi without being considered a formal publication). A rule of thumb may be “anything with an ISSN.” Peer-review, or being indexed by PubMed, are not relevant to BMC when ascertaining prior formal publication status. Posters and abstracts are ok, conference proceedings are usually considered formal publications. Again, pre-print servers (Nature Precedings, arXiv) are fine.

Our conversation also touched on publishing clinical trial data and protocols, negative results, the fact that publishers can and do help recover data from authors who don’t respond to reader requests, the BMC policies for data sharing relative to those of other journals, and the potential for publishing about ONS. Unfortunately, no time to go into details now…

Once again, thank you, Matt, for your enthusiasm and time. I’m off next to talk to the folks at PLoS.

Messy Notes on Open Notebook Science

In anticipation of the ISMB BoF session on Open Notebook Science (ONS), I’m trying to come up to speed on ONS discourse.  In between ISMB sessions, I’ve started consolidating snippets of blog posts and articles discussing ONS into a single document (in the open here).  It obviously relies heavily on work from Bill Hooker and Jean-Claude Bradley:  Thank you.  Following the advice to “make a mess in your zero draft,” the current version isn’t very good reading.  I have many more links to comb, and then I’ll start pulling it together and making a first draft (aka human-readable version).  I’ll post again once it gets to that state.

July 18, 2007

Thanks to the NSF for travel grant

Public thank you to the NSF for my travel-fellowship to Vienna. I didn’t receive one in the first-round, but was on the waiting list and supposedly a number of the first-choice applicants weren’t able to attend, so I got to move up in the queue. $1100 to put towards the trip, thanks to the US National Science Foundation. Fantastic. Thank you.

It certainly helps, especially since I’m not ready to be away from my 15-month-old kiddo for a week just yet, so we are on a whole-family outing during a time of peak airplane-ticket prices.

Hopefully this will encourage others to think big about conference attendance and travel-grants. They are out there, it doesn’t hurt to apply, you never know!

ISMB 2007 BoF: Open (Notebook) Science

There will be a Birds of a Feather session at ISMB 2007 about Open (Notebook) Science.  It was initiated by yours truly, not because I’m an expert (I’m not!) or even because I have any real experience doing Open Notebook Science (I don’t!), but because I’d like to meet others who are interested and have a good conversation.  Sounds like a BoF to me!

So if you are at ISMB and available Wednesday at lunch, stop on by.

ps Thanks to Bill Hooker for his great summary, and to
all these people blogging about the Open Science Notebook [neat], and especially all those people who are really doing it.

Description: Open (Notebook) Science — the practice
of freely and openly sharing the process, data, tools, and results of
our research — is gaining momentum. For a nice overview, see
BOF for people doing, considering, or curious about Open Science.

Also note another BoF of interest, on Tuesday:
Data and Software Sharing
Barb Bryant [Vice President of the International Society for Computational Biology (ISCB)]

This session will explore options for Data and Software Sharing and is open to all to provide feedback to ISCB.

ISMB Poster: Examining the uses of shared data

I’m longing to catch up with reading and posting and commenting, but it will have to wait a bit longer. I’m packing to go to Vienna, for the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) & 6th European Conference on Computational Biology (ECCB).

I’m presenting a poster. It shows some preliminary results of looking at re-use patterns for microarray data in the PubMed Central literature.  It is up on Nature Precedings (yup, prior to the conference — Nature and ISMB both a-ok with it):

Poster G20
Examining the uses of shared data
Heather Piwowar & Douglas Fridsma
University of Pittsburgh

Does your research area re-use shared datasets?

  • Re-using data has many benefits, including research synergy and efficient resource use
  • Some research areas have tools, communities, and practices which facilitate re-use
  • Identifying these areas will allow us to learn from them, and apply the lessons to areas which underutilize the sharing and re-purposing of scientific data between investigators

Which datasets?
This preliminary analysis examines the re-use of microarray gene expression datasets.
Thousands of microarray gene expression datasets have been deposited in publicly available databases.
Many studies reuse this data, but it is not well understood for what purposes. Here, we examined all publications found in PubMed Central on April 1, 2007 whose full-text contained the phrases “microarray” and “gene expression” to find studies which re-used microarray data.

How did we identify re-use?
We developed prototype machine-learning classifiers to identify a) studies containing original microarray data (n=900) and b) studies which instead re-used microarray data (n=250). Preprocessing (Python NLTK) extracted manually-selected keyword frequencies from the full-text publications as features for a Support Vector Machine (SVMlight). The classifier was trained and tested on a manually-labeled set of documents (PLoS articles prior to January 2007 containing the word “microarray,” n=200).
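
To make the feature-extraction step concrete, here is a minimal sketch in plain Python. It is not the poster's actual code: the keyword list, the tokenizer, and the function names are all assumptions for illustration (the poster used NLTK and does not list its keywords). It turns one document into keyword frequencies and writes them in the sparse `label index:value` text format that SVM tools such as SVMlight read.

```python
import re
from collections import Counter

# Hypothetical keywords: the poster does not list the ones actually used.
KEYWORDS = ["hybridized", "deposited", "downloaded", "reanalysis"]


def keyword_features(text, keywords=KEYWORDS):
    """Relative frequency of each keyword among the document's word tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())  # crude tokenizer, an assumption
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[k] / total for k in keywords]


def to_sparse_line(label, features):
    """Format one example as '<label> <index>:<value> ...' (1-based, zeros omitted)."""
    target = "+1" if label > 0 else "-1"
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([target] + pairs)


doc = "We downloaded the data and the reanalysis used downloaded arrays"
features = keyword_features(doc)
line = to_sparse_line(+1, features)  # +1 is a hypothetical label for "re-used data"
print(line)
```

Each manually labeled article would contribute one such line to the training file, and each unlabeled article one line to the file to be classified.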

How did we identify patterns of re-use?
We compared the Medical Subject Headings (MeSH) terms of the two classes to estimate the odds that a specific MeSH term would be used given all studies with original microarray data, compared to the odds of the same term describing studies with re-used data. Terms were truncated to comparable levels in the MeSH hierarchy.
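
For readers who want the arithmetic spelled out, here is a small sketch of an odds ratio for one MeSH term. The direction (re-used relative to original) is my reading of how the results are reported, and the term counts in the example are invented for illustration; only the class sizes (250 and 900) come from the poster.

```python
def odds_ratio(reuse_with, reuse_total, orig_with, orig_total):
    """Odds of a term among re-use studies divided by its odds among original studies."""
    reuse_without = reuse_total - reuse_with
    orig_without = orig_total - orig_with
    # Any of these being zero makes the ratio undefined.
    if 0 in (reuse_without, orig_with, orig_without):
        raise ValueError("odds ratio undefined when a required cell is zero")
    reuse_odds = reuse_with / reuse_without
    orig_odds = orig_with / orig_without
    return reuse_odds / orig_odds


# Hypothetical counts: 30 of 250 re-use studies and 45 of 900 original-data
# studies carry the term. These numbers are made up, not taken from the poster.
term_or = odds_ratio(30, 250, 45, 900)
print(round(term_or, 2))
```

An odds ratio above 1 means the term is relatively more common among studies with re-used data, and below 1 relatively more common among studies with original data.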

Publications with original vs. re-used microarray data have different distributions of MeSH terms (Figure 1), and occur in different proportions across various journals (Figure 2).
Microarray data source (original vs. re-used) did not affect the odds of a study focusing on humans, mice, or invertebrates, whereas publications with re-used data did involve a relatively high proportion of studies involving fungi (odds ratio (OR)=2.4), and a relatively low proportion involving rats, bacteria, viruses, plants, or genetically-altered or inbred animals (OR<0.5) compared to publications with original data.
Trends in odds ratios of MeSH terms for other attributes can be seen in Figure 3.

Although not all research topics can be addressed by re-using existing data, many can. Identifying areas with frequent re-use can highlight best practices to be used when developing research agendas, tools, standards, repositories, and communities in areas which have yet to receive major benefits from shared data.

Future Work
We plan to refine our tool for identifying studies which re-use data, and continue studying and measuring re-use and reusability.

NOTE: typo in previous versions of the Nature Precedings abstract (should be OR<0.5 not OR<0.05).

I feel this is a slightly interesting, hypothesis-generating piece of preliminary work.  I think that it contributes most in raising the issue of data re-use.  I do hope to refine my “automatic reuse identifiers” and dig into the details and validation a bit more.

Comments and feedback welcome and encouraged, especially to help me understand if others find this interesting.

Edited to add a bit of content and update the version url.   Question:  does editing my posts do bad things to people getting them via RSS feed?  If so, please let me know.

May 1, 2007

Open Data at ISMB 2007?

I was expecting the accepted papers for ISMB 2007 to include a few related to Open Science or Open Data. I suppose ISMB is more about the biology than the science or engineering behind the research. Nonetheless, it could have been a worthwhile mix… many benefits when the users and the builders get together.

My current interest in Open Data is the sharing and reuse of raw clinical trial (specifically cancer, microarray) data. I attended ISMB 2005 and found it fascinating, which led to my plans to attend ISMB 2007. Well, and Vienna is beautiful and has great cycling. So it looks like my focus for the conference will be to hear about state-of-the-art microarray research, and even more, studies which use heterogeneous datasets.

Molecular biology has been a leader in open-access data with its sequence and protein databases, so I’m sure there will be a lot to learn.

A few papers which looked particularly interesting to me, as excerpted from ISMB’s list of accepted papers:


Accepted Paper Number: 146
Title: Annotating Gene Function by Combining Expression Data with a Modular Gene Network
Author(s): Motoki Shiga, Ichigaku Takigawa and Hiroshi Mamitsuka


Accepted Paper Number: 184
Title: Biases induced by pooling samples in microarray experiments
Author(s): Tristan Mary-Huard, Jean-Jacques Daudin, Michela Baccini, Annibale Biggeri and Avner Bar-Hen


Accepted Paper Number: 220
Title: Kernel-based data fusion for gene prioritization
Author(s): Tijl De Bie, Leon-Charles Tranchevent, Liesbeth van Oeffelen and Yves Moreau

Accepted Paper Number: 264
Title: Information Theory Applied to the Sparse Gene Ontology Annotation Network to Predict Novel Gene Function
Author(s): Ying Tao, Lee Sam, Jianrong Li, Carol Friedman and Yves A. Lussier


Accepted Paper Number: 271
Title: Identification of New Drug Classification Terms in Textual Resources – a Case Study
Author(s): Corinna Kolarik, Martin Hofmann, Marc Zimmermann and Juliane Fluck


Accepted Paper Number: 282
Title: Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks
Author(s): Wei Keat Lim, Kai Wang, Celine Lefebvre and Andrea Califano


Accepted Paper Number: 342
Title: Using genome context data to identify specific types of functional associations in pathway/genome databases
Author(s): Michelle Green and Peter Karp


Accepted Paper Number: 363
Title: Bayesian association of haplotypes and non-genetic factors to regulatory and phenotypic variation in human populations
Author(s): Jim Huang, Anitha Kannan and John Winn


Accepted Paper Number: 436
Title: Manual curation is not sufficient for annotation of genomic databases
Author(s): William Baumgartner, Lynne Fox, George Acquaah-Mensah, K. Bretonnel Cohen and Lawrence Hunter

If anyone else is planning to attend ISMB 2007, please let me know — it would be fun to meet up. A data sharing/reuse birds-of-a-feather session, perhaps?

