Research Remix

April 27, 2011

The perfection hurdle

Filed under: Uncategorized — Heather Piwowar @ 10:25 am

We discuss many hurdles that investigators face when sharing their research datasets: cost, credit, and fear of misinterpretation and scooping, to name a few.  I think there is a stealthy hurdle a bit further out.  I’ve run smack-dab into it, and ouch it hurts my shins.

See, I have a bunch of projects that are almost done.  Papers accepted, proofs being proofed, and now the publishers want the final data-archiving URLs.  Yay, right?  Right!  Except that now I actually have to archive the data.  I’m totally on board with this, in theory.  I think it is a good idea: in general, for me personally, and for these projects specifically.  I’m not worried about cost, credit, or fear of misinterpretation or scooping.

Then why am I hesitating, why have I put it off till now?  Why, if I weren’t so committed to the cause, might I not do it at all?

My spreadsheets and my scripts, they just aren’t as elegant in real life as they are in my head.  My scripts need more commenting, my README needs more detail, my column names need more consistency.  I did try to follow best practices when I set them up, but that was many months ago and now I know better and I want to do better.

But after spending so much time getting the article text just right, caring about every silly detail in the bibliography, and doing somersaults to get the figures in the right format… the idea of upgrading one more set of research artifacts into a “published, ready to be archived forever, I’m proud of this” snapshot state feels daunting.  I want more time, I want more inspiration, I’m not ready.  Many researchers are perfectionists by nature, so I doubt I’m alone.

This issue is different from a lack of time, a lack of resources, or a fear of errors.  Its impact on the prevalence of data withholding is difficult to quantify.  I have no doubt that it contributes to the relative willingness to share details with other investigators in a limited way, upon request, rather than in public for all to see.  On request feels less final.

So how do we lower this hurdle?  Examples and templates and guidelines and mentoring.  Mandates and standards will help.  Releasing widely, early and often.  Recognizing that creative output falls short of its creator’s aspirations all the time, and especially when people are new to something (check out the message in this video by Ira Glass).  Repeating that the perfect is the enemy of the good.  All of that.  All of the ways academics learn to deal with harmful perfectionism in other aspects of what they consider part of their job.

Nonetheless, as people who think about the challenges to data archiving, we ought to remember that perfectionism is unlikely to be volunteered as a reason for data withholding and yet probably makes a substantial impact, particularly before data archiving becomes standard practice.

I’m off to submit my datasets now.

5 Comments

  1. Maybe what should be rewarded by universities — tenure committees and such — is not only the fact that the final data set gets published. Instead, universities should reward the publication of the full “archeology of the data analysis.” This means that from day one I put my data analysis script on something like github — https://github.com/ — point the script to my data set living on a cloud, and let others fork and comment on my code (see the sketch below). The publicity of the code would automatically protect my project, as I would always have documented proof of this script and data being mine.

    Comment by Ricardo Pietrobon — April 27, 2011 @ 4:41 pm
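
    A minimal sketch of the setup Ricardo describes: the analysis script would live in a public GitHub repository from day one and pull its data from a cloud-hosted copy. The URL, file handling, and names below are hypothetical, not his actual project.

        # Hypothetical sketch: this script would sit in a public GitHub repo,
        # fetching its dataset from a cloud-hosted URL rather than a private disk.
        import csv
        import io
        import urllib.request

        DATA_URL = "https://example.org/mylab/dataset-v1.csv"  # hypothetical location

        def load_rows(url=DATA_URL):
            # Anyone who forks the repo runs against the same public data.
            with urllib.request.urlopen(url) as resp:
                text = resp.read().decode("utf-8")
            return list(csv.DictReader(io.StringIO(text)))

        if __name__ == "__main__":
            rows = load_rows()
            print(len(rows), "records fetched from", DATA_URL)

    Because the repository history is public, the “documented proof” Ricardo mentions comes for free from the commit timestamps.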

  2. I’d like to suggest everyone read Hunt and Thomas’s gem of a book, The Pragmatic Programmer. It contains scads of useful advice, including the critical advice for solving this problem: scripting.

    The problem everyone is having is that they do things by hand rather than automating them. Thus when it comes time to reconstruct what they did, they have no clue, because there’s no record, just a bunch of scripts that have been changed incrementally.

    Instead, everyone should be scripting things so that they run pretty much end to end in what the biologists like to call a “pipeline”.

    You also need version control. Then, every time you run something, stamp it with the version control ID (see the sketch after this comment). That lets you go back to what you did at any point in the past painlessly. And it really stops the biggest problems I see in “research” code: code sprawl both in single files (why delete when I can comment out?) and across files (the signs of which are sets of files named align, align_bk, align_bk2, align_bk_working_i_think, …).

    Code documentation is hugely overrated. It lies, for one thing, whereas the code itself always tells the truth. It’s always better to spend time making the code more self-explanatory than writing comments for obscure code. If you have a script that does what you did end to end, it’s then up to the user to figure it out. No reason to front-load all that work.

    If you must have comments, just comment higher-level blocks for intent.

    Comment by Bob Carpenter — April 28, 2011 @ 10:44 am
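
    A minimal sketch of the end-to-end pipeline and run-stamping Bob describes, assuming the analysis lives in a git repository; the analysis step itself is a placeholder and every file name is hypothetical.

        # Hypothetical sketch: run one pipeline step end to end and stamp its
        # output with the git commit ID that produced it.
        import subprocess
        from pathlib import Path

        def git_commit_id():
            # Ask git for the current short commit hash (assumes a git checkout).
            return subprocess.check_output(
                ["git", "rev-parse", "--short", "HEAD"], text=True
            ).strip()

        def run_pipeline(raw=Path("data/raw.tsv"), outdir=Path("results")):
            # High-level intent: process the raw data and write one stamped
            # output per run, instead of align_bk2 / align_bk_working_i_think.
            stamp = git_commit_id()
            outdir.mkdir(parents=True, exist_ok=True)
            out = outdir / ("aligned_" + stamp + ".tsv")
            out.write_text(raw.read_text())  # placeholder for the real analysis
            return out

        if __name__ == "__main__":
            print(run_pipeline())

    Any result file can then be traced back to the exact code that produced it by checking out the stamped commit.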

  3. Ricardo, that’s interesting. It would encourage sharing early, which would help. I’m afraid it is a long way off, though.

    Bob, I agree that scripting resolves a lot of the issues, and scripting and coding well resolves most of the issues. If only scripting and coding well were a bit easier! :) I think it is hard for lots of scientists to get good at this because it is just one small piece of what they do every week. Picking up and putting down and picking up and putting down doesn’t make a good opportunity to refine best practices. The fact that their code doesn’t have much of an external audience in the early days hurts too. That said, I agree that a few ideas can really help, like thinking that code should be skimmable and that bad names are actually software bugs (see the small example after this thread).

    Comment by Heather Piwowar — May 5, 2011 @ 3:10 pm

    • It’s a tortoise-and-hare issue. You may think that doing things by hand will be faster, but scripting and repeatability more than pay for themselves in the end. Writing code with better naming conventions is easier all around. And so on.

      I agree that it’s hard to pick up the way to do these things. Pair programming is really really helpful for this at all levels. Again, people think it’ll slow them down and make them less productive, but every time I do it, I learn something and the projects go faster. My one objection is that it requires so much concentration that it’s really tiring.

      Comment by Bob Carpenter — May 12, 2011 @ 2:22 pm
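
      A tiny illustration of the “bad names are software bugs” point from this thread; both functions compute the same thing, and all the names and the 0.05 threshold are made up.

          def f(d):
              # Opaque: a reader must reverse-engineer what d and x[1] mean.
              return [x for x in d if x[1] > 0.05]

          def datasets_shared_above(datasets, threshold=0.05):
              # Skimmable: the intent survives months of picking up and putting down.
              return [(name, rate) for name, rate in datasets if rate > threshold]

          if __name__ == "__main__":
              data = [("trial_a", 0.12), ("trial_b", 0.01)]
              assert f(data) == datasets_shared_above(data)
              print(datasets_shared_above(data))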

  4. […] cognitively difficult task into a cognitively easy one. *How* to share data (well)(or well enough)(or how to know what is well enough) is hard. How to get my data into that format is […]

    Pingback by Data standards address cognitive barriers too « Research Remix — May 25, 2011 @ 12:16 pm

