Organizational software and computational tools

There's essentially no topic that I'd like to discuss here that I wouldn't also be happy to discuss multiple times. Like everyone, my approaches to any particular organizational task or scientific problem change over time, and the resources, technologies, and prevailing opinions within the community (particularly the bioinformatics community) are constantly in flux. I've spent part of this weekend reorganizing my personal file space and directory layout, since data and document locations were getting out of hand, but I'd rather not discuss the new system until it's had time to sink in a bit. Besides, Bill Noble's written an excellent piece in PLoS CB that deals more fully with the architecture and philosophy of project organization. Instead, I'll mention some of the software tools I use for organization in general, thus guaranteeing myself not one but two follow-up topics (one on the new filesystem layout, and one when this entry is inevitably outdated).

First, note that almost everything I'll discuss here is free, and most of it's linked from my lab's computational resources page. That will obviously be updated more often than this topic will arise on Sundays, so keep an eye out.

As a starting point, I'm writing this in EverNote, which is essentially an auto-synchronizing database of taggable text files. It's roughly the equivalent of putting hundreds of little text files in Dropbox (more on that below), but without the chaos that would ensue if one actually did such a thing. EverNote allows you to create rich text notes (the "rich" part of which I never use), organize them into "notebooks" (folders), and organize them further with zero or more free-text tags. I have a half dozen notebooks representing categories of events (meetings, talks, etc.) and apply tags based on the note's topics. For example, the minutes from a meeting about the Human Microbiome Project would land in the Meetings notebook with an hmp tag. Each note's title records the date, time, people involved, and topic. The fact that EverNote has a key-bindable whole-database search really ties the room together.

I must be losing my touch if I'm referencing the same movie twice in subsequent weeks. Unfortunately, the movies we watched during workouts the past few weeks aren't quite as intellectual and lack family-friendly quotable quotes.

In any case, the absolute most important item EverNote remembers for me (aside from everything) is my todo list. I've found no purpose-made todo list application that provides the flexibility I need to manage tasks for myself and my advisees, with different priorities, different deadlines (many with none beyond "as soon as possible"), arbitrary notes, and rapid editing. While I realize that essentially every modern todo application claims exactly those features, none that I've tried handles them in a way that I actually find myself using effectively. So plain text in EverNote it is, where I can reorder lines, add indented notes, and write in dates as needed.

As an added bonus, EverNote deleted a couple of my notes during a server outage last year. This may sound like a bad thing, but I only lost about a half day of unimportant notes, and they provided a year's premium subscription for free. While I've never come even close to the generous bandwidth limits of the free service, the ability to disable their little ads is welcome. I recommend that everyone suffer unrecoverable data loss as an alternative to irritating online ads.

Dropbox has been praised across the Internet as the second coming of sliced bread, and there's little I need say about it here. It is a file synchronization service that Just Works: drop a file into your Dropbox folder, which integrates seamlessly with the rest of your filesystem, and it's automatically uploaded to a central server and downloaded onto all of your client machines, regardless of location and operating system. I find it a bit creepy that copies of all of my grants and papers are living on Dropbox's corporate servers, but A) I don't do anything particularly secret and B) it's worth it, believe me! I find myself using five different computers on a regular basis, and Dropbox guarantees that the same files are available on all of them, all the time, without any intervention needed on my part.

On a related note, our revision control system of choice is Mercurial. Dropbox provides unstructured, public file synchronization; Mercurial provides structured, private synchronization with finer change tracking and all of the diff/merge bells and whistles you'd expect from an RCS. That is, it provides a simple mechanism by which the change history for any file can be stored, logged, searched, and visualized. Imagine not having to save separate files named grant_final.docx, grant_final_comments.docx, grant_really_final.docx, and grant_really_final2.docx all the time! Mercurial (like other RCSs) allows you to "commit" versions of a file to a "repository", which is simply a hidden database that remembers each change you made. This makes it easy to ask questions like, "What's changed between the document I'm editing now and the version from two weeks ago?"

Each of the lab's major projects has an individual repository stored in a shared location on our internal server; a subset of these are automatically synchronized with read-only public copies. Additionally, several of the lab members and I have personal repositories (all mirrored and automatically backed up on the internal server) that we can use for private scripts, document change tracking, or other tidbits that we don't want to lose. In addition to papers, anything that looks like code should absolutely be version controlled, just for your own sanity - when it breaks, there's a good chance hg diff will tell you why!

I wish I could show you our nifty internal Redmine site, which hosts the lab wiki and automatically synchronizes with these Mercurial repositories. Redmine is a complete web-based project management suite that we've not yet taken full advantage of. We use its wiki features extensively, and most lab members use that space to share results, notes, and weekly progress reports. Redmine also includes automatic issue ticketing and tracking, calendaring, todo items, and a host of other bells and whistles that I've not yet explored as much as I'd like to. Its ease of installation and diversity of features alone make it worth a look, though, even if you only end up using one or two.

To return a final time to the theme of data synchronization, Zotero performs that function for our PDFs. I'm unfortunately not thrilled with it, and I've tried several other tools over the years. The three things that frustrate me the most are that A) none of these tools really do a particularly good job of managing a database of papers to read, B) none of them even make a passable effort at managing both PDFs and references (a la EndNote, coming up below), and C) there are two Mac-only applications that satisfy these needs admirably. Windows developers, get your acts together! That being said, Zotero fits the bill of automagically grabbing papers from the Internet, synchronizing them among computers (after some headaches setting up unrestricted storage), sharing them with lab members, and organizing them using folders and tags. Now if only it didn't keep telling me I had so many left unread...

EndNote is the lab's reference manager of choice, due to my academic upbringing and its acceptance as a de facto standard among journals and colleagues. For those who haven't encountered it, EndNote is a database manager for citations - not papers themselves, just references and bibliographies - that integrates (relatively) seamlessly with Word, allows you to jot in approximate references as you type, and subsequently formats them to any one of hundreds of publication standards. Its major drawbacks are perennial bugginess and being exceptionally non-free; on the up side, using it will get you a discount on many journals' page fees, so it pays for itself if you're a responsible scientist writing papers on a regular basis. I've found that the second version that comes out after each new version of Word tends to be the best behaved - X3 works great with Word 2007, and X4 with Word 2010. Mix and match at your own peril, and heaven help you if you have a Mac.

As this clearly implies, the lab uses the unholy alliance of EndNote and Word for authoring manuscripts. I'm very familiar with LaTeX and BibTeX - I took notes for years using LyX - but they're not particularly adequate for collaborative scientific editing. It's tough to paste an image into LaTeX, and diff just doesn't stand up to Word's multi-author change tracking. Anti-Microsoft sentiment seems to have died down since the whatever-the-past-decade-is-called, and I can only hope that tricks like these continue to help their penetrance in academia. There comes a point at which it's worth paying for tools that reliably help to save time!

Speaking of, we've dabbled with Illustrator for manuscript figures, and while it's obviously the most flexible tool out there, it's even less free than EndNote. Instead, I've been pleasantly surprised by the leaps and bounds taken by InkScape and OpenOffice Draw in the last few versions. Both of these are vector graphics programs comparable in spirit to Illustrator and appropriate for the diagrams, figure assembly, and line art characteristic of authoring scientific manuscripts. Each does have its own selection of warts and lumps sadly characteristic of free software, though. I find InkScape to be slightly more conducive to creating single diagrams from the ground up and OOD better for integrating multiple images (e.g. figure subparts), but your mileage may vary. For those who have used either program in the past, be advised that at least in my experience, the latest versions (0.48 and 3.2, respectively) are far more usable than historical releases.

Finally, the most cross-cutting tool I find myself using in day-to-day operations is Cygwin. If you're in a career that involves data - any data - try it out! On the surface, it's just a nuts-and-boltsy port of a standard Linux environment that runs on Windows; it provides a command prompt and a whole bunch of UNIX tools that interact seamlessly with Windows and its filesystem. That looks ugly from a usability perspective; who but a computer scientist wants to be using a black-and-white command prompt?

What I find to be the redeeming feature is that the folks who developed UNIX back in the 70s were scientists: these tools were made to move data around, not just for computer geeks. scp gets me data from my server; sort, grep, uniq, cut, and less keep me from having to reinvent the wheel when processing data. Even if you're allergic to computers, forcing yourself to use a command line interface with some of these tools is worth the effort; the learning curve is a bit harsher than Excel's, but in the long run, it's a lot faster than clicking on cells in a spreadsheet.
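As a hypothetical example of the kind of wheel these tools save you from reinventing (the data, sample names, and organisms below are all invented), counting how often each organism appears in a tab-delimited table takes one pipeline:

```shell
# Fake tab-delimited data: sample, organism, relative abundance.
# cut pulls out column 2 (the organism), sort groups identical lines,
# uniq -c counts each group, and sort -rn puts the most frequent first.
printf 'sample1\tE_coli\t0.12\nsample1\tB_fragilis\t0.40\nsample2\tE_coli\t0.33\n' |
  cut -f2 | sort | uniq -c | sort -rn
```

The same question in a spreadsheet means a pivot table and a lot of clicking; here it's one line that works identically on three rows or three million.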

For my own sanity and due to juggling several multi-PI projects in the HMP, I've been very interested lately in reproducible research. This can be defined as the use of scriptable workflow environments to track what steps were run on what data to produce what results at each stage of an analysis. For example, to perform metabolic reconstruction of several hundred HMP metagenomic samples, we need to run each group of sequences through BLAST, extract part of the results, split them, compare the two branches with KEGG and MetaCyc, respectively, perform a few alternate analyses on each branch, and finally consolidate and evaluate the results quantitatively. That's a lot of steps on a lot of data files, and it's important to write the process down for archival purposes before it's run and to record what actually happened during each run. Taverna and Kepler are canonical tools to manage such workflows that I unfortunately find to be horrid, and Galaxy is an up-and-coming web application that also has some rough edges left to polish. Machine learning tools in a similar spirit (e.g. RapidMiner and Orange) are remarkably usable, but they're generally confined to numerical tasks and not appropriate for general data analysis.

The tool that the lab has currently settled on for reproducible research is SCons, a Python-based build tool primarily intended for compiling computer programs (a la make). SCons allows one to specify a workflow as a flexible (perhaps overly so) combination of descriptive rules (e.g. "Input file X can be transformed to output file Y using program Z") and imperative statements ("To perform this transformation, run code A then B then C"). This puts it midway between a dependency-based system like make and an imperative scripting language like, well, Python. Key for its use in a data analysis setting, it notices changes to input files and reruns only the necessary steps when there's a modification, and it can automatically parallelize independent processing steps. These features are both shared with make itself, but SCons workflow definitions use a syntax that's not old enough to drink; that's a plus.

Sergey Fomel and Gilles Hennenfent have written an entire book chapter about Reproducible computational experiments using SCons. They use some tricks that I find a bit overly complex, but it's extremely thorough and a great example of the power of this technique. The learning curve will hurt for N minutes, but rerunning the same analysis M times for each of P projects will hurt M*P times as much. Unless you're lucky (M =~ 1) or lazy (P =~ 0), you're better off taking the hit up front.

To wrap up with a paragraph of polyglot that you're welcome to skip if (or when) it's uninteresting, my programming environments of choice are, in order of speed-of-development to speed-of-execution ratio, Ruby using Notepad++, Python using PyDev in Eclipse, and C++ using Visual Studio Express. These are essentially size 1, 5, and 10 knitting needles, respectively, or drill bits if you're feeling masculine. I find myself writing Ruby one-liners from the command line to perform data manipulation too irritating for shell commands, Python scripts for more extensive manipulation and visualization, and C++ programs to provide finalized implementations of scalable algorithms in packages like Sleipnir. I have no strong feelings about any of these languages or environments - that's seriously not a road anyone should go down - but I do recommend having some comparable tools in your toolbox. I've known folks who have great luck using C# for tasks where I'd probably use Python, for example, and we all use R for anything heavily numerical.

The bad news is that this is taking on more of a laundry list flavor than I'd like, but the good news is that I've essentially exhausted the computational tools that I and the lab use on a daily basis. I certainly appreciate any suggestions for ways to make research, grant writing, and editing more efficient, so hopefully there's at least some value in listing the set of tools that come together to keep our lab ticking. Well, those in combination with some middling to good coffee; we're taking suggestions for a new department espresso machine as well! Coffee tastes better than laundry lists any day.