Nice blog post from Rafa Irizarry on why Interactive Data Analysis (IDA) is important, as opposed to mindlessly applying workflows.
Some points I agree with:
- IDA is necessary to discover outliers, to get a “feel” for the data, to check if applied analyses are appropriate
- “Data generators” who produce the raw data are usually not trained data analysts
Some reservations I have about the post:
- I think that knocking mindlessly-applied workflows is a bit of a crowd-pleasing, “preaching to the choir” statement. If you ask people directly, no one would sign on to the statement “We should use workflows without thinking about whether they are appropriate” (even if in practice that is what many of us are doing, myself included)
- Standardized workflows are useful for reproducibility. Outliers that screw up data analyses are like bugs in computer code. And as anybody who’s tried to get IT help knows, one of the first things we’re asked to do is to reproduce the bug.
What I especially like is his call for IDA to be a bigger part of existing workflows. That is to say, when designing a data analysis pipeline, one should think about how to incorporate diagnostic checks and interactive analysis steps along the way, as a sort of heuristic debugging process. My hunch is that most people already do this, but the challenge is to formalize it as part of the process. That’s definitely something I’ll think about as I go about analyzing my own data.
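As a concrete illustration of what a formalized diagnostic step might look like, here is a minimal Python sketch. The function names and the Tukey-fence rule are my own illustrative choices, not anything from the original post; the point is simply that a pipeline can halt for interactive inspection when the data look suspicious, rather than silently running to completion.

```python
# Sketch of one way to build a diagnostic check into a pipeline.
# Illustrative only: the names and the outlier rule (Tukey's fences)
# are arbitrary choices for this example.

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    xs = sorted(values)
    n = len(xs)

    def quantile(q):
        # simple linear-interpolation quantile
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        frac = pos - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

def run_pipeline(data):
    flagged = iqr_outliers(data)
    if flagged:
        # Halt for interactive inspection instead of silently continuing
        raise ValueError(f"outliers need manual review: {flagged}")
    return sum(data) / len(data)  # downstream analysis placeholder
```

The design choice here is that the check is a first-class pipeline step with an explicit failure mode, which is one way to make the "heuristic debugging" reproducible rather than ad hoc.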
The necessity of IDA also explains why there’s no such thing as taking “a quick look” at the data to see if there’s something interesting there (also sometimes overheard: “just run it through your pipeline”). I work mostly with genomic data, and most of my time is spent interacting with the data, determining if a particular question is even appropriate to ask of a particular data set. “Quick and dirty” is usually more dirty than quick, when all is said and done….
Late last year, my colleague Silke W and I went to Denmark for a short field trip to collect ciliates, where we were hosted by Lasse Riemann of the University of Copenhagen. The site where we collected our material was Nivå Bay, which is famous among environmental microbiologists for several decades of studies on microbial sulfur cycling.
Nivå Bay (above, view from birdwatching tower on a sunny day) is a shallow, sheltered bay where the water is only knee- to waist-height at low tide. Scattered between the tufts of seaweed and seagrass were some off-white, slimy films on the surface of the sediment. These are actually bacterial “veils”, sheets of mucus in which the bacteria that produce them embed themselves. Like a veil made of lace, each sheet is punctuated by many holes. Unlike a wedding veil, these veils are not meant to hide anything. Instead, you can think of them as a sort of natural-born environmental engineering – the holes allow water to flow through, and the bacteria actively circulate water by beating their flagella. By working together in these colonies, the bacteria can set up a continuous flow of water through the veil. This flow mixes sulfide-rich water coming from below with oxygenated water from above, bringing together the chemicals that they use to generate energy.
Several species of bacteria display this behavior. One of them has the wonderful name Thioturbo danicus – the sulfur whirl of Denmark. It has flagella on both poles of its rod-shaped cells. In this video you can see what happens when a single cell is detached from the mucus veil – it ends up tumbling like a propeller, which probably was the inspiration for its name!
Here is a somewhat degraded veil that had been sitting around in a Petri dish for too long. Taken from its natural environment, it soon becomes overgrown with grazing protists and small animals that methodically eat up the bacteria:
You can read more about the veil-forming bacteria from these publications from the microbiologists at Helsingør: Thar & Kühl 2002, Muyzer et al. 2005.
The Pulfrich Effect is an optical phenomenon where objects (or images) moving in a single plane can appear to be in 3D when the light reaching one eye is dimmed, e.g. with a filter. It also has a curious history – Carl Pulfrich (biography – pdf), who discovered the phenomenon, was blind in one eye and never observed it himself, but nonetheless made many contributions to stereoscopy (the study of 3D vision) in both theory and the construction of apparatus.
Unlike other forms of stereoscopy, this only works with moving objects or animations; it does not work with still images! But what’s really cool is that you don’t need any special equipment to view it, beyond a piece of darkened glass or plastic to act as a filter. Videos exhibiting the Pulfrich effect can be viewed on a normal monitor or TV screen.
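To give a sense of the magnitude involved, here is a back-of-envelope sketch based on the textbook account of the effect: dimming one eye delays its signal, so a target moving horizontally at angular velocity v acquires an effective binocular disparity of roughly v × delay, which the brain reads as depth. The numbers below are illustrative, not measured values.

```python
# Rough sketch of the standard explanation of the Pulfrich effect.
# A filter over one eye delays that eye's signal; for a horizontally
# moving target, the delay translates into an apparent disparity.
# All numbers here are illustrative assumptions.

def pulfrich_disparity(angular_velocity_deg_s, delay_s):
    """Apparent disparity (degrees) induced by an interocular delay."""
    return angular_velocity_deg_s * delay_s

# e.g. a target sweeping at 10 deg/s viewed with a ~15 ms delay
d = pulfrich_disparity(10.0, 0.015)  # about 0.15 deg of apparent disparity
```

This also makes clear why still images cannot show the effect: with zero velocity, the delay produces zero disparity.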
In my day job I work with metagenomes from animals and protists that have bacterial symbionts, and I’ve blogged here before about why visualizations are so useful to metagenomics (mostly to flog my own R package). However most existing tools, including my own, require that you install additional software and all the libraries that come with them, and also be familiar with the command line. That’s pretty standard these days for anyone who wants to do serious work with such data, but it can be a big hurdle for teaching. Time in the classroom is limited, and ideally we want to spend more time teaching biology than debugging package installation in R.
Recently stumbled across a 2013 paper from Ryan and Irene Newton describing a tool, called PhyBin, for binning phylogenetic trees, i.e. clustering them by similarity into groups (“bins”). They use the Robinson-Foulds metric to represent the distance between trees.
The reason for doing this is to look at the phylogenies of individual gene ortholog clusters in a set of genomes, to find those genes that have a phylogeny different from the others. This might be useful e.g. to detect genes that have undergone horizontal gene transfer. The example they used for their paper was the insect symbiont Wolbachia.
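For readers unfamiliar with the Robinson-Foulds metric, here is a minimal sketch of the idea in Python. It uses rooted trees written as nested tuples of leaf names (a toy representation, not PhyBin's actual input format) and counts the clades found in one tree but not the other.

```python
# Minimal sketch of the Robinson-Foulds idea on rooted trees given as
# nested tuples of leaf names. Toy representation for illustration only;
# PhyBin itself works on Newick trees.

def clades(tree):
    """Return the set of internal-node leaf sets (clades) of a tree."""
    out = set()

    def walk(node):
        if isinstance(node, str):  # a leaf
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        out.add(leaves)            # record this internal node's clade
        return leaves

    walk(tree)
    return out

def rf_distance(t1, t2):
    """Size of the symmetric difference of the two trees' clade sets."""
    return len(clades(t1) ^ clades(t2))

t_a = ((("A", "B"), "C"), "D")  # ((A,B),C) then D
t_b = ((("A", "C"), "B"), "D")  # ((A,C),B) then D
```

Here `t_a` and `t_b` disagree only on whether A groups with B or with C, so their RF distance is 2 (one clade unique to each tree). Trees with many such conflicts end up far apart, which is what lets a tool like PhyBin cluster gene trees into bins.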
It seems like a nice way to screen a set of genomes for genes that might be interesting. I had wanted to try to do something like this, but with a concordance-factor approach instead. Some other thoughts:
- Each gene is represented by one tree – uncertainty is not taken into account, unlike with concordance factors, as implemented in BUCKy for example
- If there are horizontally-transferred genes, they would probably have a patchy distribution and not be present in every species. But genes present in only some genomes would be excluded from the analysis upfront, a limitation that also applies to concordance analysis. In the PhyBin paper the authors mention the case of the Wolbachia prophage, which runs into precisely this limitation.
- Collapsing short branches is a good idea
We are often interested in ratios between two quantities. As an example, let’s use data from a study on the sugar content of soft drinks, where the sugar content declared on the drink label was compared to the actual sugar content measured in the laboratory (Ventura et al. 2010, Obesity – pdf). The paper includes a nice table summarizing their measurements, which I have adapted to produce the plots shown here.
How can we present this data to get the most insight? In my opinion, presenting such data as ratios can obscure useful information; showing scatterplots of the two quantities can make it easier to spot patterns.
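A small numerical example makes the point. The numbers below are made up for illustration (they are not values from Ventura et al. 2010): two drinks with the same declared-to-measured ratio can differ substantially in absolute sugar content, and the ratio alone hides that.

```python
# Illustrative (made-up) numbers: a ratio collapses two quantities into
# one, so drinks with identical ratios can have very different absolute
# discrepancies. Not data from Ventura et al. 2010.

drinks = [
    # (declared g/serving, measured g/serving)
    (10.0, 11.0),
    (40.0, 44.0),
]

ratios = [measured / declared for declared, measured in drinks]
diffs = [measured - declared for declared, measured in drinks]
# Both ratios are 1.1, but the absolute excess is 1 g vs 4 g.
# A scatterplot of measured vs declared preserves both pieces of
# information; the ratio alone does not.
```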
What I want for next Christmas is an obsolete Illumina machine: this blog from a Swarthmore professor documents how he is tearing down a used Illumina GAIIx sequencer to scavenge the parts to build an automated fluorescence microscope on the cheap. Human ingenuity (and the perfectly good things that people throw away) will never fail to amaze me….