In my day job I work with metagenomes from animals and protists that have bacterial symbionts, and I’ve blogged here before about why visualizations are so useful to metagenomics (mostly to flog my own R package). However, most existing tools, including my own, require that you install additional software and all the libraries that come with it, and that you be familiar with the command line. That’s pretty standard these days for anyone who wants to do serious work with such data, but it can be a big hurdle for teaching. Time in the classroom is limited, and ideally we want to spend more time teaching biology than debugging package installation in R.
Of course I lied a little – you still need to mess around with the command line before you have something to visualize. Feeding it raw reads from the sequencer won’t work, and neither will FASTA files of the assembly. You’ll first need to calculate the coverage and GC% values using a read mapper, ideally bbmap.sh, which is the most user-friendly mapping tool out there, in my opinion. Once you’ve done that, you should have a table of read coverage, length, and GC% statistics that can be loaded into gbtlite. If you don’t have any data of your own but are just curious about how this works, try out some of the example data sets linked on the page. (You have to first download the raw text files and then upload them again … sorry.)
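If you’re wondering what two of those three columns actually are, here is a minimal Python sketch that computes length and GC% per contig from a FASTA file. It is only an illustration, not what bbmap.sh does internally: the function name `contig_stats` is made up, I’m assuming GC% is counted over unambiguous A/C/G/T bases only, and the tab-separated output is not necessarily the layout gbtlite expects. Coverage still has to come from a read mapper.

```python
from io import StringIO


def contig_stats(fasta_lines):
    """Yield (contig_id, length, gc_percent) for each record in a FASTA stream.

    Assumption: GC% = (G + C) / (A + C + G + T) * 100, ignoring N and other
    ambiguity codes. Other tools may define it slightly differently.
    """

    def summarize(name, chunks):
        seq = "".join(chunks).upper()
        gc = seq.count("G") + seq.count("C")
        at = seq.count("A") + seq.count("T")
        gc_pct = 100.0 * gc / (gc + at) if (gc + at) else 0.0
        return name, len(seq), gc_pct

    name, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield summarize(name, chunks)
            # Use the first word of the header as the contig ID.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield summarize(name, chunks)


# Tiny made-up example; real input would be an open assembly file.
demo = StringIO(">contig_1 demo\nATGC\nGGCC\n>contig_2\nATATNN\n")
for contig_id, length, gc_pct in contig_stats(demo):
    print(f"{contig_id}\t{length}\t{gc_pct:.1f}")
```

For a real assembly you would pass an open file handle instead of the `StringIO` demo, then join the result with the per-contig coverage reported by the mapper.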
If you’ve managed to follow along up to this point, you should see something like this:
You can see various clusters of points – each cluster usually represents the genome of a single organism. There are a few nifty features to try:
- You can explore the effect of different scales (linear, logarithmic, square-root), which have a big influence on our ability to perceive these clusters.
- Change the size or color of the plot points, if you’d like, or filter the plot so that you only see contigs that are above a certain minimum length.
- If you scroll down the page there’s a summary of the assembly statistics, and also a histogram of the contig lengths (which is a rough measure of the quality of a metagenome assembly).
- You can even zoom and pan in the plot with the mouse, and mouseover for details on each contig in the plot, as shown below:
These animated transitions were quite straightforward to implement in D3 (and there are lots of tutorials out there on the subject!) but are no joke in R. A word of warning, though: the rendering is done completely in your browser, so it begins to get sluggish beyond a few thousand plot points. The first set of demo data (shown above) works fine, but the second one is much larger and may hang your browser window. Also, don’t click the button that says “Do not click here”.
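To make the scale and minimum-length options from the feature list a bit more concrete, here is a rough Python sketch of the same idea applied to a table before plotting. It is not gbtlite’s code: the function `prepare_points`, the row layout, and the choice of GC% on the x-axis and coverage on the y-axis are all assumptions for illustration. Dropping short contigs this way is also one way to get a big table down to a few thousand points.

```python
import math

# The three scale choices mentioned above, as plain functions.
SCALES = {
    "linear": lambda value: value,
    "sqrt": math.sqrt,
    "log": math.log10,
}


def prepare_points(rows, min_length=0, x_scale="linear", y_scale="log"):
    """Filter contigs by length and transform their plot coordinates.

    rows: iterable of (contig_id, length, gc_percent, coverage) tuples
          (a hypothetical layout, not necessarily gbtlite's input format).
    Returns a list of (contig_id, x, y) with the chosen scales applied.
    """
    fx, fy = SCALES[x_scale], SCALES[y_scale]
    points = []
    for contig_id, length, gc_pct, coverage in rows:
        if length < min_length:
            continue  # same idea as the minimum-length filter
        # A log scale cannot show zero or negative values, so skip those.
        if x_scale == "log" and gc_pct <= 0:
            continue
        if y_scale == "log" and coverage <= 0:
            continue
        points.append((contig_id, fx(gc_pct), fy(coverage)))
    return points


# Made-up rows: (contig_id, length, GC%, coverage)
rows = [
    ("a", 500, 40.0, 100.0),
    ("b", 2000, 60.0, 0.0),
    ("c", 3000, 50.0, 1000.0),
]
for point in prepare_points(rows, min_length=1000):
    print(point)
```

On a log scale, coverage values that differ by orders of magnitude end up evenly spaced, which is why low-coverage clusters that are squashed against the axis on a linear scale become easier to tell apart.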
At the end of the day, though, this is still just a teaching tool. It’s meant to illustrate the basic idea, but for serious work you’ll still have to take the trouble to install software (or write your own) and learn how to use it. If you’re curious about real-world examples of how visualizations are useful to metagenomics, my colleague Adrien Assie has also been blogging his walk-through of a typical analysis workflow, which gives an idea of what we do with our data and what kinds of questions we are interested in.