New paper! Visualizing metagenome binning with R

My latest paper, written with my colleague Harald Gruber-Vodicka, has just been published in Frontiers in Microbiology! It describes a software tool I wrote, called gbtools, that makes it easier to visualize metagenomic data for binning (GitHub project page). What exactly does it do, and how do I use it, you ask? Read on…

Metagenomics has become a routine tool for environmental microbiologists, because DNA sequencing is now relatively cheap and easy. However, analyzing the resulting data remains a significant bottleneck, and visualizing and extracting meaning from gigabytes of A’s, T’s, C’s and G’s is a challenge.

What many microbiologists, myself included, want to do is extract individual genomes from metagenomes. In my own work, for example, we have sequenced and assembled the metagenomes of ciliates that carry a dense covering of bacterial symbionts. What we really want, though, is the genome of the main symbiont species, separated from the sequences belonging to the host and other bacteria.

There are plenty of software tools out there for “binning”, as this process is called, but not so many for visualizing the results. Most automated binning tools use some combination of sequence composition and abundance to group contigs that probably come from the same genome; ultimately, it’s a statistical and machine-learning problem.
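To make that idea concrete, here is a toy sketch in base R. It is not the method of any particular binning tool: it simulates contigs from three made-up genomes that differ in GC content and coverage, then clusters them with k-means on GC and log-coverage. All the numbers and variable names are invented for illustration, and real binners typically use richer composition signals, such as tetranucleotide frequencies.

```r
# Toy example: simulate contigs from three hypothetical genomes
set.seed(42)
n <- 50  # contigs per simulated genome
contigs <- data.frame(
  genome   = rep(c("A", "B", "C"), each = n),  # true origin (unknown in real data)
  gc       = c(rnorm(n, 0.35, 0.01),
               rnorm(n, 0.50, 0.01),
               rnorm(n, 0.65, 0.01)),
  coverage = c(rlnorm(n, log(10), 0.1),
               rlnorm(n, log(100), 0.1),
               rlnorm(n, log(1000), 0.1))
)

# Cluster on GC and log10(coverage); scale() puts both on a comparable footing
features <- scale(cbind(contigs$gc, log10(contigs$coverage)))
km <- kmeans(features, centers = 3, nstart = 20)
contigs$bin <- km$cluster

# Cross-tabulate true genome against assigned bin
bin_table <- table(contigs$genome, contigs$bin)
print(bin_table)
```

With clusters this well separated, each simulated genome should end up in a bin of its own. Real metagenomes are rarely so tidy, which is why it helps to look at the data.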

With automated binning, though, it’s hard to see what’s going on. With a good visualization, it should be possible to see at a glance whether a sample is simple or complex and whether particular bins appear to be complete genomes or are fragmentary, and to compare the results of different binning tools or pipelines. The concept is not new: Mads Albertsen and colleagues published a paper in which they used plots of sequence coverage (i.e. abundance) and GC%, annotated with information about conserved marker genes, to manually select bins from a relatively complex metagenome. However, their scripts, written in R, require the user to work directly with the raw R code and data structures, so it isn’t straightforward to apply them to your own data. In fact, gbtools began its life when I adapted the scripts from Albertsen et al. to save myself unnecessary typing, and then gradually accumulated more features.

gbtools is a package for R, and its goal is to make the visualization process easier and more intuitive. It simplifies importing data and creating plots, and aims to keep the messy back-end stuff out of view, so that users can concentrate on exploring their own data. Ideally, once you have assembled a metagenome and calculated the read coverage and GC% for each contig with a read mapper like BBMap, you can generate a plot for a first look at your data in just three lines of R code:


> library (gbtools) # Load the package
> d <- gbt(covstats="SampleA2.covstats") # Import data
> plot (d) # Draw the plot

The file for the above example can be found in the gbtools package under example_data/Olavius_example. The plot should look like this:

Coverage-GC plot for metagenome of an animal with three or more bacterial symbionts.


It becomes a bit clearer if you only show the contigs which are longer than 3 kb:

> plot (d, cutoff=3000)

Coverage-GC plot showing only contigs longer than 3 kb.

Then it’s easier to see that there are three clusters of contigs (on the right), corresponding to single genomes, which belong to the bacterial symbionts of the animal from which this metagenome was sequenced.
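For readers curious about what a coverage-GC plot with a length cutoff involves, here is a rough base-R sketch on made-up data. It does not use gbtools and is not how gbtools is implemented: the column names, the helper functions `filter_by_length` and `plot_cov_gc`, the "at least `cutoff` bases" rule, and the point-size scaling are all my own assumptions for illustration.

```r
# Made-up contig table; the column names are hypothetical, not gbtools' own
set.seed(1)
n <- 300
contigs <- data.frame(
  length   = round(rlnorm(n, meanlog = log(2000), sdlog = 1)),
  gc       = runif(n, 0.3, 0.7),
  coverage = rlnorm(n, meanlog = log(50), sdlog = 1)
)

# Keep only contigs at least `cutoff` bases long (assumed rule for this sketch)
filter_by_length <- function(df, cutoff = 0) {
  df[df$length >= cutoff, , drop = FALSE]
}

# GC on the x-axis, coverage on a log-scaled y-axis,
# point size growing with contig length (an arbitrary choice here)
plot_cov_gc <- function(df, cutoff = 0) {
  kept <- filter_by_length(df, cutoff)
  plot(kept$gc, kept$coverage, log = "y",
       cex = sqrt(kept$length) / 50,
       pch = 20, col = rgb(0, 0, 0, 0.3),
       xlab = "GC", ylab = "Coverage")
  invisible(kept)  # return the plotted contigs without printing them
}

long_contigs <- plot_cov_gc(contigs, cutoff = 3000)
```

Short contigs tend to have noisier coverage and GC estimates, so dropping them usually makes any clusters stand out more clearly.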

More examples are available at the GitHub page, and in the manual. You can read more about the design of the tool, and how we used it to curate the results of automatic binning pipelines, in the paper. If you do try it out, I would be glad to hear your feedback and suggestions for improvements or new features, which you can send to me via the GitHub page.

Happy binning!

Coverage-GC plot of a synthetic metagenome comprising 64 microbial strains; contigs with marker genes are colored by their taxonomic affiliation

