My latest paper, written with my colleague Harald Gruber-Vodicka, has just been published in Frontiers in Microbiology! It describes a software tool I wrote, called gbtools, that makes it easier to visualize metagenomic data for binning (GitHub project page). What exactly does it do, and how do I use it, you ask? Read on…
Metagenomics has become a routine tool for environmental microbiologists, because DNA sequencing is now relatively cheap and easy. However, analyzing the resulting data remains a significant bottleneck, and visualizing and extracting meaning from gigabytes of A’s, T’s, C’s and G’s is a challenge.
What many microbiologists, myself included, want to do is extract individual genomes from metagenomes. In my own work, for example, we have sequenced and assembled the metagenomes of ciliates that carry a dense covering of bacterial symbionts. What we really want, though, is the genome of the main symbiont species, separated from the sequences that belong to the host and to other bacteria.
There are plenty of software tools out there for “binning”, as this process is called, but not so many for visualizing the results. Most automated binning tools use some combination of sequence composition and abundance to separate out contigs that probably come from the same genome; ultimately, it’s a statistical and machine-learning problem.
The results of binning are hard to inspect, though. With a good visualization, it should be possible to see at a glance whether a sample is simple or complex and whether particular bins look like complete genomes or fragments, and to compare the results of different binning tools or pipelines. The concept is not new: Mads Albertsen and colleagues published a paper in which they used plots of sequence coverage (i.e. abundance) and GC%, annotated with information about conserved marker genes, to manually select bins from a relatively complex metagenome. However, their scripts, written in R, require the user to work directly with the raw R code and data structures, so it isn’t straightforward to apply them to your own data. In fact, gbtools began its life when I adapted the scripts from Albertsen et al. to save myself unnecessary typing, and then gradually accumulated more features.
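To make the idea of a coverage-GC plot concrete, here is a minimal sketch in base R. It is not how gbtools or the Albertsen et al. scripts are implemented; the contig table, its column names and all values are invented for illustration, and `plot_gc_cov` is a hypothetical helper.

```r
# Toy contig table; every value here is made up for illustration.
contigs <- data.frame(
  id       = paste0("contig_", 1:6),
  length   = c(12000, 8500, 900, 15000, 700, 22000),  # contig length in bp
  gc       = c(0.35, 0.36, 0.50, 0.62, 0.48, 0.61),   # GC fraction
  coverage = c(40, 42, 3, 150, 5, 160)                # mean read depth
)

# Hypothetical helper: GC on the x-axis, coverage on a log-scaled y-axis,
# with point size growing with contig length so long contigs stand out.
plot_gc_cov <- function(tab) {
  plot(
    x    = tab$gc,
    y    = tab$coverage,
    log  = "y",
    cex  = sqrt(tab$length) / 50,
    pch  = 20,
    xlab = "GC fraction",
    ylab = "Coverage (log scale)"
  )
  invisible(tab)
}

plot_gc_cov(contigs)
```

In a plot like this, contigs from the same genome tend to share similar GC content and coverage, so they fall close together; in the toy data, the two low-GC contigs and the two high-GC, high-coverage contigs form separate groups.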
gbtools is an R package whose goal is to make the visualization process easier and more intuitive. It simplifies importing data and creating plots, and aims to keep the messy back-end details out of view, so that users can concentrate on exploring their own data. Ideally, once you have assembled a metagenome and calculated the read coverage and GC% for each contig with a read mapper like BBMap, you can generate a plot for a first look at your data in just three lines of R code:
> library (gbtools) # load the package
> d <- gbt(covstats="SampleA2.covstats") # Import data
> plot (d) # Draw the plot
The file for the above example can be found in the gbtools package under example_data/Olavius_example. The plot should look like this:
The picture becomes a bit clearer if you show only the contigs that are longer than 3 kb:
> plot (d, cutoff=3000)
Then it’s easier to see that there are three clusters of contigs (on the right), corresponding to single genomes, which belong to the bacterial symbionts of the animal from which this metagenome was sequenced.
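What a length cutoff does can be mimicked in base R on a plain table of contigs. This is only a sketch of the idea, not the gbtools implementation: the table, the `apply_cutoff` helper and the choice of a strict "longer than" comparison at the boundary are all assumptions made for illustration.

```r
# Toy table of contig lengths in bp; values are invented.
contigs <- data.frame(
  id     = c("a", "b", "c", "d"),
  length = c(500, 2999, 3500, 45000)
)

# Hypothetical helper: keep only contigs longer than `cutoff` bp.
# Treating the boundary as exclusive is an assumption of this sketch.
apply_cutoff <- function(tab, cutoff) {
  tab[tab$length > cutoff, , drop = FALSE]
}

long_contigs <- apply_cutoff(contigs, cutoff = 3000)
print(long_contigs)
```

Short contigs carry noisier GC and coverage estimates, so dropping them removes much of the scatter and leaves the clusters easier to pick out.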
More examples are available at the GitHub page, and in the manual. You can read more about the design of the tool, and how we used it to curate the results of automatic binning pipelines, in the paper. If you do try it out, I would be glad to hear your feedback and suggestions for improvements or new features, which you can send to me via the GitHub page.