The R tidyverse

Most regular R users will have felt the influence of Hadley Wickham, whether through the widely-used ggplot2 package that implements the “grammar of graphics”, devtools, plyr, … the list goes on. I was astounded when I first realized that the same person was responsible for all these really useful things.

Most software packages aim to make particular tasks easier in a given language. By contrast, many of the tools that he has developed in effect streamline the grammar of the language itself. Once you use ggplot2 and see how intuitive it is to deal with statistical graphics in that way, the base R plot commands feel impossibly clunky. Similarly, his paper on tidy data and the accompanying tidyr and plyr packages articulate basic ideas about how data should be organized in tables. These ideas sound very simple, and most of us have probably had similar thoughts cross our minds as we struggled to reshape raw data into analyzable form, but I certainly would not have been able to formulate the concepts so clearly, or to implement solutions that change our relationship to data wrangling.

The various packages seem to have evolved towards a common style and design philosophy, and late last year most of them were bundled together in a ‘super-package’ called tidyverse. This makes installation much easier, because you can now make sure all these inter-dependent packages are up-to-date with a single command, and it probably makes development easier for him and his team too. It also goes together with a book titled R for Data Science that he and a coauthor have just released, which is also available online. Noted here for future reference!

Binning trees by topology

I recently stumbled across a 2013 paper from Ryan and Irene Newton describing a tool, called PhyBin, for binning phylogenetic trees, i.e. clustering them by similarity into groups (“bins”). They use the Robinson-Foulds metric to represent the distance between trees: it counts the splits (bipartitions of the taxa induced by internal branches) that are present in one tree but not in the other.
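To make the metric concrete, here is a minimal Python sketch of the Robinson-Foulds distance for unrooted trees. It is not PhyBin's code: the nested-tuple tree representation and the function names (`leaf_set`, `splits`, `rf_distance`) are my own assumptions for illustration, and branch lengths are ignored.

```python
# Sketch only: trees are nested tuples of leaf names (strings),
# e.g. (("A", "B"), ("C", "D")). This representation is an assumption.

def leaf_set(tree):
    """Return the set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial splits of a tree, treated as unrooted.

    Each split is stored as the side that does NOT contain a fixed
    reference leaf, so the result is independent of the rooting.
    """
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        side = all_leaves - clade if ref in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    walk(tree)
    return found


def rf_distance(tree1, tree2):
    """Count the splits found in exactly one of the two trees."""
    if leaf_set(tree1) != leaf_set(tree2):
        raise ValueError("trees must have the same set of leaves")
    return len(splits(tree1) ^ splits(tree2))


t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
t1_rerooted = ("A", ("B", ("C", "D")))  # same unrooted topology as t1
```

Here `rf_distance(t1, t2)` is 2, since each tree has one internal split that the other lacks, while `rf_distance(t1, t1_rerooted)` is 0 because rooting does not change the splits.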

The reason for doing this is to look at the phylogenies of individual gene ortholog clusters in a set of genomes, to find those genes that have a phylogeny different from the others. This might be useful e.g. to detect genes that have undergone horizontal gene transfer. The example they used for their paper was the insect symbiont Wolbachia.

It seems like a nice way to screen a set of genomes for genes that might be interesting. I had wanted to try to do something like this, but with a concordance-factor approach instead. Some other thoughts:

  • Each gene is represented by one tree – uncertainty is not taken into account, unlike with concordance factors, as implemented in BUCKy for example
  • If there are horizontally-transferred genes, they would probably have a patchy distribution and not be present in every species. But genes that are present in only some genomes would be excluded before the analysis even starts, which is also true of concordance analysis. In the PhyBin paper the authors mention the case of the Wolbachia prophage, which runs into precisely this limitation.
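  • Collapsing short branches is a good idea, so that trees which differ only by poorly-resolved splits are not treated as having different topologies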
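The binning idea, together with the short-branch collapsing from the last point, can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not how PhyBin is implemented: a leaf is a string, an internal node is a hypothetical `(branch_length, [children])` pair, and the gene names, branch lengths and `min_length` threshold are made up.

```python
from collections import defaultdict

# Sketch only. A leaf is a string; an internal node is a pair
# (branch_length, [children]). The root's own length is ignored.


def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node[1]))


def splits(tree, min_length=0.0):
    """Non-trivial splits of an unrooted tree, after collapsing
    internal branches shorter than min_length."""
    all_leaves = leaves(tree)
    ref = min(all_leaves)
    lengths = defaultdict(float)

    def walk(node):
        if isinstance(node, str):
            return
        length, children = node
        clade = leaves(node)
        side = all_leaves - clade if ref in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            # The two branches at a bifurcating root form one
            # unrooted edge, so their lengths are summed.
            lengths[side] += length
        for child in children:
            walk(child)

    walk(tree)
    return frozenset(s for s, total in lengths.items() if total >= min_length)


def bin_trees(trees, min_length=0.0):
    """Group tree names by identical topology, largest bin first."""
    bins = defaultdict(list)
    for name, tree in trees.items():
        bins[splits(tree, min_length)].append(name)
    return sorted(bins.values(), key=len, reverse=True)


# Made-up gene trees on four taxa.
gene_trees = {
    "gene1": (0.0, [(0.3, ["A", "B"]), (0.3, ["C", "D"])]),
    "gene2": (0.0, ["A", (0.5, ["B", (0.4, ["C", "D"])])]),
    "gene3": (0.0, [(0.2, ["A", "C"]), (0.2, ["B", "D"])]),
    "gene4": (0.0, [(0.001, ["A", "D"]), (0.001, ["B", "C"])]),
}

for members in bin_trees(gene_trees, min_length=0.01):
    print(members)
```

In this toy input, gene1 and gene2 share a topology despite different rootings and land in the same bin, gene3 gets its own bin, and gene4 collapses to an unresolved star tree because its only internal branch is below the threshold. Genes in the small bins would be the ones to look at more closely.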