Binning trees by topology

I recently stumbled across a 2013 paper by Ryan and Irene Newton describing a tool called PhyBin for binning phylogenetic trees, i.e. clustering them by similarity into groups (“bins”). They use the Robinson-Foulds metric, which counts the splits (bipartitions of the taxa) present in one tree but not the other, as the distance between trees.
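
To make the idea concrete, here is a minimal sketch of the Robinson-Foulds distance and of the simplest kind of binning (grouping trees with identical topology). It is not PhyBin's implementation: the nested-tuple tree representation, the function names `splits`, `rf_distance` and `bin_by_topology`, and the toy gene trees are all assumptions made for illustration, and every tree is assumed to contain the same set of taxa.

```python
from collections import defaultdict


def splits(tree):
    """Return the non-trivial splits of a tree given as nested tuples of leaf names."""
    def leaf_set(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaf_set(child) for child in node))

    all_leaves = leaf_set(tree)
    ref = min(all_leaves)  # reference taxon used to pick one side of each split
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        # Store the side of the split that does NOT contain the reference taxon,
        # so the same split always gets the same representation.
        side = all_leaves - clade if ref in clade else clade
        # Skip trivial splits (a single leaf versus everything else).
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    walk(tree)
    return frozenset(found)


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: number of splits found in exactly one of the trees."""
    return len(splits(tree_a) ^ splits(tree_b))


def bin_by_topology(named_trees):
    """Group gene names whose trees have identical split sets (RF distance 0)."""
    bins = defaultdict(list)
    for name, tree in named_trees.items():
        bins[splits(tree)].append(name)
    return sorted(bins.values(), key=len, reverse=True)


# Hypothetical gene trees over the same five taxa.
genes = {
    "geneA": (("A", "B"), ("C", ("D", "E"))),
    "geneB": (("B", "A"), (("E", "D"), "C")),  # same topology, drawn differently
    "geneC": (("A", "C"), ("B", ("D", "E"))),  # A and C swapped partners
}
print(rf_distance(genes["geneA"], genes["geneC"]))
print(bin_by_topology(genes))
```

The largest bin would then be the candidate for the "usual" phylogeny, and genes in small bins the ones worth a closer look.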

The reason for doing this is to compare the phylogenies of individual gene ortholog clusters across a set of genomes, and find those genes whose phylogeny differs from the others. This might be useful e.g. to detect genes that have undergone horizontal gene transfer. The example they used in their paper was the insect symbiont Wolbachia.

It seems like a nice way to screen a set of genomes for genes that might be interesting. I had wanted to try something like this, but with a concordance-factor approach instead. Some other thoughts:

  • Each gene is represented by a single tree, so uncertainty in that tree is not taken into account, unlike with concordance factors as implemented in e.g. BUCKy
  • If there are horizontally transferred genes, they would probably have a patchy distribution and not be present in every species. But genes that are present in only some genomes would be excluded from the analysis up front, as they would be in a concordance analysis too. In the PhyBin paper the authors mention the Wolbachia prophage as a case where precisely this limitation applies.
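  • Collapsing short branches before comparing topologies is a good idea, since splits that rest on near-zero branch lengths are poorly supported and would otherwise make trees look more different than they really are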
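
As a rough illustration of the last point, collapsing can be sketched as turning every internal branch shorter than some threshold into a polytomy. This is only a sketch under assumptions of my own, not how PhyBin does it: the `(content, branch_length)` node representation, the function name `collapse_short` and the threshold values are hypothetical.

```python
def collapse_short(node, threshold):
    """Collapse internal branches shorter than `threshold` into polytomies.

    A node is a pair (content, branch_length): content is a leaf name (str)
    for a leaf, or a list of child nodes for an internal node.
    """
    content, length = node
    if isinstance(content, str):
        return node  # leaves are never collapsed
    new_children = []
    for child in content:
        child = collapse_short(child, threshold)
        child_content, child_length = child
        if not isinstance(child_content, str) and child_length < threshold:
            # Short internal branch: attach its children directly to this node.
            # Their branch lengths are left unchanged, since only the topology
            # is compared afterwards.
            new_children.extend(child_content)
        else:
            new_children.append(child)
    return (new_children, length)


# Hypothetical tree in which the (A, B) clade sits on a very short branch.
tree = ([([("A", 0.1), ("B", 0.1)], 0.001), ("C", 0.2), ("D", 0.3)], 0.0)
print(collapse_short(tree, 0.01))
```

With a threshold of 0.01 the (A, B) grouping disappears and the four taxa form a single polytomy, so it can no longer count as a conflicting split when this tree is compared with others.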