[Edited on 10 Feb to fix some errors and ambiguous wording pointed out by Lucas. Added text in blue – thanks Lucas!]
Most software used by academic scientists is made by other scientists and available for use free of charge, but the phrase caveat emptor – buyer beware – still applies. As end users, we trust these tools to do more or less what they say on the box, but this doesn’t always happen.
The Exelixis lab, makers of the popular phylogenetic tool RAxML, have recently released a preprint (bioRxiv) looking at whether phylogenetic tree drawing programs draw support values properly. Short version: they don’t always do it right! And these errors can creep, and have crept, into the published literature. (Dendroscope, one of the tools compared, released a bug fix soon after this article came out.) Coincidentally, I had met the lead author of the preprint, Lucas Czech, at a conference recently, but only came across this article when I was searching for something else online.
The main reason for the problem is, as with most problems in bioinformatics, different file formats. Support values can be written in a file either as properties of nodes or of branches. If a tree file is formatted one way but the drawing program assumes the other, then the support value can end up being placed in the wrong location, especially if the tree is rerooted for drawing.
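To make the rerooting problem concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the five-leaf tree, the node names X, Y and R, and the helper functions are hypothetical and are not taken from the preprint or from any drawing program. The idea is that a value stored on node X implicitly describes the branch between X and its parent, and which branch that is depends on where the root sits.

```python
from collections import deque

# Hypothetical five-leaf unrooted tree, stored as an adjacency list.
# X, Y and R are internal nodes. Rooted at R, the Newick string would be
# ((a,b)X,(c,d)Y,e)R.
ADJ = {
    "a": ["X"], "b": ["X"], "c": ["Y"], "d": ["Y"], "e": ["R"],
    "X": ["a", "b", "R"],
    "Y": ["c", "d", "R"],
    "R": ["X", "Y", "e"],
}
LEAVES = frozenset("abcde")


def parents(root):
    """Map each node to its parent when the tree is rooted at `root`."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in ADJ[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return parent


def split_of_edge(u, v):
    """Split of the leaves into two groups made by cutting the branch u-v."""
    seen = {u}
    stack = [u]
    while stack:
        node = stack.pop()
        for nxt in ADJ[node]:
            crosses_cut = node == u and nxt == v
            if nxt not in seen and not crosses_cut:
                seen.add(nxt)
                stack.append(nxt)
    side = frozenset(seen) & LEAVES
    return frozenset({side, LEAVES - side})


def split_above(node, root):
    """Split made by the branch joining `node` to its parent under `root`."""
    return split_of_edge(node, parents(root)[node])


def show(split):
    return "|".join(sorted("".join(sorted(side)) for side in split))


# A value stored on node X describes a different branch after rerooting:
print(show(split_above("X", "R")))   # ab|cde
print(show(split_above("X", "a")))   # a|bcde
# A value stored on the branch X-R describes the same split either way:
print(show(split_of_edge("X", "R")))  # ab|cde
```

In this toy case, a support value written as a label on node X would end up on the branch leading to leaf a once the tree is rerooted there, unless the program moves the label. A value attached to the branch X-R needs no such bookkeeping.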
Support values should properly be considered properties of branches, not nodes. In case it isn’t completely clear why this should be the case, I’ve written a short explanation below.
The simplest possible tree
Take the simplest tree for which we can calculate support values, an unrooted tree comprising four leaves. In bracket notation, ((a,b),(c,d)). There are two internal nodes, which we will call e and f, and an internal branch connecting them, EF.
Support values are a measure of how well-supported a given topology is. For this tree, there are two alternative topologies to which to compare our tree, namely ((a,c),(d,b)) and ((a,d),(b,c)). It should be clear that ((b,a),(d,c)) and (a,(b,(c,d))) for example are equivalent to our first tree, the latter because the trees are unrooted.
The relationship between our initial tree and our alternative trees is analogous to the relationship between cis and trans isomers across an asymmetrical carbon-carbon double bond, which you learn about in beginning organic chemistry.
Internal branches such as EF are also called bipartitions, because they split the nodes and/or leaves in the tree into two sets (on the left and right). For example, the internal branch EF in our first tree produces the bipartition ab/cd. Here we are only dealing with strictly bifurcating trees.
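The equivalence of the different bracket notations can be checked mechanically. Below is a minimal sketch that represents trees as nested Python tuples (my own choice of representation, purely for illustration) and collects the bipartitions made by internal branches. Two unrooted trees have the same topology exactly when these sets match.

```python
def leaf_set(tree):
    """All leaf names below a node; a leaf is simply a string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nontrivial_splits(tree):
    """Bipartitions made by internal branches of a nested-tuple tree.

    Each bipartition is an unordered pair of leaf sets, so ab/cd and
    cd/ab count as the same thing.
    """
    everything = leaf_set(tree)
    found = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        side = leaf_set(node)
        rest = everything - side
        # Skip leaf branches and the (empty) split above the top node.
        if len(side) >= 2 and len(rest) >= 2:
            found.add(frozenset({side, rest}))
        stack.extend(node)
    return found


ours = (("a", "b"), ("c", "d"))
same_1 = (("b", "a"), ("d", "c"))
same_2 = ("a", ("b", ("c", "d")))
other_1 = (("a", "c"), ("d", "b"))
other_2 = (("a", "d"), ("b", "c"))

print(nontrivial_splits(ours) == nontrivial_splits(same_1))   # True
print(nontrivial_splits(ours) == nontrivial_splits(same_2))   # True
print(nontrivial_splits(ours) == nontrivial_splits(other_1))  # False
print(nontrivial_splits(ours) == nontrivial_splits(other_2))  # False
```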
One popular method for calculating support is the nonparametric bootstrap. Basically, the columns of the alignment used to make the tree are randomly sampled with replacement, and the resampled alignment is used to calculate a new tree. This is repeated many times, and the set of new bootstrap trees is examined. For each bipartition on the original tree, we count how many times that same bipartition occurs in the set of bootstrapped trees. In our four-leaf tree, there is only one internal branch, with three possible ways of splitting up the taxa, as shown earlier. If the bipartition ab/cd appears in 80% of the bootstrap trees, ad/cb in 15%, and ac/bd in 5%, then we say the bipartition ab/cd has 80% bootstrap support, and put the number “80%” on that branch.
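The counting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 100 "bootstrap trees" are hand-made stand-ins matching the 80/15/5 example rather than the output of any real inference, trees are nested tuples, and the function names are hypothetical.

```python
from collections import Counter


def leaf_set(tree):
    """All leaf names below a node; a leaf is simply a string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nontrivial_splits(tree):
    """Bipartitions made by the internal branches of a nested-tuple tree."""
    everything = leaf_set(tree)
    found = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        side = leaf_set(node)
        rest = everything - side
        if len(side) >= 2 and len(rest) >= 2:
            found.add(frozenset({side, rest}))
        stack.extend(node)
    return found


def bootstrap_support(reference, replicates):
    """Percentage of replicate trees containing each reference bipartition."""
    counts = Counter()
    for replicate in replicates:
        counts.update(nontrivial_splits(replicate))
    return {
        split: 100.0 * counts[split] / len(replicates)
        for split in nontrivial_splits(reference)
    }


reference = (("a", "b"), ("c", "d"))
# Made-up replicate set: 80 x ab/cd, 15 x ad/cb, 5 x ac/bd.
replicates = (
    [(("a", "b"), ("c", "d"))] * 80
    + [(("a", "d"), ("c", "b"))] * 15
    + [(("a", "c"), ("b", "d"))] * 5
)

for split, percent in bootstrap_support(reference, replicates).items():
    label = "/".join(sorted("".join(sorted(side)) for side in split))
    print(label, percent)  # ab/cd 80.0
```

Note that the result is keyed by bipartition, not by node: nothing in the calculation refers to e or f at all.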
It should therefore, I hope, sound reasonable to assert at this point that support values like the bootstrap should be properties of branches, because they are derived by comparing bipartitions, which are derived from branches.
Confusion can arise when we draw the tree in rooted form.
Where does the 80% label go on the tree above? We have arbitrarily chosen to root it in the middle (midpoint rooting), which is often done when there is no compelling reason to choose any other point to root the tree. The bipartition spans the root, but if we put the label there (as shown on the left) it could be misconstrued to imply that there is 80% support for placing the root in that location, which is not what we are trying to say!
A compromise solution is to draw the 80% label twice, on both the branch leading to ab and the branch leading to cd (as shown on the right). This is done in the software tool FigTree, for example. However, this is also potentially misleading, because it looks like two independent support values for each of the two clades, when in reality it is the same number.
If we instead root the tree between d and abc, e.g. if taxon d is our biological outgroup, then it looks weird because the branch leading to ab has a support value, whereas the branch leading to abc is undecorated. A naive reader seeing such a tree might wonder why that branch has “zero support”. In fact, that branch corresponds to the bipartition d/abc. But because the left side of that bipartition, d, is a leaf, that bipartition must always occur in all the bootstrap trees, so by definition it has 100% support!
If we consider the root as simply a special kind of leaf (botanists may disagree), then it can be argued that it can also take a support value, corresponding to our confidence (or likelihood or posterior probability or whatever framework you choose) that the root should be in that position.
Let’s call the root r and treat it in our notation like we would a leaf. Then our two alternative rooted trees above become (r,((a,b),(c,d))) and ((r,d),(c,(b,a))) respectively.
The act of arbitrarily rooting the tree is equivalent to giving it a subjective support value, proportional to our belief that this is the proper position for the root.
There may be phylogenetic models that consider rooted trees. Just as a hypothetical example (this may or may not be a realistic model), the algorithm could automatically midpoint-root all the inferred trees, including the bootstrap trees. The root should then be considered in making the bipartitions, and so the support values of the most basal branches would have a natural interpretation.
What does the root really represent?
We have so far imagined roots as special types of leaves, but it is worth thinking further about what roots really represent. Usually biologists perform phylogenetic analyses on only a subset of existing organisms (placing the root in the universal tree of life is a story for another time…). To extend the arboreal analogy, we are taking only a single branch of the entire tree of life, and pretending that it is a miniature tree, with the base of the branch its “root”.
From this perspective, unless we are dealing with the universal tree, the root is not a special kind of leaf, but an arbitrary breakpoint. Let’s use a biological example. Suppose you are interested in the phylogeny of plants. If we look only at the vascular plants, we have four taxa – monocots, dicots, conifers, and ferns (in oversimplified terms). The unrooted tree topology is ((M,D),(C,F)), because monocots and dicots are both flowering plants, and they are incontrovertibly each other’s closest relatives. But where should we root the tree?
We could arbitrarily say that ferns are our outgroup, and root the tree there: (F,(C,(M,D))). But then we would be making a subjective or heuristic argument. We could also look at the branch lengths and calculate a midpoint root. What is the support value? Because the outgroup is a single leaf, the bootstrap for that bipartition F/CMD is 100%! Technically correct, but misleading for the reasons explained already.
What we really should be doing is to zoom out and include more organisms in our analysis. Let’s throw in what is likely the next closest relative to vascular plants, the mosses (Moss). We get a tree like (Moss,(F,(C,(M,D)))). Moss is our outgroup, but again it is a single taxon and we have the same problem as above. But if we add one more outgroup taxon, liverworts, we get the topology ((Moss,Liverwort),(F,(C,(M,D)))). Now we have a proper bipartition between the outgroup and the ingroup, and we can put a “real” support value on the root branch.
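The effect of outgroup size can be checked with a small Python sketch. As before, this is an illustration under my own assumptions (trees as nested tuples, hypothetical helper names), not a recipe from any phylogenetics package. It asks whether the outgroup/ingroup split corresponds to an internal branch, which is the only kind of branch whose bootstrap value can be anything other than 100%.

```python
def leaf_set(tree):
    """All leaf names below a node; a leaf is simply a string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nontrivial_splits(tree):
    """Bipartitions made by the internal branches of a nested-tuple tree."""
    everything = leaf_set(tree)
    found = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        side = leaf_set(node)
        rest = everything - side
        if len(side) >= 2 and len(rest) >= 2:
            found.add(frozenset({side, rest}))
        stack.extend(node)
    return found


def outgroup_branch_is_testable(tree, outgroup):
    """True if outgroup/ingroup is a bipartition made by an internal branch."""
    outgroup = frozenset(outgroup)
    ingroup = leaf_set(tree) - outgroup
    return frozenset({outgroup, ingroup}) in nontrivial_splits(tree)


ferns_out = ("F", ("C", ("M", "D")))
moss_out = ("Moss", ("F", ("C", ("M", "D"))))
two_out = (("Moss", "Liverwort"), ("F", ("C", ("M", "D"))))

print(outgroup_branch_is_testable(ferns_out, {"F"}))               # False
print(outgroup_branch_is_testable(moss_out, {"Moss"}))             # False
print(outgroup_branch_is_testable(two_out, {"Moss", "Liverwort"})) # True
```

With a single-taxon outgroup the split is made by a leaf branch, so it appears in every possible tree on those taxa and carries no information. Only the two-taxon outgroup gives a split that a bootstrap tree could fail to recover.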
I hope that I have given a convincing explanation of why:
- Support values should be properties of branches, not of nodes,
- Support values around the root of a tree should be treated with caution, and
- Outgroups should contain at least two taxa for the support values to make sense.