Molecular Evolution and Phylogenetics
Learning notes of Chapter 27, Molecular Evolution and Phylogenetics, Computational Biology, MIT course 6.047/6,878
Basics of phylogeny
- Trees
- node: a divergence event between two ancestral lineages
- leaves: present objects
- root: the common ancester
- edge: a relationship between two nodes
- node: a divergence event between two ancestral lineages
- branch length
- Cladogram: no meaning to branch lengths; only the sequence and topology of the branching matters.
- Phylogram: branch lengths are directly related to the amount of genetic change.
- Chronogram (ultrametric tree): branch lengths are directly related to time.
- The leaves in this tree end on the same vertical line
- Time and genetic change are not necessarily proportional because evolution rates / mutation rates are not constant.
Traits
- Morphological traits: empirical evaluation of physical traits.
- advantages: easy to quantify and understand.
- disadvantages: small set of traits, only build species, easier to be ‘tricked’ by convergent evolution.
- Sequence traits: genomes
- advantages: much more data to create gene trees.
- difficulties: DNA is only built from 4 bases, so back mutations are frequent.
- Morphological traits: empirical evaluation of physical traits.
homology: a pair of genes are called
- paralogues: diverged from a duplication event
- orthologues: diverged from a speciation event
It is hard to extract correct sequences from extinct species due to DNA breaking down over time and contamination from the environment.
- Methods for tree construction
- Distance based approaches
- quantify the amount of mutations to fit the most likely tree according to the pair-wise distance matrix
- direct algorithm based on some assumptions
- Character based approaches
- rely on tree proposal and scoring techniques to perform a heuristic search over the space of trees
- Distance based approaches
Distance based methods
- Nucleotide divergence: based on the number of places where nucleotides are not consistent. This assumes that evolution happens at a uniform rate across the genome, and that a given nucleotide is just as likely to evolve into any of the other three nucleotides.
- Transitions and transversions: recognize that A-G and T-C substitutions are most frequent.
- Synonymous and non-synonymous substitutions: substitutions that do not change the coded protein will have a higher probability of occurring than those substitutions which do change the coded amino acid.
A distance metric satisfying
Additive distance: satisfy
But the naive mismatch fraction do not always have this property because of Back-mutations: When a large number of mutations accumulate on a sequence, not all the mutations introduce new mismatches, some of them may occur on already mutated base pair, resulting in the mismatch score remaining the same or even decreasing.
Jukes-Cantor model
A simple markov model to backtrack
For a very short time
Other models
# substitution types | unequal base frequency | equal base frequency |
---|---|---|
4 | GTR (general time reversible) | SYM (symetric) |
3 | TrN(Tamura-Nei) 2 transitions, 1 transversion | K3ST (Kimura 3 sub. type) 1 transition, 2 transversion |
2 | HKY85 (Hasegawa-Kishino-Yano) F84 (Felsenstein) 1 transition, 1 transversion | K2P (Kimura 2 parameter) 1 transition, 1 transversion |
1 | F81 (Felsenstein) 1 substitution type | JC (Jukes Cantor) 1 substitution type |
Current area of research: different parameters for different parts of the tree to account for different mutation rates.
Distances to trees
In distance based methods, the problem is to reconstruct the tree given this distance matrix.
The evolutionary distance is an inequality rather than an equality. E.g., the sequences’ distance between mouse and rat is probably less than
Note: Each tree does correspond to one distance matrix, but the opposite is not always true. A distance matrix has to satisfy additional properties in order to correspond to some weighted tree.
There are two models that assume special constraints on the distance matrix:
- Ultrametric
- For all triplets
of leaves, labelled such that - The two leaves that are more closely related (say
) have diverged from the third ( ) at exactly the same time.
- For all triplets
- Additive
- All quartet of leaves labelled
satisfy
- All quartet of leaves labelled
These types of redundant equalities must occur while mapping a tree to a distance matrix, because a tree of
Real datasets do not exactly satisfy either ultrametric or additive constraints due to noise, stochasticity and randomness, fluctuations, different rates of mutations, gene conversions and horizontal transfer.
- UPGMA - Unweighted Pair Group Method with Arithmetic Mean: Hierarchical clustering
- Ultrametrification of non-ultrametric trees
- construct a completely connected graph with weights given by the original distance matrix
- find a minimum spanning tree (MST) (ie Prims algorithm)
- build a new distance matrix with elements
given by the largest weight on the unique path in the MST from to
- Weaknesses
- lack of robustness
- suffer from the molecular clock assumption that the mutation rate over time is constant for all species, which can lead to long branch attraction
- Ultrametrification of non-ultrametric trees
- Neighbor joining
- Finding the neighboring leaves: Let
It can be proved that the above modification ensures that is minimal only if are neighbor. (P189 of Durbin’s book) - Initialization: Define
to be the set of leaf nodes, one per sequence. Let - Iteration:
- Pick
such that is minimized. - Define a new node
, and set . - Add
to , with edges of lengths - Remove
from - Add
to
- Pick
- Termination: When
consists of two nodes , and the edge between them of length , add the root node as parent of and .
- Finding the neighboring leaves: Let
Character-Based Methods
- Score the probability that a given tree would produce the observed sequences at its leaves.
- Search through the space of possible trees for a tree that maximizes that probability. (NP-Hard
tractable heuristic search methods)
Scoring
- Paisimony: score a topology based on the minimum number of mutations it implies, given the (known) sequences at the leaves.
The algorithm first scans up from the (known) leaf sequences, assigning a set of bases at each internal node based on its children. Next, it iterates down the tree, picking bases out of the allowed sets at each node, this time based on the node’s parents.
Given a tree, and an alignment column. Label internal nodes to minimize the number of required substitutions. - Initialization: Set cost
; - Iteration:
- If
is a leaf, set - If
is not a leaf, let be the daughter nodes;
Setif insertion is nonempty
Set, and if insertion is empty
- If
- Termination: Minimal cost
of tree for column - Traceback:
- Choose an arbitrary nucleotide from
for the root - Having chosen nucleotide
for parent ,
Ifchoose for daughter
Else, choose arbitrary nucleotide from
- Choose an arbitrary nucleotide from
- Advantages: simple and fast
Disadvatages:- assumes that a given base pair undergoes a substitution along at most one branch from a given node, ignoring highly probably internal sequences that violate this assumption.
- do not explicitly model the time represented along each edge, and thus cannot account for the increased mutation rates along edges that represent a long temporal duration.
- Initialization: Set cost
- Maximum Likelihood - Peeling Algorithm: score a tree according to the (log) joint probability of observing the data and the given tree
.
The peeling algorithm considers individual base pairs and assumes that all sites evolve independently.
Each node has a character
E.g.
By marginalization
In general, let
Then we want to compute the probability of observing
Finally, multiply all the probabilities for individual sites to obtain the probability of observing the set of entire sequences.
In addition, we can multiply the resulting score by some prior probabilities of the tree structure and the set of branch lengths, generated using explicit modeling of evolutionary processes (Yule process or birth-death models like the Moran process).
The overall complexity
Search
The number of full rooted trees with
- Heuristic search algorithms (successive local optimization, converge towards an overall good solution)
- Initialization: Take some tree as the base of iteration (randomly or according to some other prior, or from the distance based direct algorithms).
- Proposal: Propose a new tree by randomly modifying the current tree slightly, using Nearest Neighbor Interchange (NNI) scheme.
- The tree space should be connected (successive proposals).
- An individual new proposal should be sufficiently close to the original.
- Score: Score the new proposal according to the methods above.
- Select: Randomly select the new tree or the old tree (corresponding probabilities according to the score (likelihood) ratio).
- Always accept the one that scores better, and accept the one that scores worse with some (not too much) probability.
- Metropolis Hastings, a Markoc Chain Monte Carlo Method (MCMC), for exploring the state space.
- Iterate: Repeat to proposal step unless some termination criteria is met (some threshold score or number of steps reached).
Bootstrapping: run the algorithm over and over using substes of the base pairs in the leaf sequences, ehen favoring global trees that match the topologies generated by using only these subsequences.
Possible theoretical and practical issues
- Rely on heavily conserved genes, ignoring non-conserved genes.
- Take into account less conserved genes
- Aligned sequences are still not explicit in regards to the events that created them.
- Combinations of speciation, duplicatioin, loss, and horizontal gene transfer (hgt) events are easy to mix up.
- Debate in the field on a non arbitrary way to define species and to infer phylogenetic relationships to recreate the tree of life.
- Assume all the concatenated genes had the same history, but events such as hgt and duplications could have occurred differently for different genes.
- While hgt is prevalent, orthologs used for phylogenetic reconstruction are consistent with a single tree of life.