Molecular Evolution and Phylogenetics

Posted on 2021-08-20 Edited on 2021-10-23

Learning notes of Chapter 27, Molecular Evolution and Phylogenetics, Computational Biology, MIT course 6.047/6,878

Basics of phylogeny

Trees
- node: a divergence event between two ancestral lineages
  - leaves: present objects
  - root: the common ancester
- edge: a relationship between two nodes

branch length
- Cladogram: no meaning to branch lengths; only the sequence and topology of the branching matters.
- Phylogram: branch lengths are directly related to the amount of genetic change.
- Chronogram (ultrametric tree): branch lengths are directly related to time.
  - The leaves in this tree end on the same vertical line
  - Time and genetic change are not necessarily proportional because evolution rates / mutation rates are not constant.

Traits
- Morphological traits: empirical evaluation of physical traits.
  - advantages: easy to quantify and understand.
  - disadvantages: small set of traits, only build species, easier to be ‘tricked’ by convergent evolution.
- Sequence traits: genomes
  - advantages: much more data to create gene trees.
  - difficulties: DNA is only built from 4 bases, so back mutations are frequent.
homology: a pair of genes are called
- paralogues: diverged from a duplication event
- orthologues: diverged from a speciation event

It is hard to extract correct sequences from extinct species due to DNA breaking down over time and contamination from the environment.

Methods for tree construction
- Distance based approaches
  - quantify the amount of mutations to fit the most likely tree according to the pair-wise distance matrix
  - direct algorithm based on some assumptions
- Character based approaches
  - rely on tree proposal and scoring techniques to perform a heuristic search over the space of trees

Distance based methods

Nucleotide divergence: based on the number of places where nucleotides are not consistent. This assumes that evolution happens at a uniform rate across the genome, and that a given nucleotide is just as likely to evolve into any of the other three nucleotides.
Transitions and transversions: recognize that A-G and T-C substitutions are most frequent.
Synonymous and non-synonymous substitutions: substitutions that do not change the coded protein will have a higher probability of occurring than those substitutions which do change the coded amino acid.

A distance metric satisfying is not enough.
Additive distance: satisfy for a path of evolving sequence.
But the naive mismatch fraction do not always have this property because of Back-mutations: When a large number of mutations accumulate on a sequence, not all the mutations introduce new mismatches, some of them may occur on already mutated base pair, resulting in the mismatch score remaining the same or even decreasing.

Jukes-Cantor model

A simple markov model to backtrack , the amount of time elapsed from the fraction of altered bases.
denotes the respective probabilities of observing base given a starting state of base in time ,
where . Assume this is a stationary markov model, implying the matrix is multiplicative, i.e., .
For a very short time , ignore second-order effect, i.e., there isn’t enough time for two mutations to occur at the same nucleotide. So the probabilities of cross transitions are all proportional to .
At time ,
From we obtain
which rearrange as the coupled system of differential equations
With the initial conditions and . The solutions can be obtained as
With the fraction of the sites where the bases differ in a given alignment, we have
implying
To agree asymptotically with , we set the evolutionary distance to be
The uncertainty values of the Jukes-Cantor distance also bocomes very large when approaches .

Other models

# substitution types	unequal base frequency	equal base frequency
4	GTR (general time reversible)	SYM (symetric)
3	TrN(Tamura-Nei) 2 transitions, 1 transversion	K3ST (Kimura 3 sub. type) 1 transition, 2 transversion
2	HKY85 (Hasegawa-Kishino-Yano) F84 (Felsenstein) 1 transition, 1 transversion	K2P (Kimura 2 parameter) 1 transition, 1 transversion
1	F81 (Felsenstein) 1 substitution type	JC (Jukes Cantor) 1 substitution type

Current area of research: different parameters for different parts of the tree to account for different mutation rates.

Distances to trees

In distance based methods, the problem is to reconstruct the tree given this distance matrix.

distance to trees

The evolutionary distance is an inequality rather than an equality. E.g., the sequences’ distance between mouse and rat is probably less than because of overlap, convergent evolution, and transversion.

Note: Each tree does correspond to one distance matrix, but the opposite is not always true. A distance matrix has to satisfy additional properties in order to correspond to some weighted tree.

There are two models that assume special constraints on the distance matrix:

Ultrametric
- For all triplets of leaves, labelled such that
- The two leaves that are more closely related (say ) have diverged from the third () at exactly the same time.
Additive
- All quartet of leaves labelled satisfy

These types of redundant equalities must occur while mapping a tree to a distance matrix, because a tree of nodes has parameters, one for each branch length, while a distance matrix has parameters. Hence, a tree is essentially a lower dimensional projection of a higher dimensional space.

Real datasets do not exactly satisfy either ultrametric or additive constraints due to noise, stochasticity and randomness, fluctuations, different rates of mutations, gene conversions and horizontal transfer.
tree-building algorithms that are able to handle noisy distance matrices.

UPGMA - Unweighted Pair Group Method with Arithmetic Mean: Hierarchical clustering
- Ultrametrification of non-ultrametric trees
  - construct a completely connected graph with weights given by the original distance matrix
  - find a minimum spanning tree (MST) (ie Prims algorithm)
  - build a new distance matrix with elements given by the largest weight on the unique path in the MST from to
- Weaknesses
  - lack of robustness
  - suffer from the molecular clock assumption that the mutation rate over time is constant for all species, which can lead to long branch attraction

Neighbor joining
- Finding the neighboring leaves: Let
  It can be proved that the above modification ensures that is minimal only if are neighbor. (P189 of Durbin’s book)
- Initialization: Define to be the set of leaf nodes, one per sequence. Let
- Iteration:
  - Pick such that is minimized.
  - Define a new node , and set .
  - Add to , with edges of lengths
  - Remove from
  - Add to
- Termination: When consists of two nodes , and the edge between them of length , add the root node as parent of and .

Character-Based Methods

Score the probability that a given tree would produce the observed sequences at its leaves.
Search through the space of possible trees for a tree that maximizes that probability. (NP-Hard tractable heuristic search methods)

character based model

Scoring

Paisimony: score a topology based on the minimum number of mutations it implies, given the (known) sequences at the leaves.
The algorithm first scans up from the (known) leaf sequences, assigning a set of bases at each internal node based on its children. Next, it iterates down the tree, picking bases out of the allowed sets at each node, this time based on the node’s parents.
Given a tree, and an alignment column . Label internal nodes to minimize the number of required substitutions.
- Initialization: Set cost ;
- Iteration:
  - If is a leaf, set
  - If is not a leaf, let be the daughter nodes;
    Set if insertion is nonempty
    Set , and if insertion is empty
- Termination: Minimal cost of tree for column
- Traceback:
  - Choose an arbitrary nucleotide from for the root
  - Having chosen nucleotide for parent ,
    If choose for daughter
    Else, choose arbitrary nucleotide from
- Advantages: simple and fast
  Disadvatages:
  - assumes that a given base pair undergoes a substitution along at most one branch from a given node, ignoring highly probably internal sequences that violate this assumption.
  - do not explicitly model the time represented along each edge, and thus cannot account for the increased mutation rates along edges that represent a long temporal duration.

Maximum Likelihood - Peeling Algorithm: score a tree according to the (log) joint probability of observing the data and the given tree .
The peeling algorithm considers individual base pairs and assumes that all sites evolve independently.

Each node has a character and is the corresponding branch length from its parent. are known constants, but are unknown characters at ancestral nodes which are variables to which we will assign maximum likelihood values. We want to compute .
E.g.
peeling
By marginalization
Assuming each branch evolves independently, we can use the factorization trick
Move the factors that are independent of the summation variable outside the summation variable outside the summation
The probability denotes the probability of base mutating to base given time , which is essentially obtained from the Jukes Cantor model or some more advanced model discussed earlier.

In general, let be the subtree below . We use dynamic programming array to compute , the probability of observing , if node contains base .
Then we want to compute the probability of observing , which is
For each ancestral node and its childer ,
subject to the initial conditions for the leaf nodes, i.e., for ,
is usually assigned equally or from some prior distribution but does not affect the results greatly.

Finally, multiply all the probabilities for individual sites to obtain the probability of observing the set of entire sequences.
In addition, we can multiply the resulting score by some prior probabilities of the tree structure and the set of branch lengths, generated using explicit modeling of evolutionary processes (Yule process or birth-death models like the Moran process). posterior probability (Bayesian inference framework).
The overall complexity , where is the number of leaves (taxa), is the sequence length, and is the number of characters.

Search

The number of full rooted trees with leaves is the -th catalan number
and compute the maximum likelihood set of branch length for each of these trees.
NP-Hard problem to maximize the score absolutely for all trees.

Heuristic search algorithms (successive local optimization, converge towards an overall good solution)
- Initialization: Take some tree as the base of iteration (randomly or according to some other prior, or from the distance based direct algorithms).
- Proposal: Propose a new tree by randomly modifying the current tree slightly, using Nearest Neighbor Interchange (NNI) scheme.
  - The tree space should be connected (successive proposals).
  - An individual new proposal should be sufficiently close to the original.
- Score: Score the new proposal according to the methods above.
- Select: Randomly select the new tree or the old tree (corresponding probabilities according to the score (likelihood) ratio).
  - Always accept the one that scores better, and accept the one that scores worse with some (not too much) probability.
  - Metropolis Hastings, a Markoc Chain Monte Carlo Method (MCMC), for exploring the state space.
- Iterate: Repeat to proposal step unless some termination criteria is met (some threshold score or number of steps reached).

Bootstrapping: run the algorithm over and over using substes of the base pairs in the leaf sequences, ehen favoring global trees that match the topologies generated by using only these subsequences.

Possible theoretical and practical issues

Rely on heavily conserved genes, ignoring non-conserved genes.
- Take into account less conserved genes
Aligned sequences are still not explicit in regards to the events that created them.
- Combinations of speciation, duplicatioin, loss, and horizontal gene transfer (hgt) events are easy to mix up.
Debate in the field on a non arbitrary way to define species and to infer phylogenetic relationships to recreate the tree of life.
- Assume all the concatenated genes had the same history, but events such as hgt and duplications could have occurred differently for different genes.
- While hgt is prevalent, orthologs used for phylogenetic reconstruction are consistent with a single tree of life.