Estimating dN/dS

Posted on 2021-08-15 Edited on 2021-10-23

Estimate the Numbers of Synonymous and Nonsynonymous Nucleotide Substitutions

同义（synonymous）突变：silent
非同义（nonsynonymous）突变：核苷酸改变
由于同义突变的频率远高于非同义突变的频率，且对于不同基因相似，同义替换可用作确定密切相关物种进化时间的分子钟。

当每个密码子的突变数量只有一个，同义突变和非同义突变的数量只需要通过计数可得。
但是如果一个密码子中有两个或三个位点发生突变，区分同义突变和非同义突变就不是那么容易，存在不同的突变路径。
- Perler et al. (1980)统计方法，不同进化路径赋予相同权值；
- Miyata and Yasunaga (1980) and Li et al. (1985)更复杂的方法，相对于amino acid-altering substitutions，赋予包含silent substitutions的进化路径更大的权值。

但以上方法都有些复杂，以下先介绍 Nei and Gojobori (1986) 无权值方法(unweighted version of Miyata and Yasunaga’s)能对同义突变和非同义突变的数量达到同样准确的估计。进一步，当每个位点的核苷酸替代数量相对较少时，有一个更粗糙的方法能给出足够准确的估计。

但是单一利用核苷酸替代或氨基酸替代信息会造成一些问题，因此介绍 Codon-based model (Goldman and Yang 1994) 以及 Poisson framework: dNdScv (Inigo Martincorena et al. 2017).

modified (unweighted) version of Miyata and Yasunanga’s (1980) method

genetic code table 表明密码子第二个位点的核苷酸改变必然会导致氨基酸改变，而部分第一个和第三个位点的核苷酸改变是同义突变。在相同核苷酸频率和随机替代的假设下，这个比例为

第一个位点：~5%
第三个位点：~72%

Step1: 计算同义位点和非同义位点总数。

: number of sysnonymous sites at a given codon
: number of nonsynonymous sites at a given codon
: fraction of synonymous changes at the th position of a given codon ()

则, .
例如，对于密码子TTA(Leu)，(TC), , (AG). 因此，.
对于有个密码子的DNA序列，同义位点总数，非同义位点总数. 其中是第个密码子的值。当对两个序列作比较时，使用两个序列和的均值。

Step2: 计算一对同源序列的同义和非同义核苷酸差异数量。

逐密码子比较两个序列，对同义和非同义核苷酸差异进行计数。

: number of synonymous differences per codon
: number of nonsynonymous differences per codon

一个密码子中只有一个核苷酸差异：可以立刻判断同义或非同义突变。例如比较GTT(Val)和GTA(Val)，只有一个同义的核苷酸差异，, .
一个密码子中有两个核苷酸差异：有2条可能路径。例如比较TTT和GTA，路径1，TTT(Phe)GTT(Val)GTA(Val); 路径2，TTT(Phe)TTA(Leu)GTA(Val). 路径1包含一个同义突变和一个非同义突变，路径2包含两个非同义突变。假设路径1和路径2以相同概率发生，那么, .
一个密码子中有三个核苷酸差异：有6条可能路径，每条路径包含3次突变。同理计算和.

则同义核苷酸差异总数，非同义核苷酸差异总数，其中和为第个密码子的和值，为比较的密码子数量。注意到等于两条比较的DNA序列的核苷酸差异的总数。

: proportion of synonymous differences

: proportion of nonsynonymous differences

其中和即为两条比较的序列同义位点数和非同义位点数的均值。

or : number of synonymous substitutions per synonymous site
or : number of nonsynonymous substitutions per nonsynonymous site

为估计和，使用以下由Jukes and Cantor (1969)提出的公式

其中为或. 这个方法只给出和的近似估计因为这个公式不适用于一些静默位点（twofold and threefold degenerate sites）。

尽管这个方法已经大大简化，但是当由很多密码子含多个核苷酸差异时计算仍然费时。以下为一个更简化的方法。

apply information obtained from the single-base-change to the multiple-base-change codons (by Lipman)

首先考虑只有一个核苷酸差异的密码子，对三个核苷酸位点分别计算和. 显然对于第二位点.

: proportion of synonymous differences
: proportion of nonsynonymous differences
其中.

对于有多个核苷酸差异的密码子对，对每个位点的核苷酸差异计数并乘以或.

: number of codon pairs in which the nucleotides differ only at the th position (Single base change)

在中有个同义差异核苷酸，则, .

: for the th position
: for the th position

则同义核苷酸差异总数

非同义核苷酸差异总数

Lipman提议使用在相同核苷酸频率和随机替代假设下5%第一个位点的同义突变频率和72%第三个位点的同义突变频率，则, .

计算机模拟表明Lipman方法中的和估计时接近真实值，但是其估计和的方法并不可靠，这是由于真实基因中密码子的频率常常与相同核苷酸频率的假设不符。

但幸运的是第一个方法估计同义位点数和非同义位点数较为容易，然后采用后者估计同义核苷酸差异总数和非同义核苷酸差异总数.

以上两种方法即为 **Nei and Gojobori (1986)**。但是使用Jukes and Cantor (1969)计算(or )和(or )有一些问题。第一，其忽略了四个核苷酸transition/transversion rate的偏差和差异。第二，将位点分为同义和非同义也是不Iran的，尤其当两个密码子有多个位点不同，会存在多个路径，这使得一个位点在整个进化过程中不能被直接归类为同义或非同义位点。第三，在密码子层面该公式不合理，因为例如一个非同义位点不会以相同的概率变为其它三个非同义位点。

软件codeml使用**codon substitution models (e.g., Goldman and Yang 1994)**，对蛋白质编码DNA序列进行极大似然（ML）分析，从而计算dN/dS。

codon-based model (Goldman and Yang 1994)

codon-based model
Markov process is used to describe substitutions between codons
transition/transversion rate bias and codon usage bias are allowed
selective restraints at the protein level are accommodated using physicochemical distances between aa coded for by the codons
transition/transversion rate ratio and synonymous/nonsynonymous substitution rate ratio

先前模型的数据单元为核苷酸或氨基酸，且假设进化上独立。DNA水平上包含更多信息，closely related sequences可以更好被区分；对于more-distantly related species，氨基酸可能比核苷酸更适用，可以去除一些随机噪声。但是氨基酸层面会损失信息，也并不清楚what levels of divergence the removal of noise might outweigh the loss of information and increase the accuracy of phylogenetic estimation.

显然，核苷酸序列分析更被喜欢-> devise a model of nucleotide substitution that uses simultaneously the nucleotide-level information in DNA sequences and knowledge of the genetic code and hence the amino acid-level information of synonymous (silent) and nonsynonymous (replacement) nucleotide substitutions.

modeling at the codon level, instead of at the nucleotide or aa levels.
Advantage:
- more sequence information
- more-realistic models incorporating previously ignored effects
  - lack of independence at neighboring sites within a codon triplet
  - differences in evolutionary rate at different codon positions
incorporate biologically meaningful factors such as transition/transversion bias, variability of a gene, and aa differences.
use in maximum-likelihood (m.l.) estimation of phylogenies
- models as tools for understanding molecular-sequence evolution

Initially assume no insertion or deletion -> if useful, could be modified to allow for insertions and deletions

continuous-time Markov process

general genetic code
- the states are 61 sense codons
- 3 nonsense (stop) codons are not considered in the model
  - as mutations to or from stop codons can be assumed to affect drastically the structure and function of the protein and therefore will rarely survive.

rate matrix , where represent the probability that codon will change to codon in a small interval .

Codons are written as taking values from 1 to 61, corresponding to the triplet , where each of , and is a nucleotide A, C, G, or T.

Elements are fixed so that the row sums of equal 0, allowing the solution
for the matrix whose elements give the probability that codon replaces codon after time (Cox and Miller 1977).

Only single-nucleotide substitutions are permitted to occur instantaneously, as mutations involving more than one positon will have probabilities of occurrence of order and should be ignored.
However, substitutions between any two codons is possible for any .
Each codon has at most nine “neighbors” to which it may change instantaneously.
- The rate at which each particular allowed substitution occurs is proportional to the (equilibrium) frequency () of the codon () being changed to.
- Rates of substitutions involving a transition (AG or CT) are multiplied by a factor .
  - directly affects the ratio of transition and transversion substitutions and is corporated to allow for the empirical finding that transitions often occur more frequently than do transversions.
- A multiplicative factor if the two codons code for different amino acids.
  - represents that aa coded for by codon .
  - physicochemical distances between 20 aa given by Grantham (1974), .
  - represents the variability of the gene or its tendency to undergo nonsynonymous substitution: is negatively correlated with the synonymous/nonsynonymous substitution rate ratio.
  - also implicitly enables the model to account for different substitution rates and nonindependence of substitutions at different codon positions.

is also scale s.t.
This scaling means that time and branch lengths in a tree are effectively measured as expected numbers of nucleotide substitutions per codon.

Formally, for codons and ,

if 2 or 3 of the pairs are different
if exactly 1 of the pairs is different, and that difference is a transversion
where is the scaling factor of ;
if exactly 1 of the pairs is different, and that difference is a transition
and .

Thus the Markov process specified is reversible, as

is a special form of the rate matrix for a general reversible process;
the equilibrium distribution is given by the frequencies , that is, for any and .

Use standard numerical algorithms for the diagonalization (eigensolution) of to calculate .

Effect of the parameter : measures the variability of the gene, high values of gives low (high variability gives relatively many replacement substitutions) and vice versa.

The synonymous substitution rate per codon is
The nonsynonymous rate per codon is
Since
we have
The ratio as a function of and of and the codon frequencies .
Values as are equal to expected values of if there are no selective effect caused by aa differences and if synonymous and nonsynonymous substitutions occur at equal rates.

Effect of the parameter : high values of gives high value of the ratio of instantaneous transition rate to transversion rate, , which can be calculated similarly.

We prefer the use of these instantaneous rate ratios to the use of estimates of the numbers of silent/replacement substitutions or transitions/transversions derived from “most parsimonious” character-state reconstructions, which will almost certainly underestimate numbers of changes.

From , the log-likelihood for a given tree topology can be calculated following Felsenstein (1981). The tree corresponding to the highest likelihood is the m.l. tree.

estimate from the averages of the observed codon frequencies.
, , and branch lengths in the tree are estimated by maximizing the likelihood, and standard deviations may be estimated by maximizing the curvature method.

As we assume (more for expediency) that the substitution process is homogeneous and stationary, and as we do not assume the existence of a molecular clock (i.e., we do not assume rate constancy over lineages), only unrooted trees can be estimated (Felsenstein 1981; Goldman 1991).

Calculation of in codon-based model
- and represent rates of synonymous and nonsynonymous substitutions per codon
- and corresponds to the valued of and when
  - and represent “potentials” of synonymous and nonsynonymous substitutions when no selective constraints exist at the amino acid level
  - and therefore represent the numbers of synonymous and nonsynonymous nucleotide sites per codon
- and for the m.l. values
  - those calculated using the m.l. estimates of and
- the numbers of synonymous substitutions per synonymous site
- the numbers of nonsynonymous substitutions per nonsynonymous site

The estimates of and are free from the problems of the method of Nei and Gojobori (1986).

account for different nucleotide frequencies at the three codon positions; incorporate codon usage information. account for the transition/transversion bias.
instataneous rates of substitutions between codons -> no confusion as to whether a substitution is synonymous or nonsynonymous.
and , and their ratio, are m.l. estimates.

dNdScv: poisson framework (Inigo Martincorena et al. 2017)

Mutations are classified according to their substitution type (depending on the substitution model) and functional impact (synonymous -s-, missense -m-, nonsense -n- and essential splice sites -e-). “Truncating substitutions” refers to nonsense and essential splice site substitutions together.

E.g., the number of C>T synonymous mutations ()
where is the density of substitutions per site, is the relative rate of C>T substitutions per site, and is the number of C sites in which a C>T change is synonymous.

One rate parameter of the substitution matrix is arbitrarily set to 1 so that all other rates are relative rates with respect to it. For non-synonymous sites, an extra parameter reflects the effect of selection on the accumulation of mutations:
The parameters are the dN/dS ratios inferred by the model after correcting for the rates of different substitution classes () and for sequence composition (). Maximum-likelihood estimates for all parameters in the model can be efficiently obtained by Poisson regression.

Suitable for
- cancer genomic data
- other resequencing studies
  - the density of mutations per site is low
  - human evolution and bacterial populations

A full trinucleotide model with 192 rate parameters, one for each of the possible trinucleotide substitution rates, to comprehensively avoid biases emerging from context-dependent effects (from one base upstream and downstream of the mutant base).
The adequacy of different substitutioin models in molecular evolution is often evaluated using Likelihood-Ratio Tests, Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC).

Modeling variable mutation rates vary across genes: dNdScv combines dN/dS with a negative binomial regression on a a large number of covariates.

Gene size, gene sequence and the impact of the substitution model are all accounted for as an offset (reflecting the exposure of the gene).
The normalized mutation rate per site, , is modeled as Gamma-distributed across genes, reflecting the uncertainty in the variation of the mutation rate across genes remaining after accounting for the exposure of the gene.

1	model=glm.nb(n_syn~offset(log(expected_syn))+covariate_matrix)

where n_syn for gene is , expected_syn for gene is (with being constant across genes).

The likelihood for can be constrained both by the global knowledge of how the mutation rate varies across genes and the local number of synonymous mutations in the gene.

CIs for parameters under the dNdScv model were obtained by profile likelihood integrating out .

dN/dS distribution across genes is technical:

A composite of the true variation of selection across genes and Poisson noise in the counts of non-synonymous and synonymous mutations.
Neutral simulations revealed that most of the apparent variation observed in dN/dS across genes was technical, due to a limited number of substitutions per gene.
Mixture models to eliminate the technical variation.