Estimating dN/dS
Estimate the Numbers of Synonymous and Nonsynonymous Nucleotide Substitutions
同义(synonymous)突变:silent
非同义(nonsynonymous)突变:核苷酸改变
由于同义突变的频率远高于非同义突变的频率,且对于不同基因相似,同义替换可用作确定密切相关物种进化时间的分子钟。
- 当每个密码子的突变数量只有一个,同义突变和非同义突变的数量只需要通过计数可得。
- 但是如果一个密码子中有两个或三个位点发生突变,区分同义突变和非同义突变就不是那么容易,存在不同的突变路径。
- Perler et al. (1980)统计方法,不同进化路径赋予相同权值;
- Miyata and Yasunaga (1980) and Li et al. (1985)更复杂的方法,相对于amino acid-altering substitutions,赋予包含silent substitutions的进化路径更大的权值。
但以上方法都有些复杂,以下先介绍 Nei and Gojobori (1986) 无权值方法(unweighted version of Miyata and Yasunaga’s)能对同义突变和非同义突变的数量达到同样准确的估计。进一步,当每个位点的核苷酸替代数量相对较少时,有一个更粗糙的方法能给出足够准确的估计。
但是单一利用核苷酸替代或氨基酸替代信息会造成一些问题,因此介绍 Codon-based model (Goldman and Yang 1994) 以及 Poisson framework: dNdScv (Inigo Martincorena et al. 2017).
modified (unweighted) version of Miyata and Yasunanga’s (1980) method
genetic code table 表明密码子第二个位点的核苷酸改变必然会导致氨基酸改变,而部分第一个和第三个位点的核苷酸改变是同义突变。在相同核苷酸频率和随机替代的假设下,这个比例为
- 第一个位点:~5%
- 第三个位点:~72%
Step1: 计算同义位点和非同义位点总数。
则
例如,对于密码子TTA(Leu),
对于有
Step2: 计算一对同源序列的同义和非同义核苷酸差异数量。
逐密码子比较两个序列,对同义和非同义核苷酸差异进行计数。
- 一个密码子中只有一个核苷酸差异:可以立刻判断同义或非同义突变。例如比较GTT(Val)和GTA(Val),只有一个同义的核苷酸差异,
, . - 一个密码子中有两个核苷酸差异:有2条可能路径。例如比较TTT和GTA,路径1,TTT(Phe)
GTT(Val) GTA(Val); 路径2,TTT(Phe) TTA(Leu) GTA(Val). 路径1包含一个同义突变和一个非同义突变,路径2包含两个非同义突变。假设路径1和路径2以相同概率发生,那么 , . - 一个密码子中有三个核苷酸差异:有6条可能路径,每条路径包含3次突变。同理计算
和 .
则同义核苷酸差异总数
其中
为估计
其中
尽管这个方法已经大大简化,但是当由很多密码子含多个核苷酸差异时计算仍然费时。以下为一个更简化的方法。
apply information obtained from the single-base-change to the multiple-base-change codons (by Lipman)
首先考虑只有一个核苷酸差异的密码子,对三个核苷酸位点分别计算
其中
对于有多个核苷酸差异的密码子对,对每个位点的核苷酸差异计数并乘以
在
则同义核苷酸差异总数
非同义核苷酸差异总数
Lipman提议使用在相同核苷酸频率和随机替代假设下5%第一个位点的同义突变频率和72%第三个位点的同义突变频率,则
计算机模拟表明Lipman方法中的
但幸运的是第一个方法估计同义位点数
以上两种方法即为 **Nei and Gojobori (1986)**。但是使用Jukes and Cantor (1969)计算
软件codeml
使用**codon substitution models (e.g., Goldman and Yang 1994)**,对蛋白质编码DNA序列进行极大似然(ML)分析,从而计算dN/dS。
codon-based model (Goldman and Yang 1994)
- codon-based model
- Markov process is used to describe substitutions between codons
- transition/transversion rate bias and codon usage bias are allowed
- selective restraints at the protein level are accommodated using physicochemical distances between aa coded for by the codons
- transition/transversion rate ratio and synonymous/nonsynonymous substitution rate ratio
先前模型的数据单元为核苷酸或氨基酸,且假设进化上独立。DNA水平上包含更多信息,closely related sequences可以更好被区分;对于more-distantly related species,氨基酸可能比核苷酸更适用,可以去除一些随机噪声。但是氨基酸层面会损失信息,也并不清楚what levels of divergence the removal of noise might outweigh the loss of information and increase the accuracy of phylogenetic estimation.
显然,核苷酸序列分析更被喜欢-> devise a model of nucleotide substitution that uses simultaneously the nucleotide-level information in DNA sequences and knowledge of the genetic code and hence the amino acid-level information of synonymous (silent) and nonsynonymous (replacement) nucleotide substitutions.
- modeling at the codon level, instead of at the nucleotide or aa levels.
- Advantage:
- more sequence information
- more-realistic models incorporating previously ignored effects
- lack of independence at neighboring sites within a codon triplet
- differences in evolutionary rate at different codon positions
- incorporate biologically meaningful factors such as transition/transversion bias, variability of a gene, and aa differences.
- use in maximum-likelihood (m.l.) estimation of phylogenies
- models as tools for understanding molecular-sequence evolution
Initially assume no insertion or deletion -> if useful, could be modified to allow for insertions and deletions
continuous-time Markov process
- general genetic code
- the states are 61 sense codons
- 3 nonsense (stop) codons are not considered in the model
- as mutations to or from stop codons can be assumed to affect drastically the structure and function of the protein and therefore will rarely survive.
Codons are written as
Elements
- Only single-nucleotide substitutions are permitted to occur instantaneously, as mutations involving more than one positon will have probabilities of occurrence of order
and should be ignored. - However, substitutions between any two codons is possible for any
. - Each codon has at most nine “neighbors” to which it may change instantaneously.
- The rate at which each particular allowed substitution occurs is proportional to the (equilibrium) frequency (
) of the codon ( ) being changed to. - Rates of substitutions involving a transition (A
G or C T) are multiplied by a factor . directly affects the ratio of transition and transversion substitutions and is corporated to allow for the empirical finding that transitions often occur more frequently than do transversions.
- A multiplicative factor
if the two codons code for different amino acids. represents that aa coded for by codon . - physicochemical distances
between 20 aa given by Grantham (1974), . represents the variability of the gene or its tendency to undergo nonsynonymous substitution: is negatively correlated with the synonymous/nonsynonymous substitution rate ratio. also implicitly enables the model to account for different substitution rates and nonindependence of substitutions at different codon positions.
- The rate at which each particular allowed substitution occurs is proportional to the (equilibrium) frequency (
Formally, for codons
if 2 or 3 of the pairs
are different if exactly 1 of the pairs
is different, and that difference is a transversion where is the scaling factor of ; if exactly 1 of the pairs
is different, and that difference is a transition and
.
Thus the Markov process specified is reversible, as
is a special form of the rate matrix for a general reversible process; - the equilibrium distribution is given by the frequencies
, that is, for any and .
Use standard numerical algorithms for the diagonalization (eigensolution) of
Effect of the parameter
- The synonymous substitution rate per codon is
- The nonsynonymous rate per codon is
Since we have - The ratio
as a function of and of and the codon frequencies . - Values as
are equal to expected values of if there are no selective effect caused by aa differences and if synonymous and nonsynonymous substitutions occur at equal rates.
Effect of the parameter
We prefer the use of these instantaneous rate ratios to the use of estimates of the numbers of silent/replacement substitutions or transitions/transversions derived from “most parsimonious” character-state reconstructions, which will almost certainly underestimate numbers of changes.
From
- estimate
from the averages of the observed codon frequencies. , , and branch lengths in the tree are estimated by maximizing the likelihood, and standard deviations may be estimated by maximizing the curvature method.
As we assume (more for expediency) that the substitution process is homogeneous and stationary, and as we do not assume the existence of a molecular clock (i.e., we do not assume rate constancy over lineages), only unrooted trees can be estimated (Felsenstein 1981; Goldman 1991).
- Calculation of
in codon-based model and represent rates of synonymous and nonsynonymous substitutions per codon and corresponds to the valued of and when and represent “potentials” of synonymous and nonsynonymous substitutions when no selective constraints exist at the amino acid level and therefore represent the numbers of synonymous and nonsynonymous nucleotide sites per codon
and for the m.l. values - those calculated using the m.l. estimates of
and
- those calculated using the m.l. estimates of
- the numbers of synonymous substitutions per synonymous site
- the numbers of nonsynonymous substitutions per nonsynonymous site
The estimates of
account for different nucleotide frequencies at the three codon positions; incorporate codon usage information. account for the transition/transversion bias. - instataneous rates of substitutions between codons -> no confusion as to whether a substitution is synonymous or nonsynonymous.
and , and their ratio, are m.l. estimates.
dNdScv: poisson framework (Inigo Martincorena et al. 2017)
Mutations are classified according to their substitution type (depending on the substitution model) and functional impact (synonymous -s-, missense -m-, nonsense -n- and essential splice sites -e-). “Truncating substitutions” refers to nonsense and essential splice site substitutions together.
E.g., the number of C>T synonymous mutations (
One rate parameter of the substitution matrix is arbitrarily set to 1 so that all other rates are relative rates with respect to it. For non-synonymous sites, an extra parameter
- Suitable for
- cancer genomic data
- other resequencing studies
- the density of mutations per site is low
- human evolution and bacterial populations
A full trinucleotide model with 192 rate parameters, one for each of the possible trinucleotide substitution rates, to comprehensively avoid biases emerging from context-dependent effects (from one base upstream and downstream of the mutant base).
The adequacy of different substitutioin models in molecular evolution is often evaluated using Likelihood-Ratio Tests, Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC).
Modeling variable mutation rates vary across genes: dNdScv
combines dN/dS with a negative binomial regression on a a large number of covariates.
- Gene size, gene sequence and the impact of the substitution model are all accounted for as an offset (reflecting the exposure of the gene).
- The normalized mutation rate per site,
, is modeled as Gamma-distributed across genes, reflecting the uncertainty in the variation of the mutation rate across genes remaining after accounting for the exposure of the gene.
1 | model=glm.nb(n_syn~offset(log(expected_syn))+covariate_matrix) |
where n_syn
for gene expected_syn
for gene
The likelihood
CIs for dNdScv
model were obtained by profile likelihood integrating out
dN/dS distribution across genes is technical:
- A composite of the true variation of selection across genes and Poisson noise in the counts of non-synonymous and synonymous mutations.
- Neutral simulations revealed that most of the apparent variation observed in dN/dS across genes was technical, due to a limited number of substitutions per gene.
- Mixture models to eliminate the technical variation.