MAFFT --multiple sequence alignment based on FFT

Posted on 2021-08-25 Edited on 2021-10-23

Homologous regions are rapidly identified by the fast Fourier transform (FFT)
- An amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue
Simplified scoring system that perform well for reducing CPU time and increasing the accuracy of alignments
- For sequences having large insertions or extensions as well as distantly related sequences of similar length

History
- Needleman and Wunsch: dynamic programming (DP)
  - The generalization to multiple sequence alignment requires CPU time , where is the number of sequences each with length .
- Heuristic methods
  - progressive methods and iterative refinement methods
  - combinations of successive two-dimensional DP, taking CPU time

Problems
- The accuracy of resulting alignments is greatly affected by the scoring system -> biologically correct?
- Cost of CPU time -> homology search programs

Group-to-group alignments by FFT

An amino acid is assigned to a vector composed of the volume value and the polarity value . Use the normalized values: and .

Calculation of correlation

The correlation between two sequences as
where and are the correlations of volume component and polarity component. represents the degreee of similarity of two sequences with the positional lag of sites.

The correlation of volume component is defined as
where and are the volume component of the th site of sequence 1 with the length of and sequence 2 with the length of (). It takes operations.

Evaluating the Fourier transform directly takes time.
E.g., w.l.o.g., consider ,
There are outputs , and each output requires a sum of terms. FFT can compute the same results in .

The correlation of polarity component
is calculated similarly.

Finding homologous segments

If two sequences compared have homologous regions, the correlation has some peaks corresponding to these regions.
Now we know only the positional lag of a homologous region in two sequences but not the position of the region.
A sliding window analysis with the window size of 30 sites is carried out, in which the degree of local homologies is calculated for each of the highest 20 peaks in the correlation .
Identify a segment of 30 sites with score value exceeding a given threshold as a homologous segment.

Dividing a homology matrix

Arrange the homologous segments consistently in both sequence to obtain an alignment.
( is the number of homologous segments): score value of the homologous segments, if the th homologous segment on sequence 1 corresponds to the th homologous segment on sequence 2; otherwise .

Apply the standard DP procedure to matrix to obtain the optimal path -> optimal arrangement of homologous segments.
Overall homology matrix is divided into some sub-matrices at the boundary corresponding to the center of homologous segments.
- Shaded area excluded from the calculation.
- As many homologous segments detected by FFT, CPU time reduced.

homology matrix

Extension to group-to-group alignments

The equations of and are extended to group-to-group alignment by replacing with , which is the linear combination of the volume components of the members belonging to group 1:
where is the weighting factor for sequence .
Similarly, polarity component is calculated as

The method is applicable to nucleotide sequences by converting a sequence to a sequence of four-dimensional vectors whose components are the frequencies of A, T, C and G.
Correlation between two nucleotide sequence is

Scoring matrix

A normalized similarity matrix , where a and b are amino acids, that has both positive and negative values
where , is raw similarity matrix (200 PAM log-odds matrix by Jones et al.), is the frequency of occurrence of amino acid , and is a parameter that functions as a gap extension penalty.

Homology matrix between two amino acid sequences and is constructed from the similarity matrix as , where and are positions in sequences.

Homology matrix between group 1 and group 2 is calculated as
where indicates the th site of the th sequence in group 1, is the th site of the th sequence in group 2.

In the NW algorithm
, where , and . is the accumulated score for the optimal path from to , and is gap penalty.

Each group of sequences may contain the gaps already introduced at previous steps, and these new gaps should not be penalized. Because these new and existing gaps are probably resulting from a single insertion or deletion event.
where corresponds to a gap opening penalty, is the number of the gaps that starts at the th site, and is the number of the gaps that end at the th site.
and
where and , if the th site of sequence is a gap; otherwise and .

Computer programs

FFT-NS-1
- Input sequences are progressively aligned following the branch order of sequences in the guide tree
- Jones et al. with two modifications
- Distance between sequence and sequence as -> UPGMA method to construct the guide tree
FFT-NS-2
- Realign along the guide tree inferred from the alignment by FFT-NS-1
- More reliable alignments are obtained on the basis of more reliable guide trees
FFT-NS-i
- Tree-dependent restricted partitioning
- The process is repeated until no better scoring alignment is obtained

Reference:
Katoh, Kazutaka, et al. “MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.” Nucleic acids research 30.14 (2002): 3059-3066.