MAFFT --multiple sequence alignment based on FFT
- Homologous regions are rapidly identified by the fast Fourier transform (FFT)
- An amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue
- Simplified scoring system that perform well for reducing CPU time and increasing the accuracy of alignments
- For sequences having large insertions or extensions as well as distantly related sequences of similar length
- History
- Needleman and Wunsch: dynamic programming (DP)
- The generalization to multiple sequence alignment requires CPU time
, where is the number of sequences each with length .
- The generalization to multiple sequence alignment requires CPU time
- Heuristic methods
- progressive methods and iterative refinement methods
- combinations of successive two-dimensional DP, taking CPU time
- Needleman and Wunsch: dynamic programming (DP)
- Problems
- The accuracy of resulting alignments is greatly affected by the scoring system -> biologically correct?
- Cost of CPU time -> homology search programs
Group-to-group alignments by FFT
An amino acid
Calculation of correlation
The correlation
The correlation
Evaluating the Fourier transform directly takes
E.g., w.l.o.g., consider
The correlation
Finding homologous segments
If two sequences compared have homologous regions, the correlation
Now we know only the positional lag
A sliding window analysis with the window size of 30 sites is carried out, in which the degree of local homologies is calculated for each of the highest 20 peaks in the correlation
Identify a segment of 30 sites with score value exceeding a given threshold as a homologous segment.
Dividing a homology matrix
Arrange the homologous segments consistently in both sequence to obtain an alignment.
- Apply the standard DP procedure to matrix
to obtain the optimal path -> optimal arrangement of homologous segments. - Overall homology matrix is divided into some sub-matrices at the boundary corresponding to the center of homologous segments.
- Shaded area excluded from the calculation.
- As many homologous segments detected by FFT, CPU time reduced.
Extension to group-to-group alignments
The equations of
Similarly, polarity component is calculated as
The method is applicable to nucleotide sequences by converting a sequence to a sequence of four-dimensional vectors whose components are the frequencies of A, T, C and G.
Correlation between two nucleotide sequence is
Scoring matrix
A normalized similarity matrix
Homology matrix
Homology matrix between group 1 and group 2 is calculated as
In the NW algorithm
Each group of sequences may contain the gaps already introduced at previous steps, and these new gaps should not be penalized. Because these new and existing gaps are probably resulting from a single insertion or deletion event.
Computer programs
- FFT-NS-1
- Input sequences are progressively aligned following the branch order of sequences in the guide tree
- Jones et al. with two modifications
- Distance
between sequence and sequence as -> UPGMA method to construct the guide tree
- FFT-NS-2
- Realign along the guide tree inferred from the alignment by FFT-NS-1
- More reliable alignments are obtained on the basis of more reliable guide trees
- FFT-NS-i
- Tree-dependent restricted partitioning
- The process is repeated until no better scoring alignment is obtained
Reference:
Katoh, Kazutaka, et al. “MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.” Nucleic acids research 30.14 (2002): 3059-3066.