Reading: alignment and SNP calling algorithms
Review of alignment and SNP calling algorithms for next-generation sequencing data
2015, J Appl Genetics
- Bioinformatic tools for next-generation sequencing (NGS) data processing.
- Two of the most significant tasks:
- alignment to a reference genome -> suffix tries and hash tables. Suffix array-based aligners are memory efficient and work faster than hash-based aligners, but they are less accurate. Hash table algorithms tend to be slower but more sensitive.
- detection of single nucleotide polymorphisms (SNPs) -> heuristic and probabilistic methods. Due to the computational demands of heuristic methods, probabilistic methods are more commonly used.
Introduction
Existing NGS technologies provide a throughput that is at least 100 times larger than that obtained by Sanger sequencing, and the technologies are still improving. This has led to a tremendous increase in sequencing speed, making it possible to obtain complete genome data for any species within a short period.
A major advantage of whole genome sequencing (WGS) stems from the potential to detect genetic variants which mark individual deviations from the reference genome. These are not only SNPs, but also small insertions/deletions (InDels), as well as large structural variants (SV), e.g. copy number variations (CNVs).
DIFFICULTIES: NGS is a high-throughput technology and is therefore highly sensitive to technological errors, making the use of NGS data for research highly dependent on reliable bioinformatics tools. Two major steps: alignment and detection of SNPs.
Alignment to the reference genome
First computational step in NGS analysis: alignment of short reads to an existing reference genome or, alternatively, de novo assembly without a reference.
Indexing before alignment: most aligners construct indices for the reference sequence and/or for the short sequenced read data set. Indexing the reference genome is most common because it is advantageous in terms of computational time.
Two main indexing algorithms: suffix/prefix tries or hash tables, incorporated in the vast majority of software tools, although exceptions exist, e.g. the Slider software adopts merge sorting.
Suffix/prefix tries
- suffix/prefix trie
A data structure representing the set of suffixes of a given string of characters, enabling fast matching of a string (in our case, a sequence read) against the reference genome. However, it is not efficient enough on its own, leading to very long computing times for very large NGS data sets.
- enhanced suffix array
More efficient and enables storing large genomes in computer memory. It is composed of a suffix array and some auxiliary arrays.
- FM-index
For BWT and the FM-index, see Hc’s Blog: Short Read Alignment.
Combines the Burrows-Wheeler transform (BWT) with auxiliary data structures. A suffix array built on the BWT-transformed sequence is significantly more efficient than one built on the primary sequence. The most popular software tools implementing the FM-index algorithm with BWT are BWA, Bowtie, Bowtie2, SOAP3 and SOAP3-dp (the latter using 2-BWT, with some modifications).
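A minimal BWT/FM-index sketch on a toy reference, to make the backward-search idea concrete; the reference string and the plain (uncompressed) occurrence tables are illustrative assumptions, whereas tools like BWA and Bowtie use compressed, sampled structures and also support inexact matching.

```python
# Toy BWT/FM-index backward search (illustrative only).

def suffix_array(text):
    """Naive suffix array; fine for a toy reference."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt(text, sa):
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    return "".join(text[i - 1] for i in sa)

def fm_index(text):
    text += "$"                      # unique terminator, smaller than any base
    sa = suffix_array(text)
    last = bwt(text, sa)
    # C[c] = number of characters in the text that are lexicographically < c
    counts = {c: last.count(c) for c in set(last)}
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in last[:i]
    occ = {c: [0] * (len(last) + 1) for c in counts}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, C, occ

def backward_search(pattern, sa, C, occ):
    """Return reference positions of exact occurrences of pattern."""
    lo, hi = 0, len(sa)              # current suffix-array interval
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

reference = "ACGTACGTTACG"
sa, C, occ = fm_index(reference)
print(backward_search("ACG", sa, C, occ))   # -> [0, 4, 9]
```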
Hash tables
For the basic concept of hashing, see Hc’s Blog: Hashing.
Seed-and-extend method: a query sequence is divided into words called k-mers, and the position of each k-mer is kept in a hash table. The k-mers are then sought in another sequence, and their exact matches are termed seeds. Afterwards, seeds are extended and joined without gaps, and the resulting alignment is refined using the Smith-Waterman algorithm.
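A minimal sketch of the hashing step on a toy reference, using plain contiguous k-mers; the sequences and the k value are arbitrary assumptions, and the refinements listed below (spaced seeds, q-gram filtering, accelerated seed extension) build on this basic idea.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Hash table mapping each k-mer to its start positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def find_seeds(read, index, k):
    """Exact k-mer matches (seeds) as (read offset, reference position) pairs."""
    seeds = []
    for i in range(len(read) - k + 1):
        for pos in index.get(read[i:i + k], []):
            seeds.append((i, pos))
    return seeds

reference = "ACGTACGTTACGGATTACA"
index = build_kmer_index(reference, k=4)
print(find_seeds("TTACGGAT", index, k=4))
# Each seed would then be extended and refined with Smith-Waterman.
```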
- spaced seed
- q-gram filter and multiple seed hits
- seed extension
The first two rely on fast lookup in a hash table, while seed extension is based on accelerating the standard Smith-Waterman alignment algorithm.
Alignment
Regardless of the indexing method, the actual alignment is performed using either the Smith-Waterman or the Needleman-Wunsch algorithm.
Alignment gaps usually result from small-scale genome rearrangements, such as InDels. Thus, allowing for gaps in the alignment is preferred.
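A compact Smith-Waterman sketch with a linear gap penalty; the scoring values are arbitrary choices for illustration, and production aligners use affine gap penalties and heavily optimised implementations.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty; returns the best local score."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if seq1[i - 1] == seq2[j - 1] else mismatch)
            # Local alignment never drops below zero.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# A read with a one-base deletion relative to the reference still aligns well:
print(smith_waterman("ACGTTAGC", "ACGTAGC"))
```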
Currently, all commonly used alignment programs also support a paired-end alignment mode, in which reads sequenced in both the forward and reverse orientations are considered simultaneously as a pair, providing a longer stretch of sequence to be aligned and thus significantly improving alignment quality. See Hc’s Blog: Illumina Sequencing (paired-end).
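A rough sketch of how paired-end information constrains alignment: a candidate pair is accepted as "proper" only if the forward mate lies upstream of the reverse mate and the implied fragment length is plausible; the insert-size limits and read length used here are assumed values.

```python
def is_proper_pair(pos_fwd, pos_rev, read_len, min_insert=100, max_insert=600):
    """Forward mate upstream of reverse mate, with a plausible fragment length."""
    if pos_fwd > pos_rev:
        return False
    insert = (pos_rev + read_len) - pos_fwd   # outer distance between the mates
    return min_insert <= insert <= max_insert

print(is_proper_pair(pos_fwd=10_000, pos_rev=10_300, read_len=100))  # True
print(is_proper_pair(pos_fwd=10_000, pos_rev=60_000, read_len=100))  # False
```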
Post-alignment processing
- output file format conversion
Reduces the output data set size and guarantees downstream compatibility with variant callers.
- create reports from the alignment process
A simple summary including the total number of aligned reads, the number of properly aligned read pairs, or the number of reads aligned exactly once.
- remove polymerase chain reaction (PCR) artefacts
Identical read pairs occurring many times in the raw data distort the estimate of the true coverage of the reference genome (the samtools rmdup tool is one example of a deduplication utility).
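The idea behind duplicate removal, sketched on plain tuples rather than real BAM records: read pairs sharing the same outer coordinates are treated as PCR duplicates and only the best-scoring one is kept. The tuple layout and the mapping-quality tie-breaking are assumptions for illustration; samtools rmdup operates on actual alignments.

```python
# (read name, chromosome, leftmost pos of mate 1, rightmost pos of mate 2, mapping quality)
pairs = [
    ("r1", "chr1", 1000, 1450, 60),
    ("r2", "chr1", 1000, 1450, 37),   # same coordinates as r1 -> PCR duplicate
    ("r3", "chr1", 2000, 2480, 60),
]

best = {}
for name, chrom, start, end, mapq in pairs:
    key = (chrom, start, end)                 # duplicates share outer coordinates
    if key not in best or mapq > best[key][4]:
        best[key] = (name, chrom, start, end, mapq)

deduplicated = sorted(best.values())
print([p[0] for p in deduplicated])           # ['r1', 'r3']
```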
SNP calling
Pre-variant-calling processing involves correction of alignment artefacts and recalibration of sequence quality scores. For reliable variant detection, a high depth of coverage is a prerequisite, since a large number of reads aligned to each base is needed to differentiate between sequencing errors and true polymorphisms.
SNP calling may be defined as the process of identifying sites differing from a reference sequence, while genotype calling refers to the estimation of genotypes. They are based on either heuristic or probabilistic methods.
Heuristic methods
- Based on multiple information sources associated with the structure and quality of data.
- High computational demands -> less commonly used.
- For variant detection, a heuristic algorithm determines the genotype based on thresholds for coverage (minimum 33), base quality (minimum 20) and variant allele frequency (minimum 0.08).
- This is followed by a statistical test: Fisher’s exact test of the read counts supporting each allele, compared against the distribution expected from sequencing error alone.
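A sketch of this heuristic decision using the thresholds quoted above; the sequencing error rate and significance level are assumed values, and Fisher's exact test contrasts the observed allele counts with counts expected from errors alone.

```python
from scipy.stats import fisher_exact

def heuristic_call(ref_count, alt_count, mean_base_quality,
                   min_coverage=33, min_base_quality=20, min_vaf=0.08,
                   error_rate=0.01, alpha=0.05):
    """Call a variant if coverage, base quality and allele frequency thresholds
    are met and the allele counts deviate significantly from error expectation."""
    coverage = ref_count + alt_count
    vaf = alt_count / coverage if coverage else 0.0
    if coverage < min_coverage or mean_base_quality < min_base_quality or vaf < min_vaf:
        return False
    # Expected counts if every alternate read were a sequencing error.
    expected_alt = round(coverage * error_rate)
    table = [[ref_count, alt_count],
             [coverage - expected_alt, expected_alt]]
    odds_ratio, p_value = fisher_exact(table)
    return p_value < alpha

print(heuristic_call(ref_count=60, alt_count=40, mean_base_quality=35))  # True
print(heuristic_call(ref_count=97, alt_count=3, mean_base_quality=35))   # False (VAF below threshold)
```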
Probabilistic methods
- Based on genotype likelihood calculations combined with Bayes’ theorem.
- Provide measures of statistical uncertainty for called genotypes, making it possible to monitor the accuracy of genotype calling.
- Additional information concerning allele frequencies and linkage disequilibrium patterns may be included in the analysis.
In the Bayesian framework, the computed likelihood is combined with a prior genotype probability, which leads to a posterior probability of a genotype. As a result, the genotype with the highest posterior probability is chosen. The ratio between the highest and the second highest probabilities may be used as a measure of confidence.
A uniform prior probability may be selected as equal for all genotypes; alternatively, a non-uniform prior can be pre-imposed based on additional information, such as dbSNP (SNP database) entries, the reference sequence structures or features provided by the analysed sample.
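A minimal sketch of the Bayesian calculation for a diploid site with a uniform prior, assuming independent reads and a single per-base error rate; the ratio between the highest and second-highest posterior can serve as the confidence measure mentioned above.

```python
def genotype_posteriors(ref_count, alt_count, error_rate=0.01, priors=None):
    """Posterior probabilities of the three diploid genotypes at one site."""
    if priors is None:                       # uniform prior over RR, RA, AA
        priors = {"RR": 1 / 3, "RA": 1 / 3, "AA": 1 / 3}
    # P(alternate base observed | genotype)
    p_alt = {"RR": error_rate, "RA": 0.5, "AA": 1 - error_rate}
    likelihood = {
        g: (p_alt[g] ** alt_count) * ((1 - p_alt[g]) ** ref_count)
        for g in priors
    }
    unnorm = {g: likelihood[g] * priors[g] for g in priors}
    total = sum(unnorm.values())
    return {g: unnorm[g] / total for g in unnorm}

post = genotype_posteriors(ref_count=12, alt_count=9)
call = max(post, key=post.get)               # genotype with the highest posterior
print(call, {g: round(p, 4) for g, p in post.items()})
```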
Single- vs. multi-sample SNP calling
- Single
Advantage: fast.
Disadvantage: difficult to draw inferences on a population-wide level; requires custom-written programs.
- Multi
Advantage: priors may be improved based on information about allele frequencies and on verification of whether the obtained genotypes follow Hardy-Weinberg equilibrium; the latter is only useful if unrelated individuals are considered.
Disadvantage: much more CPU time- and resource-consuming than individual variant calling.
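A sketch of how multi-sample data can sharpen the prior: the alternate allele frequency is estimated from provisional per-sample calls and converted into Hardy-Weinberg genotype priors (valid only for unrelated individuals, as noted above); the genotype labels follow the sketch in the previous section.

```python
def hwe_priors(genotype_calls):
    """Estimate the alternate allele frequency from provisional calls across
    samples and return Hardy-Weinberg genotype priors."""
    alt_alleles = sum({"RR": 0, "RA": 1, "AA": 2}[g] for g in genotype_calls)
    q = alt_alleles / (2 * len(genotype_calls))   # alternate allele frequency
    p = 1 - q
    return {"RR": p * p, "RA": 2 * p * q, "AA": q * q}

# Provisional single-sample calls at one site across ten samples:
calls = ["RR", "RR", "RA", "RR", "RA", "RR", "RR", "AA", "RR", "RA"]
print(hwe_priors(calls))
# These priors could replace the uniform prior in genotype_posteriors() above.
```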
Technical issues
Data formats
| Format | Data |
| --- | --- |
| FASTQ | short sequence reads |
| FASTA | reference genomes |
| SAM | output of alignment |
| BAM | binary form of SAM |
| VCF | output of variant calling |
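For orientation, a minimal reader for the four-line FASTQ record layout (identifier, sequence, separator, per-base qualities encoded as Phred+33); real pipelines would rely on established parsers instead.

```python
def read_fastq(path):
    """Yield (read id, sequence, quality string) from an uncompressed FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:                      # end of file
                return
            sequence = handle.readline().rstrip()
            handle.readline()                   # '+' separator line
            qualities = handle.readline().rstrip()
            yield header[1:], sequence, qualities

def phred(quality_char):
    """Phred quality of a base = ASCII code of its quality character minus 33."""
    return ord(quality_char) - 33
```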
Computing platforms and operating systems
As sequencing costs decrease, data processing costs rise, and storing and processing very large amounts of data becomes a challenge.
- large data sets -> memory requirements.
- computationally demanding -> supercomputing infrastructure involving many CPUs.
- accelerate NGS processing -> multi-threaded mode.
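A toy illustration of the multi-threaded idea using Python's multiprocessing, with GC content as a stand-in per-read workload; real NGS tools expose a thread-count option and parallelise internally.

```python
from multiprocessing import Pool

def gc_content(read):
    """Stand-in per-read computation."""
    return (read.count("G") + read.count("C")) / len(read)

if __name__ == "__main__":
    reads = ["ACGT" * 25, "GGGCCC" * 10, "ATATAT" * 20]
    with Pool(processes=4) as pool:            # one worker process per CPU core
        results = pool.map(gc_content, reads)  # reads processed in parallel
    print(results)
```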
Notes
For less experienced users, SAMtools’ mpileup and bcftools are recommended because of their ease of use. On the other hand, GATK, which comes with a detailed user manual, is robust towards data artefacts. Nevertheless, both programs are frequently updated, provide detailed log files describing the process of SNP calling, and the agreement between their outputs is high.