Reading--SNP calling from HTS

Posted on 2020-05-13 Edited on 2020-05-19

A beginner guide to SNP calling from high-throughput DNA-sequencing data

2012, Human Genetics

The objective is to identify genetic variants such as single nucleotide polymorphisms (SNPs) from high-throughput DNA sequencing (HTS) data.
Pipeline:
1. quality control
2. mapping of short reads to the reference genome
3. visualization and post-processing of the alignment, including base quality recalibration
4. SNP calling, along with filtering of SNP candidates

Input, tools, and output at each stage of the workflow, from base calling to SNP filtering:

Input: Images from Sequencer
Tools: Internal HTS system software
Output: Base- or Color- Sequence and Quality Scores -> e.g. fastq

Input: Read Data -> e.g. fastq
Tools: SolexaQA, FastQC, PRINSEQ
Output: Quality Report and Filtered Reads -> e.g. fastq
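
The quality scores in a fastq file can be turned into per-base error probabilities, which is what quality-control tools summarize. Below is a minimal sketch, assuming four-line fastq records and the Phred+33 encoding (older Illumina files may use an offset of 64). The function names and the example read are made up for illustration and are not taken from any of the tools listed above.

```python
def phred_to_error_prob(quality_string, offset=33):
    """Convert a FASTQ quality string into per-base error probabilities.

    Assumes Phred scaling: Q = -10 * log10(p), so p = 10 ** (-Q / 10).
    """
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]


def parse_fastq(lines):
    """Yield (read_id, sequence, quality_string) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = next(it).strip()
        next(it)  # the '+' separator line
        qual = next(it).strip()
        yield header[1:], seq, qual


# Hypothetical record: four bases at Q40, one at Q20, one at Q2.
example = ["@read1", "ACGTAC", "+", "IIII5#"]

for read_id, seq, qual in parse_fastq(example):
    probs = phred_to_error_prob(qual)
    expected_errors = sum(probs)  # expected number of miscalled bases
    print(read_id, seq, f"expected errors: {expected_errors:.3f}")
```

A filter could then discard reads whose expected number of errors exceeds some threshold; the choice of threshold is left open here.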

Input: Filtered Read Data -> e.g. fastq
Tools: BWA, MAQ, Stampy, Bowtie, SHRiMP2, BFAST
Output: SAM, BAM and Mapping Statistics

Input: SAM/BAM
Tools: samtools, Picard, SRMA, GATK
Output: SAM/BAM

Input: SAM/BAM
Tools: SOAPsnp, GATK
Output: SAM/BAM

Input: SAM/BAM
Tools: SOAPsnp, MAQ, samtools, GATK, Beagle
Output: vcf

Input: vcf
Tools: GATK, samtools, VCFtools
Output: vcf
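
The last stage (vcf in, vcf out) can be pictured as a pass over the variant records that keeps only those meeting some criteria. The sketch below is a toy illustration, not the filtering logic of GATK, samtools, or VCFtools: it keeps header lines and single-nucleotide records whose QUAL column reaches a threshold. The threshold of 30 and the example records are assumptions chosen for the example.

```python
def filter_snps(vcf_lines, min_qual=30.0):
    """Keep header lines and SNP records with QUAL >= min_qual.

    Relies on the fixed VCF column order:
    CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.
    """
    kept = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            kept.append(line)  # meta-information and column header
            continue
        fields = line.split("\t")
        ref, alt, qual = fields[3], fields[4], fields[5]
        # A SNP has a single-base REF and only single-base ALT alleles.
        is_snp = len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))
        if is_snp and qual != "." and float(qual) >= min_qual:
            kept.append(line)
    return kept


# Hypothetical records: a confident SNP, a low-quality SNP, and an insertion.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50.0\t.\t.",
    "chr1\t200\t.\tC\tT\t10.0\t.\t.",
    "chr1\t300\t.\tG\tGA\t99.0\t.\t.",
]

for line in filter_snps(vcf_lines):
    print(line)
```

Real filters typically also consider coverage, strand bias, and mapping quality, which this sketch ignores.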

Compared to Sanger sequencing, HTS produces many more sequences, but they are shorter and of lower quality.
The limitation of short read length can be mitigated by protocols that generate read pairs with a known distance between the two reads (the insert length) and a known orientation relative to the reference sequence -> paired-end sequencing or mate pairs.
HTS can be used to:
- determine the nucleotide sequence of a target region and identify SNPs
- investigate larger structural variants such as inversions, deletions, and insertions, which paired reads facilitate
- sequence cDNA converted from mRNA to identify novel transcripts and splice variants, and to quantify expression levels even of lowly expressed genes
The error probability increases with increasing read length, i.e. toward the end of a read -> read trimming is applied to increase the number of mappable reads by removing bases at the end of the read that are likely to contain sequencing errors.
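
One simple way to picture read trimming is to cut bases from the end of the read until a base of acceptable quality is reached. This is a minimal sketch of that idea, assuming Phred+33 quality strings and a hypothetical cutoff of Q20; actual trimmers often use more elaborate rules (sliding windows or running sums).

```python
def trim_read_end(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose Phred quality is below min_q.

    Returns the trimmed (sequence, quality_string) pair.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: the last base has quality Q2 and is removed,
# the Q20 base before it is kept.
trimmed_seq, trimmed_qual = trim_read_end("ACGTAC", "IIII5#")
print(trimmed_seq, trimmed_qual)
```

Reads that become too short after trimming would usually be dropped before mapping; the minimum length is a separate choice not shown here.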
Two algorithmic approaches used for alignment/mapping:
- Burrows-Wheeler transform (BWT): originally designed for efficient data compression; it yields a compact, searchable index of the reference
- hashing: quick access to the locations of subsequences in the reference sequence
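
Both ideas can be illustrated on a toy scale. The sketch below builds the BWT naively by sorting all rotations (real aligners use suffix arrays and an FM-index instead, which is not shown) and builds a hash table of k-mer positions for seed lookup. The function names, the "$" sentinel, and k = 3 are assumptions made for the example.

```python
from collections import defaultdict


def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations.

    The sentinel must not occur in text and must sort before all other symbols.
    """
    text = text + sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_positions(read, index, k):
    """Candidate reference positions where the read's first k-mer occurs."""
    return index.get(read[:k], [])


reference = "ACGTACGT"  # hypothetical toy reference
index = build_kmer_index(reference, k=3)

print(bwt(reference))
print(seed_positions("CGTA", index, k=3))
```

Each candidate position from the hash lookup would still need to be verified by aligning the full read there; the BWT-based approach instead searches the compressed index directly.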