Reading--SNP calling from HTS
A beginner's guide to SNP calling from high-throughput DNA-sequencing data
2012, Human Genetics
- Objective: identify genetic variants such as single nucleotide polymorphisms (SNPs) from high-throughput DNA sequencing (HTS) data.
- Pipeline: (1) quality control; (2) mapping of short reads to the reference genome; (3) visualization and post-processing of the alignment, including base quality recalibration; (4) SNP calling, followed by filtering of SNP candidates.
workflow of the SNP calling pipeline
Step1: Base Calling
Input: Images from Sequencer
Tools: Internal HTS System Software
Output: Base- or Color- Sequence and Quality Scores -> e.g. fastq
Step2: Quality Control
Input: Read Data -> e.g. fastq
Tools: SolexaQA, FastQC, PRINSEQ
Output: Quality Report and Filtered Reads -> e.g. fastq
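As a rough illustration of the read filtering in Step 2, the sketch below decodes FASTQ quality characters into Phred scores and trims low-quality bases from the 3' end of a read. It assumes Phred+33 encoding; the function names and the threshold of 20 are illustrative choices, not taken from the paper or from any of the tools listed above.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def trim_3prime(sequence, quality_string, min_q=20, offset=33):
    """Drop trailing bases until the last remaining base has quality >= min_q."""
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality_string[:end]


if __name__ == "__main__":
    seq = "ACGTACGTAC"
    qual = "IIIIIIII#!"  # 'I' = Q40, '#' = Q2, '!' = Q0 under Phred+33
    print(trim_3prime(seq, qual))
```

Dedicated QC tools typically use more robust strategies (for example sliding windows over the quality scores); this only shows the basic idea.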
Step3: Alignment/Mapping
Input: Filtered Read Data -> e.g. fastq
Tools: BWA, MAQ, Stampy, Bowtie, SHRiMP2, bfast
Output: SAM, BAM and Mapping Statistics
Step4: Alignment Post-Processing
Input: SAM/BAM
Tools: samtools, Picard, SRMA, GATK
Output: SAM/BAM
Step5: Quality Score Recalibration
Input: SAM/BAM
Tools: SOAPsnp, GATK
Output: SAM/BAM
Step6: Variant and Genotype Calling
Input: SAM/BAM
Tools: SOAPsnp, MAQ, samtools, GATK, Beagle
Output: vcf
Step7: Filtering SNP Candidates
Input: vcf
Tools: GATK, samtools, VCFtools
Output: vcf
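To make Step 7 more concrete, here is a minimal sketch of filtering SNP candidates in VCF text by call quality. It relies only on the standard tab-separated VCF columns (CHROM, POS, ID, REF, ALT, QUAL, ...); the function name and the QUAL threshold of 30 are assumptions for illustration and do not reproduce the filters of GATK, samtools, or VCFtools.

```python
def filter_snps(vcf_lines, min_qual=30.0):
    """Yield header lines unchanged, plus SNP records with QUAL >= min_qual."""
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Meta-information and column header lines are passed through.
            yield line
            continue
        fields = line.split("\t")
        ref, alt, qual = fields[3], fields[4], fields[5]
        # A SNP here means: single-base REF and only single-base ALT alleles.
        is_snp = len(ref) == 1 and all(
            len(a) == 1 and a in "ACGT" for a in alt.split(",")
        )
        if not is_snp or qual == ".":
            continue  # skip indels and records without a quality value
        if float(qual) >= min_qual:
            yield line


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t100\t.\tA\tG\t50.0\t.\t.",
        "chr1\t200\t.\tC\tT\t10.0\t.\t.",
        "chr1\t300\t.\tAT\tA\t99.0\t.\t.",
    ]
    for record in filter_snps(example):
        print(record)
```

In practice, filters usually combine several criteria (coverage, strand bias, mapping quality); QUAL alone is used here only to keep the sketch short.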
Notes
- Compared to Sanger sequencing, HTS produces many more sequences, but they are shorter and of lower quality.
- The limitation of short read length can be circumvented with protocols that generate read pairs with a known distance between the two reads (insert length) and a known orientation relative to the reference sequence -> paired-end sequencing or mate pairs.
- HTS can be used to:
  - determine the nucleotide sequence of a target region and identify SNPs
  - investigate larger structural variants such as inversions, deletions, and insertions (facilitated by paired reads)
  - sequence cDNA converted from mRNA in order to identify novel transcripts and splice variants, and to quantify expression levels, even of lowly expressed genes
- The per-base error probability increases along the read, so longer reads carry more errors towards their end -> read trimming removes bases at the end of a read that are likely to contain sequencing errors, which increases the number of mappable reads.
- Two algorithmic approaches used in alignment/mapping:
  - Burrows-Wheeler transform (BWT): a technique from data compression that yields a compact, searchable index of the reference
  - hashing: a hash table gives quick access to the locations of subsequences in the reference sequence
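The two indexing ideas above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any of the listed aligners is implemented: the BWT is built naively by sorting all rotations (real BWT-based aligners such as BWA and Bowtie use far more efficient index structures), and the hash index simply stores the positions of every k-mer. The function names, the `$` sentinel, and the choice of k are all illustrative.

```python
from collections import defaultdict


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (naive construction)."""
    text = text + "$"  # sentinel; assumed absent from text, sorts before A/C/G/T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def kmer_index(reference, k):
    """Hash table mapping each k-mer to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_hits(read, index, k):
    """Candidate reference positions where the read's first k-mer occurs."""
    return index.get(read[:k], [])


if __name__ == "__main__":
    reference = "ACGTACGT"
    print(bwt(reference))
    idx = kmer_index(reference, 4)
    print(seed_hits("ACGTTT", idx, 4))
```

The hash lookup returns candidate positions only; an aligner would then extend and score each candidate against the full read.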