Reading--SNP calling from HTS

A beginners guide to SNP calling from high-throughput DNA-sequencing data

2012, Human Genetics

  • The objective is to identify genetic variants such as single nucleotide polymorphisms (SNPs) from high-throughput DNA sequencing (HTS) data.
  • Pipeline: (1) quality control; (2) mapping of short reads to the reference genome; (3) visualization and post-processing of the alignment, including base quality recalibration; (4) SNP calling, followed by filtering of SNP candidates.

Workflow of the SNP calling pipeline

Step1: Base Calling

Input: Images from Sequencer
Tools: Internal HTS System Software
Output: Base- or Color- Sequence and Quality Scores -> e.g. fastq
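
As an illustration of the fastq output, here is a minimal sketch of reading records and decoding their quality scores. It assumes the common four-line record layout and a Phred offset of 33 (Sanger / Illumina 1.8+ encoding; older Illumina pipelines used 64). The function names are illustrative and not part of any tool listed in these notes.

```python
import io


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def parse_fastq(handle, offset=33):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream.

    offset=33 is an assumption about the quality encoding of the file.
    """
    while True:
        header = handle.readline().strip()
        if not header:
            break  # end of file (or a blank line)
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored here
        qual = handle.readline().strip()
        yield header[1:], seq, [ord(c) - offset for c in qual]


# A made-up single-record example
example = io.StringIO("@read1\nACGT\n+\nII5#\n")
records = list(parse_fastq(example))
```

With the offset of 33, the quality string `II5#` decodes to the scores 40, 40, 20 and 2, and a score of 20 corresponds to an error probability of 0.01.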

Step2: Quality Control

Input: Read Data -> e.g. fastq
Tools: SolexaQA, FastQC, PRINSEQ
Output: Quality Report and Filtered Reads -> e.g. fastq
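
The filtering part of this step can be pictured with a very simple 3'-end trimmer. This is only a sketch: the thresholds `min_q` and `min_len` are hypothetical defaults, and the listed QC tools use more refined criteria (for example running-sum or sliding-window schemes).

```python
def trim_3prime(seq, quals, min_q=20):
    """Drop bases from the 3' end until a base with quality >= min_q is reached."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def keep_read(seq, min_len=30):
    """Discard reads that became too short to map reliably (min_len is an assumption)."""
    return len(seq) >= min_len


# Made-up read whose last two bases have low quality
trimmed_seq, trimmed_quals = trim_3prime("ACGTAC", [38, 37, 35, 30, 12, 5])
```

Only the trailing low-quality bases are removed; low-quality bases in the middle of a read are left untouched by this scheme.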

Step3: Alignment/Mapping

Input: Filtered Read Data -> e.g. fastq
Tools: BWA, MAQ, Stampy, Bowtie, SHRiMP2, bfast
Output: SAM, BAM and Mapping Statistics
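
One way to picture how a hash-based mapper finds candidate locations for a read is a k-mer lookup table over the reference. The sketch below is a toy version under stated assumptions: it uses only the first k bases of the read as a seed, allows mismatches but no indels, and all function names are illustrative rather than taken from any listed tool.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_positions(read, index, k):
    """Look up the first k bases of the read (the seed) in the index."""
    return index.get(read[:k], [])


def count_mismatches(read, reference, pos):
    """Compare the read to the reference at pos; None if it runs off the end."""
    window = reference[pos:pos + len(read)]
    if len(window) < len(read):
        return None
    return sum(a != b for a, b in zip(read, window))


reference = "ACGTACGTTAGC"  # made-up reference
index = build_kmer_index(reference, k=4)
hits = candidate_positions("ACGTTA", index, k=4)
scores = {pos: count_mismatches("ACGTTA", reference, pos) for pos in hits}
```

Here the seed `ACGT` occurs at positions 0 and 4; verifying the full read gives 2 mismatches at position 0 and a perfect match at position 4, so position 4 would be reported.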

Step4: Alignment Post-Processing

Input: SAM/BAM
Tools: samtools, Picard, SRMA, GATK
Output: SAM/BAM

Step5: Quality Score Recalibration

Input: SAM/BAM
Tools: SOAPsnp, GATK
Output: SAM/BAM
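
The general idea behind recalibration can be sketched as follows: group the bases by their reported quality, count how often they disagree with the reference at positions not believed to be variant, and replace the reported score with the empirically observed one. This is a heavily simplified sketch and not the implementation of GATK or SOAPsnp; real recalibration conditions on further covariates such as the position in the read, and the pseudocounts used here are an assumption.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Build a table reported_quality -> empirical quality.

    observations: iterable of (reported_q, is_mismatch) pairs collected at
    sites assumed to be non-variant.
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [bases seen, mismatches]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += 1
        counts[reported_q][1] += int(is_mismatch)

    table = {}
    for reported_q, (n_bases, n_mismatches) in counts.items():
        # +1 / +2 pseudocounts avoid log10(0) when no mismatch was observed
        error_rate = (n_mismatches + 1) / (n_bases + 2)
        table[reported_q] = round(-10 * math.log10(error_rate))
    return table


# Made-up data: 998 bases reported as Q30, 9 of them mismatching the reference
observations = [(30, False)] * 989 + [(30, True)] * 9
recal_table = empirical_qualities(observations)
```

In this made-up example the bases reported as Q30 show an error rate of about 1 in 100, so they would be recalibrated to Q20.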

Step6: Variant and Genotype Calling

Input: SAM/BAM
Tools: SOAPsnp, MAQ, samtools, GATK, Beagle
Output: vcf
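
Genotype calling at a single site can be illustrated with a naive likelihood model. The sketch assumes a diploid sample, one known alternative allele, independent base errors derived from Phred scores greater than 0, and no prior over genotypes; the listed callers use considerably richer models, and the function names are illustrative.

```python
import math


def genotype_log_likelihoods(pileup, ref, alt):
    """Log-likelihoods of the genotypes RR, RA and AA at one site.

    pileup: list of (base, phred_quality) pairs observed at the site;
    qualities must be > 0 so that no probability becomes zero.
    """
    def p_base(base, allele, err):
        # An erroneous call is assumed to be any of the 3 other bases equally often
        return 1 - err if base == allele else err / 3

    loglik = {"RR": 0.0, "RA": 0.0, "AA": 0.0}
    for base, q in pileup:
        err = 10 ** (-q / 10)
        p_ref = p_base(base, ref, err)
        p_alt = p_base(base, alt, err)
        loglik["RR"] += math.log(p_ref)
        loglik["RA"] += math.log(0.5 * p_ref + 0.5 * p_alt)
        loglik["AA"] += math.log(p_alt)
    return loglik


def call_genotype(pileup, ref, alt):
    """Return the genotype with the highest likelihood."""
    loglik = genotype_log_likelihoods(pileup, ref, alt)
    return max(loglik, key=loglik.get)


# Made-up pileup: five reads support A, five support G, all at Q30
het_call = call_genotype([("A", 30)] * 5 + [("G", 30)] * 5, ref="A", alt="G")
```

With an even split of high-quality reads between the two alleles, the heterozygous genotype `RA` is by far the most likely one under this model.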

Step7: Filtering SNP Candidates

Input: vcf
Tools: GATK, samtools, VCFtools
Output: vcf
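
A hard filter on SNP candidates can be sketched directly on the tab-separated VCF columns. The thresholds `min_qual` and `min_depth` are hypothetical, and the sketch assumes that read depth is stored under the `DP` key of the INFO column; in practice the filtering tools listed above would be used.

```python
def parse_info(info_field):
    """Turn 'DP=25;AF=0.5' into {'DP': '25', 'AF': '0.5'}; flags map to ''."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    return info


def passes_filters(vcf_line, min_qual=30.0, min_depth=10):
    """Keep a record only if QUAL and depth reach the (hypothetical) thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]              # column 6: QUAL
    info = parse_info(fields[7])  # column 8: INFO
    if qual == ".":
        return False              # missing quality: treated as failing
    depth = int(info.get("DP", 0))
    return float(qual) >= min_qual and depth >= min_depth


def filter_vcf(lines, **thresholds):
    """Yield header lines unchanged and only those records that pass."""
    for line in lines:
        if line.startswith("#") or passes_filters(line, **thresholds):
            yield line


# Made-up records
vcf_lines = [
    "##fileformat=VCFv4.1",
    "chr1\t100\t.\tA\tG\t50.0\t.\tDP=25;AF=0.5",
    "chr1\t200\t.\tC\tT\t12.3\t.\tDP=40",
    "chr1\t300\t.\tG\tA\t99\t.\tDP=4",
]
kept = list(filter_vcf(vcf_lines))
```

Only the header and the first record survive: the second record fails on quality and the third on depth.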

Notes

  • Compared to Sanger sequencing, HTS produces many more reads, but they are shorter and of lower quality.
  • The limitation of short read length can be mitigated by protocols that generate read pairs with a known distance between the two reads (insert length) and a known orientation relative to the reference sequence -> paired-end sequencing or mate pairs.
  • Applications of HTS:
    • determine the nucleotide sequence of a target region and identify SNPs
    • investigate larger structural variants such as inversions, deletions, and insertions, which paired reads facilitate
    • sequence cDNA converted from mRNA (RNA-seq) to identify novel transcripts and splice variants, and to quantify expression levels, even of lowly expressed genes
  • The per-base error probability rises toward the end of a read, so it increases with read length -> read trimming removes bases at the end of the read that are likely to contain sequencing errors, which increases the number of mappable reads.
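  • Two indexing approaches used in alignment/mapping:
    • Burrows-Wheeler transform (BWT): a transform originally developed for data compression, which yields a compact, searchable index of the reference
    • hashing: a lookup table that gives quick access to the locations of subsequences in the reference sequence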
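
To make the BWT-based approach more concrete, here is a toy sketch of the transform and of exact-match counting by backward search, the mechanism behind FM-index style aligners. It builds the BWT by sorting all rotations and recomputes ranks on the fly, which is only workable for tiny inputs; real aligners use suffix-array construction and precomputed rank tables. The `$` sentinel and the function names are assumptions of this sketch.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (tiny inputs only)."""
    text += "$"  # sentinel; sorts before A, C, G and T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(pattern, bwt_string):
    """Count exact occurrences of pattern in the original text by backward search."""
    # first_col_start[c]: index of the first row starting with c in the sorted rotations
    first_col_start = {}
    for i, c in enumerate(sorted(bwt_string)):
        first_col_start.setdefault(c, i)

    def rank(c, i):
        # Occurrences of c in bwt_string[:i]; a real index precomputes this
        return bwt_string[:i].count(c)

    lo, hi = 0, len(bwt_string)  # current range of matching rows
    for c in reversed(pattern):
        if c not in first_col_start:
            return 0
        lo = first_col_start[c] + rank(c, lo)
        hi = first_col_start[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("GATTACA")  # made-up reference
n_hits = count_occurrences("TA", transformed)
```

The search consumes the pattern from its last character to its first and narrows a range of rows in the sorted rotations, so its cost depends on the pattern length rather than on scanning the reference.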