Local Alignment (BLAST) and Statistics

Posted on 2020-02-29 Edited on 2021-10-24

classical sequencing

ribonucleotide: a nucleotide containing ribose as its pentose component.

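deoxyribonucleotide (deoxy-): the monomer, or single unit, of DNA (deoxyribonucleic acid). Its sugar lacks the 2’-OH group of ribose.

dideoxynucleotide (dideoxy-): a modified deoxynucleotide that also lacks the 3’-OH group. That group is needed for the next nucleotide to attach to a growing polynucleotide chain during DNA synthesis.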

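Give DNA polymerase a template and some dideoxynucleotides. Once a dideoxynucleotide is incorporated, the polymerase won’t be able to extend because there’s no 3’-OH.

The dideoxynucleotides are added at a much lower concentration than the normal dNTPs, so different copies terminate at different positions.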

Traditional Sanger / chain termination sequencing (70s, 80s, 90s)

Large polyacrylamide gels, radiolabeled DNA, 4 lanes per read
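
As a rough illustration of why four lanes are enough to read a sequence, here is a toy simulation in C. It assumes ideal chemistry (every position terminates in some copy, no gel artifacts). The names `lane_fragments`, `read_gel`, and `MAX_FRAGS` are made up for this sketch and are not from the lecture.

```c
#include <string.h>

#define MAX_FRAGS 256  /* arbitrary cap for this toy example */

/* Lengths of the chain-terminated fragments seen in one lane.
   `synth` is the strand being synthesized (5' to 3') and `dd_base` is the
   dideoxy base added to that lane. A fragment of length i + 1 shows up
   whenever position i carries that base. Returns the fragment count. */
int lane_fragments(const char *synth, char dd_base, int *lengths, int max)
{
    int n = 0;
    for (int i = 0; synth[i] != '\0' && n < max; i++)
        if (synth[i] == dd_base)
            lengths[n++] = i + 1;
    return n;
}

/* Read the gel: the lane holding a fragment of length k tells us base k of
   the synthesized strand. `out` must have room for strlen(synth) + 1 chars.
   Positions that no lane explains are left as '?'. */
void read_gel(const char *synth, char *out)
{
    static const char lanes[4] = { 'A', 'C', 'G', 'T' };
    int lengths[MAX_FRAGS];
    size_t len = strlen(synth);

    memset(out, '?', len);
    out[len] = '\0';
    for (int l = 0; l < 4; l++) {
        int n = lane_fragments(synth, lanes[l], lengths, MAX_FRAGS);
        for (int k = 0; k < n; k++)
            out[lengths[k] - 1] = lanes[l];
    }
}
```

Calling `lane_fragments("GATTACA", 'A', lengths, MAX_FRAGS)` fills `lengths` with 2, 5, 7, which are the bands you would expect in the ddA lane for that strand.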

Fluorescent-based / dye terminator sequencing (90s - present)

Instead of radiolabeling the primer, you put a fluorophore on the dideoxynucleotides.
Instead of a big gel, the gel was shrunk down to thin capillaries with a reader at the bottom: capillary sequencing.

next-gen or second-gen sequencing

A variety of technologies. Differ in aspects of:

DNA template
Modified nucleotides used
Imaging / image analysis

454 (Roche):
Based on emulsion PCR. Little beads have adapter DNA molecules covalently attached. You incubate the beads with DNA and make an emulsion.
Illumina/Solexa sequencing:
Done on the surface of a flow cell. Again, you start with a single molecule of template. The flow cell has two types of adapters covalently attached. The template anneals to one of these adapters, and you extend the adapter molecule with dNTPs and polymerase -> bridge amplification

In a third, single-molecule approach (this description matches Pacific Biosciences sequencing), the DNA polymerase is covalently attached to the surface and the template is threaded into the polymerase. This is a phage polymerase that’s highly processive and strand displacing.

In 454, you measure luciferase activity (light). In Illumina, you measure fluorescence, with four different fluorescent tags.

The most common type of error in 454 is insertions and deletions, whereas in Illumina sequencing it’s substitutions.

Local alignment

find shorter stretches of high similarity
don’t require alignment of whole sequence

Matching sequences up and finding the individual bases or amino acid residues that match is useful for:
-> assembly
-> looking at homologs

BLAST

Algorithm

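Scoring system: give each aligned pair a score (positive for a match, negative for a mismatch) and look for the segment with the top total score.

For a fixed ungapped alignment, that is exactly the MaxSubsequenceSum problem from a data structures course.

On-line algorithm (one pass over the scores):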

/* Maximum subsequence sum in one pass; an all-negative input yields 0. */
int MaxSubsequenceSum( const int  A[ ],  int  N )
{
	int  ThisSum, MaxSum, j;

	ThisSum = MaxSum = 0;
	for ( j = 0; j < N; j++ ) {
		ThisSum += A[ j ];            /* extend the current segment */
		if  ( ThisSum > MaxSum )
			MaxSum = ThisSum;     /* new best segment so far */
		else if ( ThisSum < 0 )
			ThisSum = 0;          /* a negative prefix can only hurt: restart */
	}  /* end for-j */
	return MaxSum;
}

Time complexity: a single pass is O(N); scoring a read against a sequence at every offset is O(MN).
M: read length, N: sequence length
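
To make the O(MN) figure concrete, here is a minimal sketch that runs the one-pass maximum subsequence sum along every diagonal (offset) between a query and a subject. The scoring values are parameters chosen by the caller, and `best_ungapped_score` is a hypothetical name for this example. It only handles ungapped alignments and is not how BLAST itself searches, since BLAST uses seeding heuristics to avoid visiting every cell.

```c
#include <string.h>

/* Best ungapped local alignment score between query q and subject s.
   Assumed scoring: `match` for identical characters, `mismatch` (a negative
   number) otherwise. Each diagonal is one maximum subsequence sum problem,
   and the diagonals together cover M * N cells, hence O(M * N) time. */
int best_ungapped_score(const char *q, const char *s, int match, int mismatch)
{
    int m = (int)strlen(q);
    int n = (int)strlen(s);
    int best = 0;

    /* d = (subject index) - (query index) identifies one diagonal. */
    for (int d = -(m - 1); d <= n - 1; d++) {
        int i = d < 0 ? -d : 0;   /* start position in the query   */
        int j = d < 0 ? 0 : d;    /* start position in the subject */
        int run = 0;              /* score of the current segment  */

        for (; i < m && j < n; i++, j++) {
            run += (q[i] == s[j]) ? match : mismatch;
            if (run > best)
                best = run;
            else if (run < 0)
                run = 0;          /* restart after a negative prefix */
        }
    }
    return best;
}
```

With match = 1 and mismatch = -1, `best_ungapped_score("ACGT", "TTACGTTT", 1, -1)` returns 4, because the whole query matches on one diagonal.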

Determining significance of nucleotide local alignments

Identify high-scoring segments whose score $S$ exceeds a cutoff $x$ using a local alignment algorithm (e.g., BLAST).
Scores follow an extreme value (aka Gumbel) distribution:

$$P(S > x) = 1 - \exp\left[-KMN e^{-\lambda x}\right]$$

For sequences/databases of length $M$, $N$, where $K$, $\lambda$ are positive.

$\lambda$ is the unique positive solution to the equation:

$$\sum_{i,j} p_i r_j e^{\lambda s_{ij}} = 1$$

$p_i$ = freq. of nt $i$ in query, $r_j$ = freq. of nt $j$ in subject
$s_{ij}$ = score for aligning an $i,j$ pair
the nature of $\lambda$: scaling factor

Figuring out how to choose the mismatch penalty …
Optimal mismatch penalty for given target identity fraction

Examples:

$r$	0.75	0.95	0.99
$m$	-1	-2	-3

$r$ = expected fraction of identities in high-scoring BLAST hits
$m$ = mismatch penalty

Reference:
Christopher Burge, David Gifford, and Ernest Fraenkel. *7.91J Foundations of Computational and Systems Biology.* Spring 2014. Massachusetts Institute of Technology: MIT OpenCourseWare, https://ocw.mit.edu. License: Creative Commons BY-NC-SA.