de Bruijn graph
A de Bruijn graph is a compact representation based on short words (
A critical stage in genome sequencing is the assembly of shotgun reads, or piecing together fragments randomly extracted from the sample, to form a set of contiguous sequences (contigs) representing the DNA in the sample.
- traditional approach: overlap-layout-consensus
overlap-layout-consensus approach (Batzoglou 2005), representing each read as a node and each detected overlap as an arc between the appropriate nodes.
Very short reads are not well suited to this traditional approach. Because of their length, they must be produced in large quantities and at greater coverage depths than traditional Sanger sequencing projects. The sheer number of reads makes the overlap graph, with one node per read, extremely large and lengthy to compute.
- de Bruijn graphs
The fundamental data structure in the de Bruijn graph is based on kmers, not reads, thus high redundancy is naturally handled by the graph without affecting the number of nodes.
The above descriptions are from Velvet, next cf. SPAdes. It seems that SPAdes improves on Velvet.
de Bruijn graphs
An
Fix a positive integer
D1. Define an initial graph
D2. Glue vertices of
- Multisized de Bruijn graphs
A vertex v in a graph G is called a hub if indegree(v) ≠ 1 or outdegree(v) ≠ 1. A directed path in G is called a hub-path(abbreviated h-path) if its starting and ending vertices are hubs and its intermediate vertices are not hubs.
In the de Bruijn graph DB(Reads,
defines an
We define Reads
We define the coverage of an edge in the de Bruijn graph DB(Reads,
The choice of
Given a positive integer
The multisized de Bruijn graph, DB(Reads,
- Practical paired de Bruijn graphs: k-bimer adjustment
focus on a library with bireads having insert sizes…
- Euler circuit
Let C be an Eulerian cycle in a multigraph G. The distance between edges a and b in cycle C is defined as the number of edges between Start(a) and Start(b) in the cycle C (Start(e) and End(e) refer to starting and ending vertices of an edge e). We define the distance between a pair of h-paths as the distance between their h-edges. When cycle C (in a de Bruijn graph) corresponds to a genome, the distance between edges corresponds to the genomic distance.
biread,transformation等等看着太费劲了,暂时放一放
简单来讲,用reads的k-mer构建de Bruijn graph,然后找欧拉回路拼接genome
此为SPAdes assembly 理论部分 。
References:
1.SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3342519/
2.Velvet: Algorithms for de novo short read assembly using de Bruijn graphs