WGS and Tools

Posted on 2020-02-23 Edited on 2020-03-17

Whole-genome sequencing (WGS) is a comprehensive method for analyzing entire genomes. It is commonly associated with sequencing human genomes, but next-generation sequencing (NGS) technology makes it useful for sequencing any species, such as agriculturally important livestock, plants, or disease-related microbes.

WGS process

1. DNA extraction: After bacterial culture, scientists take bacterial cells from an agar plate and treat them with chemicals that break them open, releasing the DNA. The DNA is then purified.
2. DNA shearing: Scientists use molecular scissors (enzymes) to cut the DNA, which is composed of millions of bases (A’s, C’s, T’s and G’s), into pieces that are small enough for the sequencing machine to read.
3. Tagmentation and PCR amplification: Scientists make many copies of each DNA fragment using a process called polymerase chain reaction (PCR). In PCR, scientists add small DNA tags, or bar codes, to identify which piece of sheared DNA belongs to which bacteria. This is similar to how a bar code identifies a product at a grocery store. The pool of fragments generated in a PCR machine is called a DNA library.
4. DNA library sequencing: The DNA library is loaded onto a sequencer. The combination of nucleotides (A, T, C, and G) making up each individual fragment of DNA is determined, and each result is called a DNA read.
5. DNA sequence analysis: The sequencer produces millions of DNA reads, and specialized computer programs are used to put them together in the correct order, like pieces of a jigsaw puzzle. When completed, the genome sequence containing millions of nucleotides (in one or a few large pieces) is ready for further analysis.
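
To make the jigsaw-puzzle idea in step 5 concrete, here is a minimal toy sketch that repeatedly merges the two reads with the longest suffix/prefix overlap. It is only an illustration: the function names and the example reads are made up, and real assemblers such as SPAdes use far more sophisticated graph-based methods and handle sequencing errors, which this sketch does not.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Toy greedy assembly: merge the best-overlapping pair until none remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the pieces as separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical reads that tile a 13-base sequence.
contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])
print(contigs)
```

Reads that share no overlap of at least `min_len` bases are returned as separate pieces, which mirrors the "one or a few large pieces" outcome described in step 5.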

DNA reads -> DNA sequence analysis

e.g. SNP-Calling and Identification of Major Lineages

1. Reads are trimmed and filtered using Cutadapt (Martin, 2011) and Sickle (Joshi and Fass, 2011) (Sickle parameters: pe -q 20 -l 50).
2. Reads are aligned to the reference genome using Bowtie2 (Langmead and Salzberg, 2012) (alignment parameters: -X 2000 --no-mixed --very-sensitive --n-ceil 0,0.01 --un-conc).

Isolates for which more than 70% of reads aligned to the reference and which had average coverage of greater than 10 reads across the genome were included for analysis.
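
The inclusion rule above can be written as a small filter. This is a sketch under assumptions: the `AlignmentStats` record and its field names are hypothetical, and the numbers in the example are invented. Only the two thresholds (more than 70% of reads aligned, mean coverage greater than 10) come from the text.

```python
from dataclasses import dataclass


@dataclass
class AlignmentStats:
    """Per-isolate alignment summary (hypothetical structure)."""
    isolate: str
    total_reads: int
    aligned_reads: int
    mean_coverage: float


def passes_inclusion(stats, min_aligned_fraction=0.70, min_mean_coverage=10.0):
    """Keep an isolate only if it exceeds both thresholds (strictly)."""
    if stats.total_reads == 0:
        return False
    aligned_fraction = stats.aligned_reads / stats.total_reads
    return (aligned_fraction > min_aligned_fraction
            and stats.mean_coverage > min_mean_coverage)


# Invented example values, for illustration only.
isolates = [
    AlignmentStats("isolate_A", 1_000_000, 920_000, 35.2),
    AlignmentStats("isolate_B", 1_000_000, 650_000, 40.0),  # too few reads aligned
    AlignmentStats("isolate_C", 200_000, 190_000, 8.5),     # coverage too low
]
included = [s.isolate for s in isolates if passes_inclusion(s)]
print(included)
```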

3. Candidate SNPs were identified using SAMtools (Li et al., 2009) and filtered using methods from previous work (Lieberman et al., 2014).

For each SNP position identified, a nucleotide call was assigned to each isolate using the major allele call across reads for that isolate at that position. If fewer than 7 reads aligned to either forward or reverse strand of a position in an isolate, or the major allele frequency was smaller than 90%, an ambiguous call was assigned to the isolate at that SNP position.
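
The calling rule above can be sketched as a single function. Several details here are assumptions rather than the authors' implementation: the input is taken to be the bases observed on forward-strand and reverse-strand reads at one position in one isolate, the rule is read as requiring at least 7 reads on each strand, and `N` is used as the symbol for an ambiguous call.

```python
from collections import Counter


def call_allele(forward_bases, reverse_bases,
                min_reads_per_strand=7, min_major_freq=0.90, ambiguous="N"):
    """Return the major allele at one position, or `ambiguous` if support is weak."""
    # Too few reads on either strand -> ambiguous call.
    if (len(forward_bases) < min_reads_per_strand
            or len(reverse_bases) < min_reads_per_strand):
        return ambiguous
    counts = Counter(forward_bases) + Counter(reverse_bases)
    major_base, major_count = counts.most_common(1)[0]
    # Major allele frequency below the threshold -> ambiguous call.
    if major_count / sum(counts.values()) < min_major_freq:
        return ambiguous
    return major_base


# Invented read pileups, for illustration only.
print(call_allele("A" * 8, "A" * 9 + "G"))            # well supported
print(call_allele("A" * 6, "A" * 10))                 # too few forward reads
print(call_allele("A" * 8 + "GG", "A" * 8 + "GG"))    # major allele at 80%
```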

4. A neighbor-joining tree is generated from the concatenated list of variable positions from conserved genomic regions present in all B. fragilis isolates from all subjects.
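
Neighbor joining works from a matrix of pairwise distances between isolates. The sketch below computes such a matrix from concatenated SNP calls, as a hedged illustration of the input to step 4 rather than the tree-building step itself. The isolate names and sequences are invented, and skipping positions with an ambiguous call (`N`) is an assumption about how ambiguity might be handled.

```python
def snp_distance(seq_a, seq_b, ambiguous="N"):
    """Count positions where two isolates differ, skipping ambiguous calls."""
    if len(seq_a) != len(seq_b):
        raise ValueError("concatenated SNP strings must have equal length")
    return sum(
        1 for a, b in zip(seq_a, seq_b)
        if a != b and ambiguous not in (a, b)
    )


def distance_matrix(calls):
    """Return (names, matrix) of pairwise SNP distances for a dict of calls."""
    names = sorted(calls)
    matrix = [[snp_distance(calls[x], calls[y]) for y in names] for x in names]
    return names, matrix


# Invented calls at five variable positions, for illustration only.
calls = {
    "isolate_1": "ACGTA",
    "isolate_2": "ACGTT",
    "isolate_3": "NCTTT",
}
names, matrix = distance_matrix(calls)
for name, row in zip(names, matrix):
    print(name, row)
```

A matrix like this could then be handed to a phylogenetics package that implements neighbor joining.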

Software and Algorithms	Function	Citation	URL
Cutadapt (version 1.9.1)	Trim reads	Martin, 2011	https://cutadapt.readthedocs.io/en/stable/
Sickle	Filter	Joshi and Fass, 2011	https://github.com/najoshi/sickle
Bowtie2 (version 2.2.6)	Alignment	Langmead and Salzberg, 2012	http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
SAMtools (version 1.2)	Identify candidate SNPs	Li et al., 2009	http://samtools.sourceforge.net/
SPAdes (version 3.10.0)	Assembler	Bankevich et al., 2012	https://github.com/ablab/spades
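
As a rough sketch, the Sickle and Bowtie2 parameters quoted in steps 1 and 2 could be assembled into command lines like this. Only `pe -q 20 -l 50` and `-X 2000 --no-mixed --very-sensitive --n-ceil 0,0.01 --un-conc` come from the text. The file names, the index name, and the `-t sanger` quality type are placeholders or assumptions, and the input/output options (`-f`, `-r`, `-o`, `-p`, `-s`, `-x`, `-1`, `-2`, `-S`) are the tools' usual file options, not settings reported here. Cutadapt is left out because the adapter sequence is not given. The code only builds and prints the commands; it does not run them.

```python
import shlex


def sickle_command(fwd, rev, out_prefix):
    """Build a paired-end Sickle command with the quoted -q 20 -l 50 settings."""
    return [
        "sickle", "pe",
        "-f", fwd, "-r", rev,
        "-t", "sanger",                      # assumption: quality encoding
        "-o", f"{out_prefix}_1.fastq",
        "-p", f"{out_prefix}_2.fastq",
        "-s", f"{out_prefix}_singles.fastq",
        "-q", "20", "-l", "50",
    ]


def bowtie2_command(index, fwd, rev, sam_out, unaligned_path):
    """Build a Bowtie2 command with the quoted alignment parameters."""
    return [
        "bowtie2", "-x", index, "-1", fwd, "-2", rev,
        "-X", "2000", "--no-mixed", "--very-sensitive",
        "--n-ceil", "0,0.01",
        "--un-conc", unaligned_path,
        "-S", sam_out,
    ]


# Placeholder file names, for illustration only.
trim = sickle_command("sample_R1.fastq", "sample_R2.fastq", "sample_filtered")
align = bowtie2_command("reference_index", "sample_filtered_1.fastq",
                        "sample_filtered_2.fastq", "sample.sam",
                        "sample_unaligned.fastq")
print(shlex.join(trim))
print(shlex.join(align))
```

Building the commands as argument lists keeps them easy to inspect and safe to pass to `subprocess.run` later without shell quoting problems.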

Quantification and Statistical Analysis
Statistical significance: Fisher’s exact test, Mann-Whitney U test, chi-squared test, binomial test.
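
As one example of these tests, a two-sided Fisher’s exact test on a 2x2 table can be sketched with only the standard library. The counts below are invented and do not come from any study. In practice an established statistics package would normally be used instead of hand-rolled code.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def table_probability(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_probability(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    p_value = sum(
        table_probability(x)
        for x in range(lowest, highest + 1)
        if table_probability(x) <= observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Invented counts, for illustration only.
p = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p, 4))
```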

References:
1. https://www.illumina.com/techniques/sequencing/dna-sequencing/whole-genome-sequencing.html
2. https://www.cdc.gov/pulsenet/pdf/Genome-Sequencing-508c.pdf
3. https://www.cdc.gov/pulsenet/pathogens/wgs.html
4. Adaptive Evolution within Gut Microbiomes of Healthy People. https://www.sciencedirect.com/science/article/pii/S1931312819301593
5. A step-by-step beginner’s protocol for whole genome sequencing of human bacterial pathogens. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6706130/