bwa和bowtie是常用的用于比对的软件,使用来看bwa mem
比对的敏感度比bowtie
更高(可能是高很多),而且有一段read比对到多段contigs (primary and supplementary) 的功能。
提取双端比对序列
只需要在用完相应的比对软件后,使用samtools view
,-F
选项删除flag对应的reads,flag=4(0x4)代表这条read没有比对到参考序列上,flag=8(0x8)代表这是单端比对上的read,12=4+8,因此使用-F 12
可以保留双端比对上的序列。
bwa mem
1 2 3 4 5 6 7 8 9
| bwa index $ref
bwa mem -t 8 $ref $InDir/$fq1 $InDir/$fq2 > out.sam
samtools view -@8 -q 30 -bS out.sam > out.bam samtools view -@8 -F 12 -b out.bam > out2.bam
|
bowtie提取双端比对序列
1 2 3 4 5 6
| bowtie2-build $ref index bowtie2 -p 8 -x index \ -1 $InDir/$fq1 \ -2 $InDir/$fq2 -S out.sam
samtools view -@8 -F 12 -bS out.sam > out.bam
|
1
| samtools flagstat out.bam
|
- secondary:多重比对,有一个初始的比对,有0或多个secondary比对,在flag中表示为0x100
- supplementary:不具有较大重叠的多个比对结果,一个拆成多个,其中一个作为代表性比对,其他作为supplementary,在flag中表示为0x800
- duplicate:已标记的测序重复,一般需要软件进行标记类似于picard’s markDuplicates
- paired in sequencing:标记为pair
- properly paired:标记为pair且比对上
例如以下两个结果,第一个read1和read2数目相等,而第二个结果read1和read2数目不等。这里都是100%比对上是因为看的是samtools view筛选过后的bam文件,里面的reads全是比对上的。
1 2 3 4 5 6 7 8 9 10 11 12 13
| 39584227 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 134327 + 0 supplementary 0 + 0 duplicates 39584227 + 0 mapped (100.00% : N/A) 39449900 + 0 paired in sequencing 19724950 + 0 read1 19724950 + 0 read2 39320644 + 0 properly paired (99.67% : N/A) 39449900 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
|
1 2 3 4 5 6 7 8 9 10 11 12 13
| 39567072 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 125042 + 0 supplementary 0 + 0 duplicates 39567072 + 0 mapped (100.00% : N/A) 39442030 + 0 paired in sequencing 19724374 + 0 read1 19717656 + 0 read2 39318491 + 0 properly paired (99.69% : N/A) 39442030 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
|
bwa mem敏感度大大高于bowtie
如此还可以顺便观察bwa mem和bowtie比对结果的差异,对于两份的病毒测序数据。
bwa mem比对的结果:第一个样本99.59% ,第二个样本69.74%
bowtie比对的结果:第一个样本75.63% ,第二个样本48.42%
更具体的,对于第一个样本,
1 2
| bwa mem -t ${SLURM_CPUS_PER_TASK} $ref $InDir/$fq1 $InDir/$fq2 > out_bwa.sam samtools view -@${SLURM_CPUS_PER_TASK} -q 30 -bS out_bwa.sam > out_bwa.bam
|
flagstat
1 2 3 4 5 6 7
| echo "sam file" samtools flagstat out_bwa.sam
echo "======================"
echo "bam file" samtools flagstat out_bwa.bam
|
结果,筛选前39732983 reads mapped (99.59%),筛选后39715454 reads mapped。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
| sam file 39896565 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 135539 + 0 supplementary 0 + 0 duplicates 39732983 + 0 mapped (99.59% : N/A) 39761026 + 0 paired in sequencing 19880513 + 0 read1 19880513 + 0 read2 39320644 + 0 properly paired (98.89% : N/A) 39449900 + 0 with itself and mate mapped 147544 + 0 singletons (0.37% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) ====================== bam file 39715454 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 126123 + 0 supplementary 0 + 0 duplicates 39715454 + 0 mapped (100.00% : N/A) 39589331 + 0 paired in sequencing 19866816 + 0 read1 19722515 + 0 read2 39318491 + 0 properly paired (99.32% : N/A) 39442030 + 0 with itself and mate mapped 147301 + 0 singletons (0.37% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
|
1 2 3 4
| bowtie2-build $ref index bowtie2 -p ${SLURM_CPUS_PER_TASK} -x index \ -1 $InDir/$fq1 \ -2 $InDir/$fq2 -S out.sam
|
flagstat
1
| samtools flagstat out.sam
|
比对结果为30071380 mapped (75.63%)
1 2 3 4 5 6 7 8 9 10 11 12 13
| 39761026 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 30071380 + 0 mapped (75.63% : N/A) 39761026 + 0 paired in sequencing 19880513 + 0 read1 19880513 + 0 read2 25906438 + 0 properly paired (65.16% : N/A) 29570050 + 0 with itself and mate mapped 501330 + 0 singletons (1.26% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
|
sam/bam文件中的标记在比对完成之后已经固定
而从后面这个例子我们看到用完比对质量-q 30
后,即便加上了pair筛选-F 12
,结果read1和read2数目仍然不相等。这是因为sam/bam文件中的标记在比对完成之后就已经固定,不会在后期的筛选中发生变化,所以即便当比对质量筛选后read已经不再成对,其中的flag标记依旧代表着双端比对上。
1 2
| samtools view -@${SLURM_CPUS_PER_TASK} -F 12 -bS out.sam > outF12.bam samtools flagstat outF12.bam > f.txt
|
1 2 3 4 5 6 7 8 9 10 11 12 13
| 39584227 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 134327 + 0 supplementary 0 + 0 duplicates 39584227 + 0 mapped (100.00% : N/A) 39449900 + 0 paired in sequencing 19724950 + 0 read1 19724950 + 0 read2 39320644 + 0 properly paired (99.67% : N/A) 39449900 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
|
1 2
| samtools view -@${SLURM_CPUS_PER_TASK} -q 30 -bS out.sam > outq30.bam samtools flagstat outq30.bam >> f.txt
|
1 2 3 4 5 6 7 8 9 10 11 12 13
| 39715454 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 126123 + 0 supplementary 0 + 0 duplicates 39715454 + 0 mapped (100.00% : N/A) 39589331 + 0 paired in sequencing 19866816 + 0 read1 19722515 + 0 read2 39318491 + 0 properly paired (99.32% : N/A) 39442030 + 0 with itself and mate mapped 147301 + 0 singletons (0.37% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
|
1 2
| samtools view -@${SLURM_CPUS_PER_TASK} -F 12 -b outq30.bam > outq30F12.bam samtools flagstat outq30F12.bam >> f.txt
|
1 2 3 4 5 6 7 8 9 10 11 12 13
| 39567072 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 125042 + 0 supplementary 0 + 0 duplicates 39567072 + 0 mapped (100.00% : N/A) 39442030 + 0 paired in sequencing 19724374 + 0 read1 19717656 + 0 read2 39318491 + 0 properly paired (99.69% : N/A) 39442030 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
|
- 质量筛选后进行双端比对筛选(利用管道无中间输出文件)
1 2 3
| samtools view -@${SLURM_CPUS_PER_TASK} -q 30 -bS out.sam\ | samtools view -@${SLURM_CPUS_PER_TASK} -F 12 -b - > outq30F12pipe.bam samtools flagstat outq30F12pipe.bam >> f.txt
|
1 2 3 4 5 6 7 8 9 10 11 12 13
| 39567072 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 125042 + 0 supplementary 0 + 0 duplicates 39567072 + 0 mapped (100.00% : N/A) 39442030 + 0 paired in sequencing 19724374 + 0 read1 19717656 + 0 read2 39318491 + 0 properly paired (99.69% : N/A) 39442030 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
|