bwa or bowtie提取双端比对序列及flagstat

bwa和bowtie是常用的用于比对的软件,使用来看bwa mem比对的敏感度比bowtie更高(可能是高很多),而且有一段read比对到多段contigs (primary and supplementary) 的功能。

提取双端比对序列

只需要在用完相应的比对软件后,使用samtools view-F选项删除flag对应的reads,flag=4(0x4)代表这条read没有比对到参考序列上,flag=8(0x8)代表这是单端比对上的read,12=4+8,因此使用-F 12可以保留双端比对上的序列。

bwa mem

1
2
3
4
5
6
7
8
9
bwa index $ref

# can map to more than one contig
bwa mem -t 8 $ref $InDir/$fq1 $InDir/$fq2 > out.sam

# `-q 30` reads mapping quality, `-F 12` paired-end,
# `-b` output bam, `-S` input format auto-detected
samtools view -@8 -q 30 -bS out.sam > out.bam
samtools view -@8 -F 12 -b out.bam > out2.bam

bowtie提取双端比对序列

1
2
3
4
5
6
bowtie2-build $ref index
bowtie2 -p 8 -x index \
-1 $InDir/$fq1 \
-2 $InDir/$fq2 -S out.sam

samtools view -@8 -F 12 -bS out.sam > out.bam

samtools flagstat总结比对结果

1
samtools flagstat out.bam
  • secondary:多重比对,有一个初始的比对,有0或多个secondary比对,在flag中表示为0x100
  • supplementary:不具有较大重叠的多个比对结果,一个拆成多个,其中一个作为代表性比对,其他作为supplementary,在flag中表示为0x800
  • duplicate:已标记的测序重复,一般需要软件进行标记类似于picard’s markDuplicates
  • paired in sequencing:标记为pair
  • properly paired:标记为pair且比对上

例如以下两个结果,第一个read1和read2数目相等,而第二个结果read1和read2数目不等。这里都是100%比对上是因为看的是samtools view筛选过后的bam文件,里面的reads全是比对上的。

1
2
3
4
5
6
7
8
9
10
11
12
13
39584227 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
134327 + 0 supplementary
0 + 0 duplicates
39584227 + 0 mapped (100.00% : N/A)
39449900 + 0 paired in sequencing
19724950 + 0 read1
19724950 + 0 read2
39320644 + 0 properly paired (99.67% : N/A)
39449900 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
1
2
3
4
5
6
7
8
9
10
11
12
13
39567072 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
125042 + 0 supplementary
0 + 0 duplicates
39567072 + 0 mapped (100.00% : N/A)
39442030 + 0 paired in sequencing
19724374 + 0 read1
19717656 + 0 read2
39318491 + 0 properly paired (99.69% : N/A)
39442030 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

bwa mem敏感度大大高于bowtie

如此还可以顺便观察bwa mem和bowtie比对结果的差异,对于两份的病毒测序数据。

bwa mem比对的结果:第一个样本99.59% ,第二个样本69.74%
bowtie比对的结果:第一个样本75.63% ,第二个样本48.42%

更具体的,对于第一个样本,

  • 使用bwa mem
1
2
bwa mem -t ${SLURM_CPUS_PER_TASK} $ref $InDir/$fq1 $InDir/$fq2 > out_bwa.sam
samtools view -@${SLURM_CPUS_PER_TASK} -q 30 -bS out_bwa.sam > out_bwa.bam

flagstat

1
2
3
4
5
6
7
echo "sam file"
samtools flagstat out_bwa.sam

echo "======================"

echo "bam file"
samtools flagstat out_bwa.bam

结果,筛选前39732983 reads mapped (99.59%),筛选后39715454 reads mapped。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
sam file
39896565 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
135539 + 0 supplementary
0 + 0 duplicates
39732983 + 0 mapped (99.59% : N/A)
39761026 + 0 paired in sequencing
19880513 + 0 read1
19880513 + 0 read2
39320644 + 0 properly paired (98.89% : N/A)
39449900 + 0 with itself and mate mapped
147544 + 0 singletons (0.37% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
======================
bam file
39715454 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
126123 + 0 supplementary
0 + 0 duplicates
39715454 + 0 mapped (100.00% : N/A)
39589331 + 0 paired in sequencing
19866816 + 0 read1
19722515 + 0 read2
39318491 + 0 properly paired (99.32% : N/A)
39442030 + 0 with itself and mate mapped
147301 + 0 singletons (0.37% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
  • 使用bowtie
1
2
3
4
bowtie2-build $ref index
bowtie2 -p ${SLURM_CPUS_PER_TASK} -x index \
-1 $InDir/$fq1 \
-2 $InDir/$fq2 -S out.sam

flagstat

1
samtools flagstat out.sam

比对结果为30071380 mapped (75.63%)

1
2
3
4
5
6
7
8
9
10
11
12
13
39761026 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
30071380 + 0 mapped (75.63% : N/A)
39761026 + 0 paired in sequencing
19880513 + 0 read1
19880513 + 0 read2
25906438 + 0 properly paired (65.16% : N/A)
29570050 + 0 with itself and mate mapped
501330 + 0 singletons (1.26% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

sam/bam文件中的标记在比对完成之后已经固定

而从后面这个例子我们看到用完比对质量-q 30后,即便加上了pair筛选-F 12,结果read1和read2数目仍然不相等。这是因为sam/bam文件中的标记在比对完成之后就已经固定,不会在后期的筛选中发生变化,所以即便当比对质量筛选后read已经不再成对,其中的flag标记依旧代表着双端比对上。

  • 只进行双端比对筛选
1
2
samtools view -@${SLURM_CPUS_PER_TASK} -F 12 -bS out.sam > outF12.bam
samtools flagstat outF12.bam > f.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
39584227 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
134327 + 0 supplementary
0 + 0 duplicates
39584227 + 0 mapped (100.00% : N/A)
39449900 + 0 paired in sequencing
19724950 + 0 read1
19724950 + 0 read2
39320644 + 0 properly paired (99.67% : N/A)
39449900 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
  • 只进行比对质量筛选
1
2
samtools view -@${SLURM_CPUS_PER_TASK} -q 30 -bS out.sam > outq30.bam
samtools flagstat outq30.bam >> f.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
39715454 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
126123 + 0 supplementary
0 + 0 duplicates
39715454 + 0 mapped (100.00% : N/A)
39589331 + 0 paired in sequencing
19866816 + 0 read1
19722515 + 0 read2
39318491 + 0 properly paired (99.32% : N/A)
39442030 + 0 with itself and mate mapped
147301 + 0 singletons (0.37% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
  • 质量筛选后进行双端比对筛选
1
2
samtools view -@${SLURM_CPUS_PER_TASK} -F 12 -b outq30.bam > outq30F12.bam
samtools flagstat outq30F12.bam >> f.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
39567072 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
125042 + 0 supplementary
0 + 0 duplicates
39567072 + 0 mapped (100.00% : N/A)
39442030 + 0 paired in sequencing
19724374 + 0 read1
19717656 + 0 read2
39318491 + 0 properly paired (99.69% : N/A)
39442030 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
  • 质量筛选后进行双端比对筛选(利用管道无中间输出文件)
1
2
3
samtools view -@${SLURM_CPUS_PER_TASK} -q 30 -bS out.sam\
| samtools view -@${SLURM_CPUS_PER_TASK} -F 12 -b - > outq30F12pipe.bam
samtools flagstat outq30F12pipe.bam >> f.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
39567072 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
125042 + 0 supplementary
0 + 0 duplicates
39567072 + 0 mapped (100.00% : N/A)
39442030 + 0 paired in sequencing
19724374 + 0 read1
19717656 + 0 read2
39318491 + 0 properly paired (99.69% : N/A)
39442030 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)