SNP calling using VARSCAN

VarScan 用 Java 编写,命令行执行,输入pileup文件。

安装

https://sourceforge.net/projects/varscan/files/

下载相应版本VarScan.jar文件到指定位置如/public/home/huangsisi/bin/varscan/VarScan.v2.3.9.jar

使用方法

1
java -jar /public/home/huangsisi/bin/varscan/VarScan.v2.3.9.jar [命令]

call SNP流程

  • fastp质控
1
2
fastp -i ${InDir}/${fq1} -I ${InDir}/${fq2} -w ${SLURM_CPUS_PER_TASK} \
-o ${OutDir}/${fq1} -O ${OutDir}/${fq2}
  • bwa比对,要sort
1
2
3
4
5
6
# 建立index
bwa index ${ref}

bwa mem -M ${ref} $OutDir/$fq1 $OutDir/$fq2 |\
samtools view -hbS -F4 - |\
samtools sort -@ ${SLURM_CPUS_PER_TASK} - > sort.bam

-F4去掉比对不上的读,但可能非必要。

  • 生成pileup文件
1
samtools mpileup -f ${ref} sort.bam > sort.mpileup
  • varscan mpileup2snp call SNP
1
2
java -jar /public/home/huangsisi/bin/varscan/VarScan.v2.3.9.jar \
mpileup2snp sort.mpileup > mpileup_varscan_snp.vcf

后两步骤也可以合起来

1
2
3
samtools mpileup -f ${ref} sort.bam |\
java -jar /public/home/huangsisi/bin/varscan/VarScan.v2.3.9.jar\
mpileup2snp > variants.vcf

注意:我一开始使用samtools mpileup中加了-B参数,

-B, –no-BAQ disable BAQ (per-Base Alignment Quality)

即忽略了比对质量,导致多了很多不可靠突变,这是不可取的,但是作为探索似乎也能说明什么……

  • 个人完整版本
1
2
3
4
5
6
7
8
fastp -i ${InDir}/${fq1} -I ${InDir}/${fq2} -w ${SLURM_CPUS_PER_TASK} \
-o ${OutDir}/${fq1} -O ${OutDir}/${fq2}
bwa index ${ref}
bwa mem -M ${ref} $OutDir/$fq1 $OutDir/$fq2 | samtools view -hbS -F4 - | samtools sort -@ ${SLURM_CPUS_PER_TASK} - > ${OutDir}/sort.bam
samtools mpileup -f ${ref} ${OutDir}/sort.bam |\
java -jar /public/home/huangsisi/bin/varscan/VarScan.v2.3.9.jar \
mpileup2snp > ${OutDir}/sample.vcf \
--min-reads2 5 --min-var-freq 0.05 --strand-filter 0

输出结果

将以上流程应用到新冠病毒分离培养捕获测序样本(credit to jianglab),结果如下

1
2
3
4
5
6
Chrom   Position        Ref     Var     Cons:Cov:Reads1:Reads2:Freq:P-value     StrandFilter:R1+:R1-:R2+:R2-:pval       SamplesRef      SamplesHet      SamplesHom      SamplesNC       Cons:Cov:Reads1:Reads2:Freq:P-value
NC_045512.2 9 T A W:76:60:16:21.05%:6.3227E-6 Pass:1.0:58:2:16:0:6.2105E-1 0 1 0 0 W:76:60:16:21.05%:6.3227E-6
NC_045512.2 22303 T G G:7430:405:6962:93.7%:0E0 Pass:396:9:5709:1253:1E0 0 0 1 0 G:7430:405:6962:93.7%:0E0
NC_045512.2 27775 T C C:5853:159:5683:97.16%:0E0 Pass:157:2:4049:1634:1E0 0 0 1 0 C:5853:159:5683:97.16%:0E0
NC_045512.2 27776 T G G:5805:82:5637:97.17%:0E0 Pass:79:3:4011:1626:1E0 0 0 1 0 G:5805:82:5637:97.17%:0E0
NC_045512.2 27777 G A A:6421:163:6238:97.21%:0E0 Pass:157:6:4030:2208:1E0 0 0 1 0 A:6421:163:6238:97.21%:0E0

其中Reads1表示未发生突变数量,Reads2表示突变数量,R1+表示未发生突变的正义链数量,R1-、R2+、R2-类似理解,Freq为突变频率。(但是我算了一下有些不太对头,并不清楚每个值明确是怎样计算的,大致差不多)

可以看到高于90%的高频突变为

位置 参考序列碱基 突变
22303 T G
27775 T C
27776 T G
27777 G A

以及程序输出结果

1
2
3
4
5
6
7
8
9
10
11
12
Only SNPs will be reported
Warning: No p-value threshold provided, so p-values will not be calculated
Min coverage: 8
Min reads2: 2
Min var freq: 0.2
Min avg qual: 15
P-value thresh: 0.01
Reading input from STDIN
29903 bases in pileup file
6 variant positions (5 SNP, 1 indel)
0 were failed by the strand-filter
5 variant positions reported (5 SNP, 0 indel)

mpileup2snp选项说明

  • –min-coverage:最小读深度[8]
  • –min-reads2:呼叫突变的位置处的最小支持读数[2]
  • –min-avg-qual:在计算读次数的位置上的最低碱基质量[15]
  • –min-var-freq:最低突变等位基因频率阈值[0.01]
  • –min-freq-for-hom:呼叫纯合子的最低频率[0.75]
  • –p-value:呼叫突变的p值阈值[99e-02]
  • –strand-filter:忽略在一个链上支持> 90%的突变[1]
  • –output-vcf:如果设置为1,以VCF格式输出,不需要可以不设置
  • –variants:仅报告突变(SNP / indel)位置(仅适用于mpileup2cns)[0]

由于扩增子测序链常常有bias(正反链比例悬殊比如2205:72),因此采用--strand-filter 0,同时可以降低频率阈值查看更低频突变。例如

1
2
3
java -jar /public/home/huangsisi/bin/varscan/VarScan.v2.3.9.jar \
mpileup2snp sort.mpileup > mpileup_varscan_snp.vcf \
--min-var-freq 0.05 --strand-filter 0

此外,varscan除mpileup2snp外,还有

  • mpileup2indel:call indel
  • mpileup2cns:makes consensus calls (SNP/Indel/Reference)