blast --从建库到分析

今天是除夕,新年快乐呀!

本文讲述了 blast 从建库到分析的全过程,并包括中间可能会遇到的一些问题和操作。

Step 0: install blast

BLAST 包括:makeblastdb, blastn, blastp, blastx, tblastn, tblastx 等。

1
conda install -c bioconda blast
Program query database alignment level
blastn nucleotide nucleotide nucleotide
blastp peptide peptide peptide
blastx nucleotide peptide peptide
tblastn peptide nucleotide peptide
tblastx nucleotide nucleotide peptide

Step 1: build database

NCBI 手册上说,
The makeblastdb application produces BLAST databases from FASTA files.
Assigning a unique identifier to every sequence:

  • begin with “>”
  • not use “|” in the indentifier

在现有需要建库的 reads 的 fa 文件中,它的 id 长下面这样

1
2
sed -n 1p dr.fsa           
>CP009803.1|CM002272.1|CP029339.1
  • 换成 _
1
2
3
sed -i 's/原字符串/替换字符串/g' filename
# mac unix + ''
sed -i '' 's/|/_/g' dr.fsa

结尾加g:替换每一行匹配的每个字符,否则只替换匹配到的第一个

1
2
3
sed -i '' 's/|/_/g' dr.fsa
sed -n 1p dr.fsa
>CP009803.1_CM002272.1_CP029339.1

尝试建库 makeblastdb

1
makeblastdb -in dr.fsa -dbtype nucl -out dr -parse_seqids

但是报错:id 太长了……

1
BLAST Database creation error: Near line 17, the local id is too long.  Its length is 87 but the maximum allowed local id length is 50.  Please find and correct all local ids that are too long.
  • 决定把 _ 及之后的都删掉
1
sed -i 's/\_.*//' dr.fsa

再次建库

1
makeblastdb -in dr.fsa -dbtype nucl -out dr -parse_seqids

报错:需要 id unique……

1
2
3
BLAST Database creation error: Error: Duplicate seq_ids are found: 
GB|CP021130.1
make a unique identifier for each sequence
  • bash 命令,使得 id unique
1
awk '{if ($0~/^>/) {h[$1]++; $1=$1"_" h[$1]} print}' dr.fsa > dr.fa
  • 建库
1
makeblastdb -in dr.fa -dbtype nucl -out repeats -parse_seqids
  • -in 建库输入的fasta序列
  • -dbtype 序列类型,nucl为核酸,prot为蛋白
  • -title 数据库名字
  • -parse_seqids 推荐加上,似乎跟版本有关
  • -out 数据库名,blast+搜索时用到的-db的参数
  • -logfile 日志文件,默认输出到屏幕

建库完成,日志信息如下

1
2
3
4
5
6
7
Building a new DB, current time: 01/25/2022 16:46:10
New DB name: [我的目录]/repeats
New DB title: dr.fa
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 26958 sequences in 0.460787 seconds.

Step 2: blastn 测试

  • 查看信息
1
blastdbcmd -db repeats -dbtype nucl -info
  • blastn 比对(query 序列为 q.fa),其他 programs 类似
1
2
blastn -db repeats -query q.fa > q.txt
blastn -db repeats -query q.fa -outfmt 6 > qf6.txt

由于序列比较短,the results got many ‘no hits found’.

-stask blastn-short may help

1
blastn -task blastn-short -db repeats -query q.fa -outfmt 6 > qqf6.txt

但是结果也太多了,设置筛选条件 -evalue

1
blastn -task blastn-short -db repeats -query q.fa -outfmt 6 -evalue 0.01 > qqf6.txt

但是对于完全匹配的输出也很多,而且我希望能按照 e-value 排序,使用 -max_target_seqs 输出 e-value 最小的几条序列。

1
blastn -task blastn-short -db repeats -query q.fa -outfmt 6 -evalue 0.01 -max_target_seqs 5 > qqqf6.txt

直接 1 条也行

1
blastn -task blastn-short -db repeats -query q.fa -outfmt 0 -max_target_seqs 1 > qqq.txt

Step 3: blast

用 python 从一份 csv 表格中准备比对序列

1
2
3
4
5
6
7
import pandas as pd
arr = pd.read_csv('../01_train_test/archaeal.csv')
file = open("../01_train_test/archaeal.txt","w")
for i in range(0,arr.shape[0]):
file.write('>' + arr.iloc[i]["contig"] +'\n')
file.write(arr.iloc[i]["repeats"] +'\n')
file.close()
  • make a unique identifier for each sequence
1
awk '{if ($0~/^>/) {h[$1]++; $1=$1"_" h[$1]} print}' archaeal.txt > archaeal.fa

但其实,一开始就可以用 python 处理 unique id,例如

1
2
3
4
5
6
newlist = []
mylist = list(arr['contig'])
for i, v in enumerate(mylist):
count = mylist[:i].count(v)
newlist.append(v + '_' + str(count + 1))
arr['contigs'] = newlist
  • 短序列比对 blastn -task blastn-short
1
blastn -task blastn-short -db repeats -query archaeal.fa -outfmt 6 -max_target_seqs 5 > blast_archaeal.txt

结果的每一列分别对应:qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

1
2
3
CP000742_1	CP009501.1_2	95.455	22	1	0	13	34	25	4	1.67e-04	36.2
CP000742_1 AP021879.1_7 95.455 22 1 0 13 34 25 4 1.67e-04 36.2
CP000742_1 AP021876.1_10 95.455 22 1 0 13 34 25 4 1.67e-04 36.2
  1. qseqid query or source (e.g., gene) sequence id
  2. sseqid subject or target (e.g., reference genome) sequence id
  3. pident percentage of identical matches
  4. length alignment length (sequence overlap)
  5. mismatch number of mismatches
  6. gapopen number of gap openings
  7. qstart start of alignment in query
  8. qend end of alignment in query
  9. sstart start of alignment in subject
  10. send end of alignment in subject
  11. evalue expect value
  12. bitscore bit score

Step 4: 分析准备

1
2
3
4
import pandas as pd
data = pd.read_csv('../01_train_test/blast_archaeal.txt', sep="\t", header = None)
data.columns = ["qseqid", "sseqid", "pident", "length", "mismatch",\
"gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

可以只留下 evalue 最小的一项

1
data.drop_duplicates(subset='qseqid', keep="first")

然后这个表格就可以做后面的任意分析操作了~