blastp的本地化使用
下载blast到本地
在NCBI官网找到blast工具即可找到下载链接,根据提示一步步完成下载即可。
官网下载链接:blast tools
网站打开速度较慢,挂梯可能会好一点。
本地使用
建库
从uniprot下载所需参考蛋白质组的fast.a文件作为建库文件此处以人类蛋白质组为例
$ makeblastdb -in uniprot-proteome%3AUP000005640.fasta -dbtype prot -parse_seqids -hash_index -out human
# 在blast-2.6.0/bin文件夹下运行该命令或在makeblastdb前加上该程序所在的路径
# 参数说明
# -in 所需建库的参考蛋白质组
# -out 输出的库名
# -dbtype 蛋白质组用prot,核酸组用nucl
# parse_seqids => Parse Seq-ids in FASTA input
# -hash_index => Create index of sequence hash values
搜库比对
将query序列比对到参考序列。此处用小鼠蛋白质组和人蛋白质组进行比对,运行如下命令:
$ blastp.exe -query uniprot-proteome%3AUP000000589.fasta -db human -evalue 1e-3 -out blast.xml -outfmt "5" -num_alignments 10 -num_threads 2
#在blast-2.6.0/bin文件夹下运行该命令或在blastp.exe前加上该程序所在的路径
# 参数说明
# -query 输入文件名,也就是需要比对的序列文件
# -db 格式化后的数据库名称
# -evalue 设定输出结果中的e-value阈值
# -out 输出文件名
# -num_alignments 输出比对上的序列的最大值条目数
# -num_threads 线程数
# 此外还有:
# -num_descriptions 对比对上序列的描述信息,一般跟tabular格式连用
# -outfmt
# 0 = pairwise,
# 1 = query-anchored showing identities,
# 2 = query-anchored no identities,
# 3 = flat query-anchored, show identities,
# 4 = flat query-anchored, no identities,
# 5 = XML Blast output,
# 6 = tabular,
# 7 = tabular with comment lines,
# 8 = Text ASN.1,
# 9 = Binary ASN.1
# 10 = Comma-separated values
提取搜库结果中的信息
xml文件所含的信息
使用outfmt 5参数的话,会产生一个xml格式的文件,对比信息很完整。一个序列的完整比对信息如下所示:
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>Query_1</Iteration_query-ID>
<Iteration_query-def>sp|Q62302|TX261_MOUSE Protein TEX261 OS=Mus musculus OX=10090 GN=Tex261 PE=2 SV=1</Iteration_query-def>
<Iteration_query-len>196</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>sp|Q6UWH6|TX261_HUMAN</Hit_id>
<Hit_def>Protein TEX261 OS=Homo sapiens OX=9606 GN=TEX261 PE=2 SV=1</Hit_def>
<Hit_accession>Q6UWH6</Hit_accession>
<Hit_len>196</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>391.734</Hsp_bit-score>
<Hsp_score>1005</Hsp_score>
<Hsp_evalue>8.09539e-141</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from>
<Hsp_query-to>196</Hsp_query-to>
<Hsp_hit-from>1</Hsp_hit-from>
<Hsp_hit-to>196</Hsp_hit-to>
<Hsp_query-frame>0</Hsp_query-frame>
<Hsp_hit-frame>0</Hsp_hit-frame>
<Hsp_identity>195</Hsp_identity>
<Hsp_positive>196</Hsp_positive>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>196</Hsp_align-len>
<Hsp_qseq>MWFMYVLSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIWFSTAVLIGLYVFERFPTSMIGVGLFTNLVYFGLLQTFPFIMLTSPNFILSCGLVVVNHYLAFQFFAEEYYPFSEVLAYFTFCLWIIPFAFFVSLSAGENVLPSTMQPGDDVVSNYFTKGKRGKRLGILVVFSFIKEAILPSRQKIY</Hsp_qseq>
<Hsp_hseq>MWFMYLLSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIWFSTAVLIGLYVFERFPTSMIGVGLFTNLVYFGLLQTFPFIMLTSPNFILSCGLVVVNHYLAFQFFAEEYYPFSEVLAYFTFCLWIIPFAFFVSLSAGENVLPSTMQPGDDVVSNYFTKGKRGKRLGILVVFSFIKEAILPSRQKIY</Hsp_hseq>
<Hsp_midline>MWFMY+LSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIWFSTAVLIGLYVFERFPTSMIGVGLFTNLVYFGLLQTFPFIMLTSPNFILSCGLVVVNHYLAFQFFAEEYYPFSEVLAYFTFCLWIIPFAFFVSLSAGENVLPSTMQPGDDVVSNYFTKGKRGKRLGILVVFSFIKEAILPSRQKIY</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
<Hit>
<Hit_num>2</Hit_num>
<Hit_id>tr|U3KQ87|U3KQ87_HUMAN</Hit_id>
<Hit_def>Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1</Hit_def>
<Hit_accession>U3KQ87</Hit_accession>
<Hit_len>197</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>312.768</Hsp_bit-score>
<Hsp_score>800</Hsp_score>
<Hsp_evalue>1.35723e-109</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from>
<Hsp_query-to>158</Hsp_query-to>
<Hsp_hit-from>1</Hsp_hit-from>
<Hsp_hit-to>158</Hsp_hit-to>
<Hsp_query-frame>0</Hsp_query-frame>
<Hsp_hit-frame>0</Hsp_hit-frame>
<Hsp_identity>157</Hsp_identity>
<Hsp_positive>158</Hsp_positive>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>158</Hsp_align-len>
<Hsp_qseq>MWFMYVLSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIWFSTAVLIGLYVFERFPTSMIGVGLFTNLVYFGLLQTFPFIMLTSPNFILSCGLVVVNHYLAFQFFAEEYYPFSEVLAYFTFCLWIIPFAFFVSLSAGENVLPSTMQPG</Hsp_qseq>
<Hsp_hseq>MWFMYLLSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIWFSTAVLIGLYVFERFPTSMIGVGLFTNLVYFGLLQTFPFIMLTSPNFILSCGLVVVNHYLAFQFFAEEYYPFSEVLAYFTFCLWIIPFAFFVSLSAGENVLPSTMQPG</Hsp_hseq>
<Hsp_midline>MWFMY+LSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIWFSTAVLIGLYVFERFPTSMIGVGLFTNLVYFGLLQTFPFIMLTSPNFILSCGLVVVNHYLAFQFFAEEYYPFSEVLAYFTFCLWIIPFAFFVSLSAGENVLPSTMQPG</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
<Hit>
<Hit_num>3</Hit_num>
<Hit_id>tr|F8WAR8|F8WAR8_HUMAN</Hit_id>
<Hit_def>Protein TEX261 OS=Homo sapiens OX=9606 GN=TEX261 PE=4 SV=1</Hit_def>
<Hit_accession>F8WAR8</Hit_accession>
<Hit_len>51</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>98.5969</Hsp_bit-score>
<Hsp_score>244</Hsp_score>
<Hsp_evalue>4.07837e-27</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from>
<Hsp_query-to>50</Hsp_query-to>
<Hsp_hit-from>1</Hsp_hit-from>
<Hsp_hit-to>50</Hsp_hit-to>
<Hsp_query-frame>0</Hsp_query-frame>
<Hsp_hit-frame>0</Hsp_hit-frame>
<Hsp_identity>49</Hsp_identity>
<Hsp_positive>50</Hsp_positive>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>50</Hsp_align-len>
<Hsp_qseq>MWFMYVLSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIW</Hsp_qseq>
<Hsp_hseq>MWFMYLLSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIW</Hsp_hseq>
<Hsp_midline>MWFMY+LSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIW</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
<Hit>
<Hit_num>4</Hit_num>
<Hit_id>tr|U3KQC7|U3KQC7_HUMAN</Hit_id>
<Hit_def>Uncharacterized protein (Fragment) OS=Homo sapiens OX=9606 PE=4 SV=1</Hit_def>
<Hit_accession>U3KQC7</Hit_accession>
<Hit_len>49</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>91.2781</Hsp_bit-score>
<Hsp_score>225</Hsp_score>
<Hsp_evalue>2.69841e-24</Hsp_evalue>
<Hsp_query-from>4</Hsp_query-from>
<Hsp_query-to>50</Hsp_query-to>
<Hsp_hit-from>2</Hsp_hit-from>
<Hsp_hit-to>48</Hsp_hit-to>
<Hsp_query-frame>0</Hsp_query-frame>
<Hsp_hit-frame>0</Hsp_hit-frame>
<Hsp_identity>46</Hsp_identity>
<Hsp_positive>47</Hsp_positive>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>47</Hsp_align-len>
<Hsp_qseq>MYVLSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIW</Hsp_qseq>
<Hsp_hseq>MYLLSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIW</Hsp_hseq>
<Hsp_midline>MY+LSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIW</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>95943</Statistics_db-num>
<Statistics_db-len>38082498</Statistics_db-len>
<Statistics_hsp-len>101</Statistics_hsp-len>
<Statistics_eff-space>2697264225</Statistics_eff-space>
<Statistics_kappa>0.041</Statistics_kappa>
<Statistics_lambda>0.267</Statistics_lambda>
<Statistics_entropy>0.14</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
提取信息
观察序列信息的各种标识,可从中提取有用的信息。下面是以julia语言写的一个简单粗暴提取比对上的两个蛋白(Accession)以及打分信息的脚本。
# Julia language
function main()
ioBlast = open("blast.xml", "r")
# 读BLAST结果
ioBPout = open("blastresult.txt", "w")
# 将提取的信息写入该文件
# 自行选择方便后续步骤的文件格式
write(ioBPout, "Mouse\tHuman\tHsp_bit-score\thspscore\t") # 信息表头
global Mouse_P, Human_P, bitscore, hspscore
# 声明变量为global,便于后续的步骤
while !eof(ioBlast)
# 按行读文件,若有所需信息则提取,没有则继续读取下一行
line = readline(ioBlast)
if occursin(r"<Iteration_query-def>", line)
(a, Mouse_P, c) = split(line, "|")
end
if occursin(r"<Hit_id>", line)
(a, Human_P, c) = split(line, "|")
end
if occursin(r"<Hsp_bit-score>", line)
(a, temps) = split(line, ">")
(bitscore, c) = split(temps, "<")
end
if occursin(r"<Hsp_score>", line)
(a, temps) = split(line, ">")
(hspscore, c) = split(temps, "<")
end
if occursin(r"</Hit>", line)
write(ioBPout, Mouse_P, "\t", Human_P, "\t", bitscore, "\t", hspscore, "\n")
end
end
close(ioBlast)
close(ioBPout)
end
main() # 运行上述函数