blastp的本地化使用

blastp在Windows系统电脑上的本地使用

下载blast到本地

在NCBI官网找到blast工具即可找到下载链接,根据提示一步步完成下载即可。
官网下载链接:blast tools
网站打开速度较慢,挂梯可能会好一点。

本地使用

建库

从uniprot下载所需参考蛋白质组的fast.a文件作为建库文件此处以人类蛋白质组为例

$ makeblastdb -in uniprot-proteome%3AUP000005640.fasta -dbtype prot -parse_seqids -hash_index -out human
# 在blast-2.6.0/bin文件夹下运行该命令或在makeblastdb前加上该程序所在的路径

# 参数说明
# -in 所需建库的参考蛋白质组
# -out 输出的库名
# -dbtype 蛋白质组用prot,核酸组用nucl
# parse_seqids => Parse Seq-ids in FASTA input
# -hash_index => Create index of sequence hash values

搜库比对

将query序列比对到参考序列。此处用小鼠蛋白质组和人蛋白质组进行比对,运行如下命令:

$ blastp.exe -query uniprot-proteome%3AUP000000589.fasta -db human -evalue 1e-3 -out blast.xml -outfmt "5" -num_alignments 10 -num_threads 2
#在blast-2.6.0/bin文件夹下运行该命令或在blastp.exe前加上该程序所在的路径

# 参数说明
# -query 输入文件名,也就是需要比对的序列文件
# -db 格式化后的数据库名称
# -evalue 设定输出结果中的e-value阈值
# -out 输出文件名
# -num_alignments 输出比对上的序列的最大值条目数
# -num_threads 线程数
# 此外还有:
# -num_descriptions 对比对上序列的描述信息,一般跟tabular格式连用
# -outfmt
#   0 = pairwise,
#   1 = query-anchored showing identities,
#   2 = query-anchored no identities,
#   3 = flat query-anchored, show identities,
#   4 = flat query-anchored, no identities,
#   5 = XML Blast output,
#   6 = tabular,
#   7 = tabular with comment lines,
#   8 = Text ASN.1,
#   9 = Binary ASN.1
#  10 = Comma-separated values

提取搜库结果中的信息

xml文件所含的信息

使用outfmt 5参数的话,会产生一个xml格式的文件,对比信息很完整。一个序列的完整比对信息如下所示:

<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>Query_1</Iteration_query-ID>
<Iteration_query-def>sp|Q62302|TX261_MOUSE Protein TEX261 OS=Mus musculus OX=10090 GN=Tex261 PE=2 SV=1</Iteration_query-def>
<Iteration_query-len>196</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>sp|Q6UWH6|TX261_HUMAN</Hit_id>
<Hit_def>Protein TEX261 OS=Homo sapiens OX=9606 GN=TEX261 PE=2 SV=1</Hit_def>
<Hit_accession>Q6UWH6</Hit_accession>
<Hit_len>196</Hit_len>
<Hit_hsps>
  <Hsp>
    <Hsp_num>1</Hsp_num>
    <Hsp_bit-score>391.734</Hsp_bit-score>
    <Hsp_score>1005</Hsp_score>
    <Hsp_evalue>8.09539e-141</Hsp_evalue>
    <Hsp_query-from>1</Hsp_query-from>
    <Hsp_query-to>196</Hsp_query-to>
    <Hsp_hit-from>1</Hsp_hit-from>
    <Hsp_hit-to>196</Hsp_hit-to>
    <Hsp_query-frame>0</Hsp_query-frame>
    <Hsp_hit-frame>0</Hsp_hit-frame>
    <Hsp_identity>195</Hsp_identity>
    <Hsp_positive>196</Hsp_positive>
    <Hsp_gaps>0</Hsp_gaps>
    <Hsp_align-len>196</Hsp_align-len>
    <Hsp_qseq>MWFMYVLSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIWFSTAVLIGLYVFERFPTSMIGVGLFTNLVYFGLLQTFPFIMLTSPNFILSCGLVVVNHYLAFQFFAEEYYPFSEVLAYFTFCLWIIPFAFFVSLSAGENVLPSTMQPGDDVVSNYFTKGKRGKRLGILVVFSFIKEAILPSRQKIY</Hsp_qseq>
    <Hsp_hseq>MWFMYLLSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIWFSTAVLIGLYVFERFPTSMIGVGLFTNLVYFGLLQTFPFIMLTSPNFILSCGLVVVNHYLAFQFFAEEYYPFSEVLAYFTFCLWIIPFAFFVSLSAGENVLPSTMQPGDDVVSNYFTKGKRGKRLGILVVFSFIKEAILPSRQKIY</Hsp_hseq>
    <Hsp_midline>MWFMY+LSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIWFSTAVLIGLYVFERFPTSMIGVGLFTNLVYFGLLQTFPFIMLTSPNFILSCGLVVVNHYLAFQFFAEEYYPFSEVLAYFTFCLWIIPFAFFVSLSAGENVLPSTMQPGDDVVSNYFTKGKRGKRLGILVVFSFIKEAILPSRQKIY</Hsp_midline>
  </Hsp>
</Hit_hsps>
</Hit>
<Hit>
<Hit_num>2</Hit_num>
<Hit_id>tr|U3KQ87|U3KQ87_HUMAN</Hit_id>
<Hit_def>Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1</Hit_def>
<Hit_accession>U3KQ87</Hit_accession>
<Hit_len>197</Hit_len>
<Hit_hsps>
  <Hsp>
    <Hsp_num>1</Hsp_num>
    <Hsp_bit-score>312.768</Hsp_bit-score>
    <Hsp_score>800</Hsp_score>
    <Hsp_evalue>1.35723e-109</Hsp_evalue>
    <Hsp_query-from>1</Hsp_query-from>
    <Hsp_query-to>158</Hsp_query-to>
    <Hsp_hit-from>1</Hsp_hit-from>
    <Hsp_hit-to>158</Hsp_hit-to>
    <Hsp_query-frame>0</Hsp_query-frame>
    <Hsp_hit-frame>0</Hsp_hit-frame>
    <Hsp_identity>157</Hsp_identity>
    <Hsp_positive>158</Hsp_positive>
    <Hsp_gaps>0</Hsp_gaps>
    <Hsp_align-len>158</Hsp_align-len>
    <Hsp_qseq>MWFMYVLSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIWFSTAVLIGLYVFERFPTSMIGVGLFTNLVYFGLLQTFPFIMLTSPNFILSCGLVVVNHYLAFQFFAEEYYPFSEVLAYFTFCLWIIPFAFFVSLSAGENVLPSTMQPG</Hsp_qseq>
    <Hsp_hseq>MWFMYLLSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIWFSTAVLIGLYVFERFPTSMIGVGLFTNLVYFGLLQTFPFIMLTSPNFILSCGLVVVNHYLAFQFFAEEYYPFSEVLAYFTFCLWIIPFAFFVSLSAGENVLPSTMQPG</Hsp_hseq>
    <Hsp_midline>MWFMY+LSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIWFSTAVLIGLYVFERFPTSMIGVGLFTNLVYFGLLQTFPFIMLTSPNFILSCGLVVVNHYLAFQFFAEEYYPFSEVLAYFTFCLWIIPFAFFVSLSAGENVLPSTMQPG</Hsp_midline>
  </Hsp>
</Hit_hsps>
</Hit>
<Hit>
<Hit_num>3</Hit_num>
<Hit_id>tr|F8WAR8|F8WAR8_HUMAN</Hit_id>
<Hit_def>Protein TEX261 OS=Homo sapiens OX=9606 GN=TEX261 PE=4 SV=1</Hit_def>
<Hit_accession>F8WAR8</Hit_accession>
<Hit_len>51</Hit_len>
<Hit_hsps>
  <Hsp>
    <Hsp_num>1</Hsp_num>
    <Hsp_bit-score>98.5969</Hsp_bit-score>
    <Hsp_score>244</Hsp_score>
    <Hsp_evalue>4.07837e-27</Hsp_evalue>
    <Hsp_query-from>1</Hsp_query-from>
    <Hsp_query-to>50</Hsp_query-to>
    <Hsp_hit-from>1</Hsp_hit-from>
    <Hsp_hit-to>50</Hsp_hit-to>
    <Hsp_query-frame>0</Hsp_query-frame>
    <Hsp_hit-frame>0</Hsp_hit-frame>
    <Hsp_identity>49</Hsp_identity>
    <Hsp_positive>50</Hsp_positive>
    <Hsp_gaps>0</Hsp_gaps>
    <Hsp_align-len>50</Hsp_align-len>
    <Hsp_qseq>MWFMYVLSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIW</Hsp_qseq>
    <Hsp_hseq>MWFMYLLSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIW</Hsp_hseq>
    <Hsp_midline>MWFMY+LSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIW</Hsp_midline>
  </Hsp>
</Hit_hsps>
</Hit>
<Hit>
<Hit_num>4</Hit_num>
<Hit_id>tr|U3KQC7|U3KQC7_HUMAN</Hit_id>
<Hit_def>Uncharacterized protein (Fragment) OS=Homo sapiens OX=9606 PE=4 SV=1</Hit_def>
<Hit_accession>U3KQC7</Hit_accession>
<Hit_len>49</Hit_len>
<Hit_hsps>
  <Hsp>
    <Hsp_num>1</Hsp_num>
    <Hsp_bit-score>91.2781</Hsp_bit-score>
    <Hsp_score>225</Hsp_score>
    <Hsp_evalue>2.69841e-24</Hsp_evalue>
    <Hsp_query-from>4</Hsp_query-from>
    <Hsp_query-to>50</Hsp_query-to>
    <Hsp_hit-from>2</Hsp_hit-from>
    <Hsp_hit-to>48</Hsp_hit-to>
    <Hsp_query-frame>0</Hsp_query-frame>
    <Hsp_hit-frame>0</Hsp_hit-frame>
    <Hsp_identity>46</Hsp_identity>
    <Hsp_positive>47</Hsp_positive>
    <Hsp_gaps>0</Hsp_gaps>
    <Hsp_align-len>47</Hsp_align-len>
    <Hsp_qseq>MYVLSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIW</Hsp_qseq>
    <Hsp_hseq>MYLLSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIW</Hsp_hseq>
    <Hsp_midline>MY+LSWLSLFIQVAFITLAVAAGLYYLAELIEEYTVATSRIIKYMIW</Hsp_midline>
  </Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
  <Statistics>
    <Statistics_db-num>95943</Statistics_db-num>
    <Statistics_db-len>38082498</Statistics_db-len>
    <Statistics_hsp-len>101</Statistics_hsp-len>
    <Statistics_eff-space>2697264225</Statistics_eff-space>
    <Statistics_kappa>0.041</Statistics_kappa>
    <Statistics_lambda>0.267</Statistics_lambda>
    <Statistics_entropy>0.14</Statistics_entropy>
  </Statistics>
</Iteration_stat>
</Iteration>

提取信息

观察序列信息的各种标识,可从中提取有用的信息。下面是以julia语言写的一个简单粗暴提取比对上的两个蛋白(Accession)以及打分信息的脚本。

# Julia language
function main()
    ioBlast = open("blast.xml", "r") 
    # 读BLAST结果
    ioBPout = open("blastresult.txt", "w") 
    # 将提取的信息写入该文件
    # 自行选择方便后续步骤的文件格式
    
    write(ioBPout, "Mouse\tHuman\tHsp_bit-score\thspscore\t") # 信息表头

    global Mouse_P, Human_P, bitscore, hspscore 
    # 声明变量为global,便于后续的步骤

    while !eof(ioBlast)
        # 按行读文件,若有所需信息则提取,没有则继续读取下一行
        line = readline(ioBlast)
        if occursin(r"<Iteration_query-def>", line)
            (a, Mouse_P, c) = split(line, "|")
        end
        if occursin(r"<Hit_id>", line)
            (a, Human_P, c) = split(line, "|")
        end
        if occursin(r"<Hsp_bit-score>", line)
            (a, temps) = split(line, ">")
            (bitscore, c) = split(temps, "<")
        end
        if occursin(r"<Hsp_score>", line)
            (a, temps) = split(line, ">")
            (hspscore, c) = split(temps, "<")
        end
        if occursin(r"</Hit>", line)
            write(ioBPout, Mouse_P, "\t", Human_P, "\t", bitscore, "\t", hspscore, "\n")
        end
    end

    close(ioBlast)
    close(ioBPout)            
end

main() # 运行上述函数

参考:

  1. BLAST本地化使用
  2. Blast+ xml格式解读