1coreしか使ってくれないclassify-consensus-blastを並列実行して早くする

お題の通り、1coreしか使ってくれないQIIME2のconsensus BLASTをスピードアップするために並行実行するように工夫したよ。という記事です。

使用コア数次第ですが、24スレッドで7倍ほどスピードアップできました。

この記事の内容

はじめに

2023.2リリースのqiime2には、いくつかの相同性検索法が搭載されています。

世界的に見て最も一般的な相同性検索はBLASTといっても過言ではありませんし、QIIME2にも相同性検索ツールとしてBLAST検索ができるclassify-consensus-blastが提供されています。

BLAST+に実装されているBlastnとは動作が異なります。解析をする上で影響が大きいのは1コアでしか動作しないことです。

QIIME2でBlastがmultithreadingできない理由。
blastnの-subject optionと相性が悪い模様。https://t.co/dLW9HqshVI
— はじめての環境DNA (@eDNA_startup) March 30, 2023

ツールのissueによると、fastaファイルをそのままデータベースに使用するBLAST+のsubjectオプションが元凶のようです。

以下はQIIME2のBlastn実行コマンドの部分

https://github.com/qiime2/q2-feature-classifier/blob/2a86cf43a4290fd96784dca83bd55af5db574826/q2_feature_classifier/_blast.py#L62-L65

遅いなら外せばいいのでは？とも思いますが、subjectオプションがQIIME2で使われているのは理由があります。Qiime2では独自データ保存形式としてqzaを使用しています。

解析用データとメタデータが組み合わせたような構造を持つこの独自ファイル形式は、もちろんですがblastdbには対応しておらず、同梱できる配列データは基本的にFastaのみです。

そのため、QIIME2ではデータベースファイルもfasts として使用する必要があり、subjectオプションの使用が必須となるため、classify-consensus-blastは1core使用に限定されているというのが現状です

Blast検索だけQIIME2外でBlast+を使用して、解析を進めることもできますが、冗長性・互換性、後続の解析などを考慮すると、可能な限りQIIME2内でデータ処理するのが望ましいと思い、並列実行が可能な方法を模索しました。

全体の流れ

アプローチは至って簡単です。1coreでしか動作しないのであれば、使用可能コアの分だけデータを分割した後に結果を結合すれば並列処理したのと同じだろうという考えで進めています。

以下のステップで、こちらのgithubリポジトリに用意したDemoデータを用いて説明していきます。•••のマークからダウンロードして以下に説明するフォルダに配置して下さい。

query配列の変換 qza -> fasta
seqkitによる分割 split
qiime importによる変換 fasta -> qza
reference データベースファイルの変換 fasta, tsv -> qza
parallelを用いたclassify-consensus-blast
Blast結果の変換 -> 結合 -> 変換 qza -> tsv -> qza

GitHubリポジトリにあるリファレンスデータはこちらの記事で作成したものです。

準備と用意するもの

以下の3データを用意します。

Denoisngまたはclustering後のquery配列 : (qza)
Referenceとなる配列データ : (qza or fasta)
Referenceとなるtaxonomyデータ : (qza or tsv)

使用したソフトウェアは以下の通り

seqkit (Version: 2.3.0 or later)
qiime2 (2022.11 or later)
parallel (20221122 or later)

qiime2用の仮想環境は用意しておきましょう(以下の記事のqiime2 core 2022.11 distributionをcondaの仮想環境にインストールまでを参照)。

q2reformatフォルダをデスクトップ上のbioinfoフォルダ内に作成します。その後、デスクトップ上でコマンドラインを開いて以下のように入力。

mkdir -p ./bioinfo/q2reformat && cd ./bioinfo/q2reformat

1	mkdir -p ./bioinfo/q2reformat && cd ./bioinfo/q2reformat

次に出力フォルダ等を作成

mkdir -p ./01output/qza ./01output/qzv ./01output/other ./01output/other/blast_res ./02ref

1	mkdir -p ./01output/qza ./01output/qzv ./01output/other ./01output/other/blast_res ./02ref

q2reformatをベースとして、01outputのqza, qzv, 02refフォルダにデータを出力していきます。

Referenceとなる2つのデータファイルは、解凍済みの状態で02refフォルダに、query配列ファイルは01output/qzaに配置します。

ここの段階で以下のようなフォルダ構造になっていればOK。

tree

tree

.
├── 01output
│   ├── blast_res
│   ├── other
│   ├── qza
│   │   └── 03filtered-rep-seqs.qza
│   └── qzv
└── 02ref
    ├── 16S_ribosomal_RNA_seq_acc.fasta
    └── 16S_ribosomal_RNA_tax_qiime2-form.tsv

├── 01output

│ ├── blast_res

│ ├── other

│ ├── qza

│ │ └── 03filtered-rep-seqs.qza

│ └── qzv

└── 02ref

├── 16S_ribosomal_RNA_seq_acc.fasta

└── 16S_ribosomal_RNA_tax_qiime2-form.tsv

query配列の変換 qza -> fasta

Blastnを並列実行できるように配列データを分割していきます。seqkitで分割できる形式にするために、qzaファイルをfastaファイルへ変換します。

qiime tools export --input-path ./01output/qza/03filtered-rep-seqs.qza --output-path ./01output/other/

1	qiime tools export --input-path ./01output/qza/03filtered-rep-seqs.qza --output-path ./01output/other/

otherフォルダにdna-sequences.fastaファイルが出力されればOKです。

seqkitによる分割 : split

query配列を分割していきます。分割する数は多いほうが同時に多くのデータを並列処理できるので自身のPCのthread数に応じて適宜変更してください。

thread=24
seqkit split ./01output/other/dna-sequences.fasta -f -w 0 -2 -p ${thread} -O ./01output/other/sep_seq/

1 2	thread=24 seqkit split ./01output/other/dna-sequences.fasta -f -w 0 -2 -p ${thread} -O ./01output/other/sep_seq/

otherフォルダ下にsep_seqというフォルダが生成され、その中にdna-sequences.part_0XX.fastaという24個のfastaファイルが生成されていればOKです。

qiime importによる変換 : fasta -> qza

分割したfastaファイルをQIIME2で扱えるように、importします。

ls ./01output/other/sep_seq/*fasta | sed -e 's/\.fasta//g' -e 's/\.\/01output\/other\/sep_seq\///g' | \
while read fasta; do \
qiime tools import \
--input-path ./01output/other/sep_seq/${fasta}.fasta \
--output-path ./01output/other/sep_seq/${fasta}.qza \
--input-format "DNAFASTAFormat" \
--type "FeatureData[Sequence]"; done

ls ./01output/other/sep_seq/*fasta | sed -e 's/\.fasta//g' -e 's/\.\/01output\/other\/sep_seq\///g' | \

while read fasta; do \

qiime tools import \

--input-path ./01output/other/sep_seq/${fasta}.fasta \

--output-path ./01output/other/sep_seq/${fasta}.qza \

--input-format "DNAFASTAFormat" \

--type "FeatureData[Sequence]"; done

otherフォルダ下のsep_seqフォルダ内にdna-sequences.part_0XX.qzaというファイルが生成されていればOKです。

reference データベースファイルの変換 : fasta, tsv -> qza

fasta形式のreference 配列データベースファイルをqzaファイルに変換します。

qiime tools import \
--input-path ./02ref/16S_ribosomal_RNA_seq_acc.fasta \
--output-path ./02ref/16S_ribosomal_RNA_seq_acc.qza \
--input-format "DNAFASTAFormat" \
--type "FeatureData[Sequence]"

qiime tools import \

--input-path ./02ref/16S_ribosomal_RNA_seq_acc.fasta \

--output-path ./02ref/16S_ribosomal_RNA_seq_acc.qza \

--input-format "DNAFASTAFormat" \

--type "FeatureData[Sequence]"

識別子と系統分類情報が対応する形になっているtsvファイルをqzaファイルに変換します。

qiime tools import \
--input-path ./02ref/16S_ribosomal_RNA_tax_qiime2-form.tsv \
--output-path ./02ref/16S_ribosomal_RNA_qiime2-form.qza \
--type 'FeatureData[Taxonomy]' \
--input-format TSVTaxonomyFormat

qiime tools import \

--input-path ./02ref/16S_ribosomal_RNA_tax_qiime2-form.tsv \

--output-path ./02ref/16S_ribosomal_RNA_qiime2-form.qza \

--type 'FeatureData[Taxonomy]' \

--input-format TSVTaxonomyFormat

02refフォルダ下に16S_ribosomal_RNA_seq_acc.qzaと16S_ribosomal_RNA_qiime2-form.qzaが生成されていればOKです。

parallelを用いた`classify-consensus-blast`

並列実行するためにGNU parallelを使用してclassify-consensus-blastを実行します。並行実行数は分割したfastaファイルの数としておきます。

thread=24
time ls ./01output/other/sep_seq/*qza | sed -e 's/\.qza//g'  -e 's/\.\/01output\/other\/sep_seq\///g' | \
parallel -j ${thread} \
qiime feature-classifier classify-consensus-blast \
--i-query ./01output/other/sep_seq/{}.qza \
--i-reference-reads ./02ref/16S_ribosomal_RNA_seq_acc.qza \
--i-reference-taxonomy ./02ref/16S_ribosomal_RNA_qiime2-form.qza \
--o-classification ./01output/qza/Blasted_res_{}.qza \
--o-search-results ./01output/other/blast_res/{}

thread=24

time ls ./01output/other/sep_seq/*qza | sed -e 's/\.qza//g' -e 's/\.\/01output\/other\/sep_seq\///g' | \

parallel -j ${thread} \

qiime feature-classifier classify-consensus-blast \

--i-query ./01output/other/sep_seq/{}.qza \

--i-reference-reads ./02ref/16S_ribosomal_RNA_seq_acc.qza \

--i-reference-taxonomy ./02ref/16S_ribosomal_RNA_qiime2-form.qza \

--o-classification ./01output/qza/Blasted_res_{}.qza \

--o-search-results ./01output/other/blast_res/{}

blast_resフォルダ下にdna-sequences.part_0XX.qzaというフォルダが、qzaフォルダ下にはBlasted_res_dna-sequences.part_0XX.qzaファイルが生成されていればOKです。7分程度で終了しました。

Blast結果の変換 -> 結合 -> 変換 : qza -> tsv -> qza

QIIME2 classify-consensus-blastの出力formatはblastnの-outfmt 6と同じでヘッダーなしタブ区切りテキストです。

そのため、出力結果をcatで結合した後にqza形式に変換すれば1つのquery配列データに対してBlastnを実行したのと同じ結果が得られるはずです。

まず、Blastnの出力結果qzaファイルをtsv形式に変換します。

ls ./01output/other/blast_res/*qza | \
sed -e 's/\.qza//g'  -e 's/\.\/01output\/other\/blast_res\///g' | \
while read blast; do
qiime tools export \
--input-path ./01output/other/blast_res/${blast}.qza \
--output-path ./01output/other/blast_res/${blast}
done

ls ./01output/other/blast_res/*qza | \

sed -e 's/\.qza//g' -e 's/\.\/01output\/other\/blast_res\///g' | \

while read blast; do

qiime tools export \

--input-path ./01output/other/blast_res/${blast}.qza \

--output-path ./01output/other/blast_res/${blast}

done

次に、分割されているblastnの結果ファイルをcatで結合していきます。

find ./01output/other/blast_res/dna-sequences.part_*/* -type f | xargs cat | sort > ./01output/other/cat_blastres.tsv

1	find ./01output/other/blast_res/dna-sequences.part_/ -type f \| xargs cat \| sort > ./01output/other/cat_blastres.tsv

出力結果を確認してみます。

cat ./01output/other/cat_blastres.tsv | head

1	cat ./01output/other/cat_blastres.tsv \| head

0000c6ca71c5631381dff794f8f07d11	NR_041461.1	99.294	425.0	3.0	0.0	1.0	425.0	356.0	780.0	0.0	769.0
0001eea7ab29bb289d012e6bd8758221	NR_149795.1	86.588	425.0	56.0	1.0	1.0	425.0	330.0	753.0	1.61e-131	468.0
00020968a5f8e5ce02eaa30626ac43bb	NR_027557.1	99.261	406.0	3.0	0.0	1.0	406.0	360.0	765.0	0.0	734.0
000346ff3a3128c08a6b4b637358d543	NR_026477.1	93.503	431.0	24.0	2.0	1.0	429.0	335.0	763.0	0.0	638.0
00039519843f135454ca0908ec384087	NR_134146.1	99.068	429.0	4.0	0.0	1.0	429.0	333.0	761.0	0.0	771.0
0004054290ba0d5d360eb24e694a02bd	NR_042118.1	97.052	407.0	11.0	1.0	1.0	406.0	314.0	720.0	0.0	684.0
00053abd1cbeb6e5ac9946bf848d2256	NR_026119.1	98.824	425.0	5.0	0.0	1.0	425.0	330.0	754.0	0.0	760.0
0007a473f51135f58709ea4eae46b0f8	NR_025481.1	99.535	430.0	2.0	0.0	1.0	430.0	353.0	782.0	0.0	784.0
00086988cca401483030b83507e289fe	NR_156071.1	88.077	260.0	29.0	2.0	1.0	259.0	517.0	775.0	2.18e-83	307.0
000a711a938da0af00c82531e7728fca	NR_113355.1	98.592	426.0	6.0	0.0	1.0	426.0	373.0	798.0	0.0	754.0

0000c6ca71c5631381dff794f8f07d11 NR_041461.1 99.294 425.0 3.0 0.0 1.0 425.0 356.0 780.0 0.0 769.0

0001eea7ab29bb289d012e6bd8758221 NR_149795.1 86.588 425.0 56.0 1.0 1.0 425.0 330.0 753.0 1.61e-131 468.0

00020968a5f8e5ce02eaa30626ac43bb NR_027557.1 99.261 406.0 3.0 0.0 1.0 406.0 360.0 765.0 0.0 734.0

000346ff3a3128c08a6b4b637358d543 NR_026477.1 93.503 431.0 24.0 2.0 1.0 429.0 335.0 763.0 0.0 638.0

00039519843f135454ca0908ec384087 NR_134146.1 99.068 429.0 4.0 0.0 1.0 429.0 333.0 761.0 0.0 771.0

0004054290ba0d5d360eb24e694a02bd NR_042118.1 97.052 407.0 11.0 1.0 1.0 406.0 314.0 720.0 0.0 684.0

00053abd1cbeb6e5ac9946bf848d2256 NR_026119.1 98.824 425.0 5.0 0.0 1.0 425.0 330.0 754.0 0.0 760.0

0007a473f51135f58709ea4eae46b0f8 NR_025481.1 99.535 430.0 2.0 0.0 1.0 430.0 353.0 782.0 0.0 784.0

00086988cca401483030b83507e289fe NR_156071.1 88.077 260.0 29.0 2.0 1.0 259.0 517.0 775.0 2.18e-83 307.0

000a711a938da0af00c82531e7728fca NR_113355.1 98.592 426.0 6.0 0.0 1.0 426.0 373.0 798.0 0.0 754.0

ちゃんと検索はかかっていそうです。

最後にcatで結合したtsvファイルをqza形式に変換します。

qiime tools import \
--input-path ./01output/other/cat_blastres.tsv \
--output-path ./01output/qza/blast_results.qza \
--type "FeatureData[BLAST6]" \
--input-format "BLAST6Format"

qiime tools import \

--input-path ./01output/other/cat_blastres.tsv \

--output-path ./01output/qza/blast_results.qza \

--type "FeatureData[BLAST6]" \

--input-format "BLAST6Format"

qzaフォルダにblast_results.qzaというファイルが生成されていればOKです。

単一ファイルでclassify-consensus-blastを実行すると

ちなみに、seqkitでqueryファイルを分割せずにclassify-consensus-blastを実行すると87分かかりました。

qiime feature-classifier classify-consensus-blast \
--i-query ./01output/qza/03filtered-rep-seqs.qza \
--i-reference-reads ./02ref/16S_ribosomal_RNA_seq_acc.qza \
--i-reference-taxonomy ./02ref/16S_ribosomal_RNA_qiime2-form.qza \
--o-classification ./01output/qza/Blasted_res_single.qza \
--o-search-results ./01output/other/blast_res/Blasted_res_single

qiime feature-classifier classify-consensus-blast \

--i-query ./01output/qza/03filtered-rep-seqs.qza \

--i-reference-reads ./02ref/16S_ribosomal_RNA_seq_acc.qza \

--i-reference-taxonomy ./02ref/16S_ribosomal_RNA_qiime2-form.qza \

--o-classification ./01output/qza/Blasted_res_single.qza \

--o-search-results ./01output/other/blast_res/Blasted_res_single

classify-consensus-blastについて

通常のBLASTnは1つのquery配列に対して、e-valueやBitスコアに応じた分類情報が得られます。

一方、consensusメソッドを採用した分類法はBlastnで得られた結果の分類階級を、実行したコマンドパラメーターに応じて調整し、最終的には1系統の情報にして結果を出力します。

この論文によると、

perc_identityに指定した類似度でクエリ配列と一致した参照配列をmaxacceptsで指定した数まで出力
少なくともmin_consensusで指定した値で一致する分類学的系統を決定

します。

この時、min_consensusパラメーターが最終結果に大きくかかわってきます。

例えば、maxaccepts = 3、min_consensus = 0.51(default)で分類された以下のようなトップヒットの結果あった場合、

k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus; s__brevis

k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus; s__brevis

k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus; s__delbrueckii

最終的に割り当てられる分類学的系統は以下の通り。

k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus; s__brevis

ただし、min_consensus = 0.99とした場合には種レベルの同定ではコンセンサスが取れないので、属レベルまで分類階級を下げた以下の結果が最終的な系統分類の結果になります。

k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus

どれとも断言できないなら、保守的に同定しようという考え方ですね。

まとめコード

q2formatフォルダにqiime2-blast.shファイルを配置して、bash qiime2-blast.shで実行可能なはずです(実行時間10分程度)。

#!/bin/bash
set -e
# make output directory
mkdir -p ./01output/qza ./01output/qzv ./01output/other ./01output/other/blast_res ./02ref 
# set variable
out_qza='01output/qza'
out_qzv='01output/qzv' 
out_other='01output/other'
blast_res='01output/other/blast_res' 
ref='02ref' 
# Export query sequence
qiime tools export \
--input-path ./${out_qza}/03filtered-rep-seqs.qza \
--output-path ./${out_other}/
# split seq
thread=24
seqkit split \
./${out_other}/dna-sequences.fasta \
-f  -w 0  -2  -p ${thread} \
-O ./${out_other}/sep_seq/
# import splited fasta
ls ./${out_other}/sep_seq/*fasta | \
sed -e 's/\.fasta//g' -e 's/\.\/01output\/other\/sep_seq\///g' | \
while read fasta; do \
qiime tools import \
--input-path ./${out_other}/sep_seq/${fasta}.fasta \
--output-path ./${out_other}/sep_seq/${fasta}.qza \
--input-format "DNAFASTAFormat" \
--type "FeatureData[Sequence]"; done
# import reference fasta and tsv file
qiime tools import \
--input-path ./${ref}/16S_ribosomal_RNA_seq_acc.fasta \
--output-path ./${ref}/16S_ribosomal_RNA_seq_acc.qza \
--input-format "DNAFASTAFormat" \
--type "FeatureData[Sequence]"
qiime tools import \
--input-path ./${ref}/16S_ribosomal_RNA_tax_qiime2-form.tsv \
--output-path ./${ref}/16S_ribosomal_RNA_qiime2-form.qza \
--type 'FeatureData[Taxonomy]' \
--input-format TSVTaxonomyFormat
# Parallelize classify-consensus-blast
thread=24
time ls ./${out_other}/sep_seq/*qza | sed -e 's/\.qza//g'  -e 's/\.\/01output\/other\/sep_seq\///g' | \
parallel -j ${thread} \
qiime feature-classifier classify-consensus-blast \
--i-query ./${out_other}/sep_seq/{}.qza \
--i-reference-reads ./${ref}/16S_ribosomal_RNA_seq_acc.qza \
--i-reference-taxonomy ./${ref}/16S_ribosomal_RNA_qiime2-form.qza \
--o-classification ./${out_qza}/Blasted_res_{}.qza \
--o-search-results ./${blast_res}/{}
# Convert blast results
ls ./${blast_res}/*qza | \
sed -e 's/\.qza//g'  -e 's/\.\/01output\/other\/blast_res\///g' | \
while read blast; do
qiime tools export \
--input-path ./${blast_res}/${blast}.qza \
--output-path ./${blast_res}/${blast}
done
# concatenate blast result files
find ./${blast_res}/dna-sequences.part_*/* -type f | xargs cat | sort > ./${out_other}/cat_blastres.tsv
# Import blast result tsv file
qiime tools import \
--input-path ./${out_other}/cat_blastres.tsv \
--output-path ./${out_qza}/blast_results.qza \
--type "FeatureData[BLAST6]" \
--input-format "BLAST6Format"

#!/bin/bash

set -e

# make output directory

mkdir -p ./01output/qza ./01output/qzv ./01output/other ./01output/other/blast_res ./02ref

# set variable

out_qza='01output/qza'

out_qzv='01output/qzv'

out_other='01output/other'

blast_res='01output/other/blast_res'

ref='02ref'

# Export query sequence

qiime tools export \

--input-path ./${out_qza}/03filtered-rep-seqs.qza \

--output-path ./${out_other}/

# split seq

thread=24

seqkit split \

./${out_other}/dna-sequences.fasta \

-f -w 0 -2 -p ${thread} \

-O ./${out_other}/sep_seq/

# import splited fasta

ls ./${out_other}/sep_seq/*fasta | \

sed -e 's/\.fasta//g' -e 's/\.\/01output\/other\/sep_seq\///g' | \

while read fasta; do \

qiime tools import \

--input-path ./${out_other}/sep_seq/${fasta}.fasta \

--output-path ./${out_other}/sep_seq/${fasta}.qza \

--input-format "DNAFASTAFormat" \

--type "FeatureData[Sequence]"; done

# import reference fasta and tsv file

qiime tools import \

--input-path ./${ref}/16S_ribosomal_RNA_seq_acc.fasta \

--output-path ./${ref}/16S_ribosomal_RNA_seq_acc.qza \

--input-format "DNAFASTAFormat" \

--type "FeatureData[Sequence]"

qiime tools import \

--input-path ./${ref}/16S_ribosomal_RNA_tax_qiime2-form.tsv \

--output-path ./${ref}/16S_ribosomal_RNA_qiime2-form.qza \

--type 'FeatureData[Taxonomy]' \

--input-format TSVTaxonomyFormat

# Parallelize classify-consensus-blast

thread=24

time ls ./${out_other}/sep_seq/*qza | sed -e 's/\.qza//g' -e 's/\.\/01output\/other\/sep_seq\///g' | \

parallel -j ${thread} \

qiime feature-classifier classify-consensus-blast \

--i-query ./${out_other}/sep_seq/{}.qza \

--i-reference-reads ./${ref}/16S_ribosomal_RNA_seq_acc.qza \

--i-reference-taxonomy ./${ref}/16S_ribosomal_RNA_qiime2-form.qza \

--o-classification ./${out_qza}/Blasted_res_{}.qza \

--o-search-results ./${blast_res}/{}

# Convert blast results

ls ./${blast_res}/*qza | \

sed -e 's/\.qza//g' -e 's/\.\/01output\/other\/blast_res\///g' | \

while read blast; do

qiime tools export \

--input-path ./${blast_res}/${blast}.qza \

--output-path ./${blast_res}/${blast}

done

# concatenate blast result files

find ./${blast_res}/dna-sequences.part_*/* -type f | xargs cat | sort > ./${out_other}/cat_blastres.tsv

# Import blast result tsv file

qiime tools import \

--input-path ./${out_other}/cat_blastres.tsv \

--output-path ./${out_qza}/blast_results.qza \

--type "FeatureData[BLAST6]" \

--input-format "BLAST6Format"

免責事項

本記事の情報・内容については可能な限り十分注意は払っていますが、正確性や完全性、妥当性および公平性について保証するものではありません。本記事の利用や閲覧によって生じたいかなる損害について責任を負いません。また、本記事の情報は予告なく変更される場合がありますので、ご理解くださいますようお願い申し上げます。

データ解析