A detailed guide to DIAMOND, a fast alignment tool for large gene sequence datasets, including multi-node use on HPC clusters


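DIAMOND is a fast sequence alignment tool. It is used as follows:

1. Install DIAMOND:

Download a release from the GitHub releases page (https://github.com/bbuchfink/diamond/releases) and unpack it on your machine. Docker, conda, and building from source are also available, though they are rarely needed. Note that newer versions require a more recent gcc; if a gcc error appears, either download an older DIAMOND release or upgrade gcc to at least the required version.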

# Download the DIAMOND binary
wget http://github.com/bbuchfink/diamond/releases/download/v2.1.8/diamond-linux64.tar.gz
# For other versions, browse http://github.com/bbuchfink/diamond/releases/download/
# Unpacking yields a single file named diamond
tar -xzvf diamond-linux64.tar.gz
# Move it to a directory on PATH, add the current directory to PATH, or call diamond by its path:
diamond blastx
./diamond blastx
/opt/diamond blastx
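As a minimal sketch of the "add the directory to PATH" option, the snippet below prepends a directory only when it is not already present and then checks whether the binary resolves. The helper name `add_to_path` and the default location `$HOME/opt/diamond` are assumptions made for illustration, not part of DIAMOND.

```shell
# Hypothetical helper: prepend a directory to PATH unless it is already listed.
add_to_path() {
    local dir="$1"
    case ":$PATH:" in
        *":$dir:"*) ;;                 # already present, nothing to do
        *) PATH="$dir:$PATH" ;;
    esac
    export PATH
}

# Assumed install location; change it to wherever the tarball was unpacked.
DIAMOND_DIR="${DIAMOND_DIR:-$HOME/opt/diamond}"
add_to_path "$DIAMOND_DIR"

# Report whether the binary now resolves, without aborting when it does not.
if command -v diamond >/dev/null 2>&1; then
    diamond version
else
    echo "diamond not found on PATH (looked in $DIAMOND_DIR)" >&2
fi
```

Putting the two setup lines into a shell startup file would make the change persistent across logins.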

2. Prepare the data:

First prepare the sequence data for the alignment, for example sequence files in FASTA format.

# Download the nr database, or whichever database you need
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
gunzip nr.gz
# Build a DIAMOND-format database with diamond makedb
diamond makedb --in nr --db nr
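Building a database from nr takes a while, so it can be worth skipping the step when an up-to-date database already exists. The sketch below assumes that makedb appends the `.dmnd` extension to the name given with `--db` (the help output further down lists `--no-auto-append` as the switch that disables this). The function name `needs_rebuild` is made up for this example.

```shell
# Hypothetical helper: succeed (exit 0) when the .dmnd file is missing
# or older than the FASTA file it was built from.
needs_rebuild() {
    local fasta="$1"
    local db="$2"
    [ ! -f "$db.dmnd" ] || [ "$fasta" -nt "$db.dmnd" ]
}

FASTA=nr   # uncompressed FASTA file from the download step
DB=nr      # database name passed to --db

# Rebuild only when the input exists, diamond is installed, and the database is stale.
if [ -f "$FASTA" ] && command -v diamond >/dev/null 2>&1 && needs_rebuild "$FASTA" "$DB"; then
    diamond makedb --in "$FASTA" --db "$DB"
fi
```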

3. Run DIAMOND:

Basic usage

Enter the following command in a terminal to start DIAMOND and run an alignment:
diamond blastx -d [DIAMOND database] -q [query sequence file] -o [output file name]

# Example commands, using the nr database built in step 2
diamond blastx --db nr -q reads.fna -o dna_matches_fmt6.txt
diamond blastp --db nr -q reads.faa -o protein_matches_fmt6.txt

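Here blastx translates DNA query sequences and aligns them against a protein database, while blastp aligns protein queries directly. -d and -q specify the database and the query file, and -o names the output file.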
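Since the choice between blastx and blastp depends only on whether the query is DNA or protein, it can be guessed from the file itself. The following is a rough heuristic sketch, not a DIAMOND feature: it samples the first 2000 lines and treats the file as nucleotide when at least 90% of the sequence characters are A, C, G, T, U or N. The function name `pick_mode` and both thresholds are assumptions.

```shell
# Hypothetical helper: print "blastx" for nucleotide FASTA input, "blastp" otherwise.
pick_mode() {
    awk '
        NR > 2000 { exit }                  # sample only the start of the file
        /^>/      { next }                  # skip FASTA header lines
        {
            seq = toupper($0)
            total += length(seq)
            nuc += gsub(/[ACGTUN]/, "", seq)   # gsub returns the number of matches
        }
        END {
            if (total > 0 && nuc / total >= 0.9) print "blastx"
            else print "blastp"
        }
    ' "$1"
}

# Usage (left as a comment so nothing runs without real input):
# diamond "$(pick_mode reads.fna)" --db nr -q reads.fna -o matches.tsv
```

Short peptides made up mostly of alanine, cysteine, glycine and threonine would be misclassified, so the result is best treated as a hint.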

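Parallel computing across multiple compute nodes of an HPC cluster (a personal favorite): distributed computing

DIAMOND is fast, but large inputs are still expensive: a query file over 1 GB may take several days on a single 40-core node. When many nodes are available, the work can be spread across them with multi-node parallel computing, which helps enormously.

Preparation (important; the run fails without it):

1. Share the DIAMOND program directory across all nodes.

2. Share the sample sequence directory across all nodes.

3. Give all nodes the same shared temporary directory (the one passed to --parallel-tmpdir).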

# Diamond distributed-memory parallel processing
# Diamond supports the parallel processing of large alignments on HPC clusters and
# supercomputers, spanning numerous compute nodes. Work distribution is orchestrated by
# Diamond via a shared file system typically available on such clusters, using lightweight
# file-based stacks and POSIX functionality.

# Usage
# To run Diamond in parallel, two steps need to be performed. First, during a short
# initialization run using a single process, the query and database are scanned and chunks
# of work are written to the file-based stacks on the parallel file system. Second, the
# actual parallel run is performed, where multiple DIAMOND processes on different compute
# nodes work on chunks of the query and the reference database to perform alignments and
# joins.

# Initialization
# Author's note: initialize the job first. This only has to be done once, on the first node;
# the other nodes simply start the parallel-run command of the next step.
# The initialization of a parallel run should be done (e.g. interactively on a login node)
# using the parameters --multiprocessing --mp-init as follows:
diamond blastp --db DATABASE.dmnd --query QUERY.fasta --multiprocessing --mp-init --tmpdir $TMPDIR --parallel-tmpdir $PTMPDIR
# Here $TMPDIR refers to a local temporary directory, whereas $PTMPDIR refers to a
# directory in the parallel file system where the file-based stacks containing the work
# packages will be created. Note that the size of the chunking and thereby the number of
# work packages is controlled via the --block-size parameter.

# Parallel run
# Author's note: this is the actual parallel computation; it can be started on all compute nodes.
# The actual parallel run should be done using the parameter --multiprocessing as follows:
diamond blastp --db DATABASE.dmnd --query QUERY.fasta -o OUTPUT_FILE --multiprocessing --tmpdir $TMPDIR --parallel-tmpdir $PTMPDIR
# Author's note: the directories must match those of the initialization run, above all the
# shared temporary directory ($PTMPDIR).
# Note that $PTMPDIR must refer to the same location as used during the initialization,
# and it must be accessible from any of the compute nodes involved. To launch the parallel
# processes on many nodes, a batch system such as SLURM is typically used. For the output
# not a single stream is used but rather multiple files are created, one for each query
# chunk.

# SLURM batch file example
# Author's note: a SLURM script is the more convenient route on a SLURM cluster. Without
# SLURM, the two commands above are all you need.
# The following script shows an example of how a massively parallel run can be performed using
# the SLURM batch system on a supercomputer.
#!/bin/bash -l
#SBATCH -D ./
#SBATCH -J DIAMOND
#SBATCH --mem=185000
#SBATCH --nodes=520
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-core=2
#SBATCH --cpus-per-task=80
#SBATCH --mail-type=none
#SBATCH --time=24:00:00

module purge
module load gcc impi
export SLURM_HINT=multithread

### Author's note: what follows is commentary on the supercomputer setup; focus on the configuration above.
srun diamond FLAGS

FLAGS refers to the aforementioned parallel flags for Diamond. Note that the actual configuration of the nodes varies between machines, and therefore, the parameters shown here are not of general applicability. It is recommended to start with few nodes on small problems, first.

Abort and resume
Parallel runs can be aborted and later resumed, and unfinished work packages from a previous run can be recovered and resubmitted in a subsequent run.
Using the option --multiprocessing --mp-recover for the same value of --parallel-tmpdir will scan the working directory and configure a new parallel run including only the work packages that have not been completed in the previous run.
Placing a file stop in the working directory causes DIAMOND processes to shut down in a controlled way after finishing the current work package. After removing the stop file, the multiprocessing run can be continued.

Parameter optimization
The granularity of the size of the work packages can be adjusted via the --block-size which at the same time affects the memory requirements at runtime. Parallel runs on more than 512 nodes of a supercomputer have been performed successfully.
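The two-step procedure described above is easy to get wrong when the commands are typed by hand, because both steps must share the same database, query, and `--parallel-tmpdir`. The dry-run sketch below only prints the two commands so they can be reviewed or pasted into a job script. It executes nothing, and the function name `mp_commands` and all paths are placeholders chosen for illustration.

```shell
# Hypothetical helper: print the initialization and parallel-run commands.
# Arguments: database, query, output, local tmp dir, shared parallel tmp dir.
mp_commands() {
    local db="$1"
    local query="$2"
    local out="$3"
    local tmp="$4"
    local ptmp="$5"
    # Flags common to both steps, so the two commands cannot drift apart.
    local common="--db $db --query $query --multiprocessing --tmpdir $tmp --parallel-tmpdir $ptmp"
    echo "diamond blastp $common --mp-init"     # step 1: run once, on a single node
    echo "diamond blastp $common -o $out"       # step 2: run on every compute node
}

# Placeholder paths; replace them with real locations on the shared file system.
mp_commands /shared/db/nr.dmnd /shared/data/query.faa /shared/out/matches /local/tmp /shared/ptmp
```

Paths containing spaces would need extra quoting before the printed commands are reused.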

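4. Interpreting the results:

When the run finishes, the alignments can be inspected in the output file. It holds the hits for each query sequence, including the name of the matched reference sequence, the alignment positions, and the alignment scores.

The result fields are the same as in native BLAST output:

See: "Bioinformatics analysis: BLAST sequence alignment and a detailed explanation of its results" (CSDN blog)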
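A common first step with tabular output is to keep only the strongest hit per query. The sketch below assumes the default 12-column layout listed in the help output (qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore), so the e-value is column 11 and the bit score column 12. The function name `best_hits` and the default cutoff of 1e-5 are illustrative choices.

```shell
# Hypothetical helper: for each query keep the hit with the highest bit score,
# considering only hits whose e-value is at or below the cutoff.
# Arguments: tabular file, optional maximum e-value (default 1e-5).
best_hits() {
    local max_evalue="${2:-1e-5}"
    awk -F '\t' -v max_e="$max_evalue" '
        ($11 + 0) <= (max_e + 0) && (!($1 in best) || ($12 + 0) > score[$1]) {
            best[$1] = $0          # remember the whole line for this query
            score[$1] = $12 + 0    # and its bit score for later comparisons
        }
        END { for (q in best) print best[q] }
    ' "$1" | sort                  # awk iteration order is unspecified, so sort
}

# Usage (left as a comment so nothing runs without a real result file):
# best_hits protein_matches_fmt6.txt 1e-10 > best_hits.tsv
```

If the output was produced with a custom keyword list after `--outfmt 6`, the column numbers have to be adjusted accordingly.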

5. Help

That covers the basic usage of DIAMOND. For more detail, see the official documentation: https://github.com/bbuchfink/diamond/wiki.

# Download the tool
wget http://github.com/bbuchfink/diamond/releases/download/v2.1.8/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz
# Create a DIAMOND-formatted database file
./diamond makedb --in reference.fasta -d reference
# Run a search in blastp mode
./diamond blastp -d reference -q queries.fasta -o matches.tsv
# Run a search in blastx mode
./diamond blastx -d reference -q reads.fasta -o matches.tsv
# Download and use a BLAST database
update_blastdb.pl --decompress --blastdb_version 5 swissprot
./diamond prepdb -d swissprot
./diamond blastp -d swissprot -q queries.fasta -o matches.tsv

Some important points to consider:

Repeat masking is applied to the query and reference sequences by default. To disable it, use --masking 0. Note that masking concerns repeat regions within the sequences, not duplicate hits; to report only the best target per query, set --max-target-seqs 1 instead.
DIAMOND is optimized for large input files of >1 million proteins. Naturally the tool can be used for smaller files as well, but the algorithm will not reach its full efficiency.
The program may use quite a lot of memory and also temporary disk space. Should the program fail due to running out of either one, you need to set a lower value for the block size parameter -b. In short, DIAMOND is at its most efficient on large files, and lowering -b is the remedy when memory or temporary disk space runs out.
The sensitivity can be adjusted using the options --fast, --mid-sensitive, --sensitive, --more-sensitive, --very-sensitive and --ultra-sensitive. Sensitivity rises along this list, bringing the results closer to native BLAST at the cost of speed; --more-sensitive is usually a reasonable middle ground, and --fast suits limited computing resources.

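Full option reference

Below is DIAMOND's fairly detailed help output. Read through it at your leisure; most options rarely need to be set explicitly.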

diamond --help
diamond v2.0.11.149 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

Syntax: diamond COMMAND [OPTIONS]

Commands:
makedb	Build DIAMOND database from a FASTA file  # builds a DIAMOND-format database from a FASTA file
blastp	Align amino acid query sequences against a protein reference database  # same function as native blastp
blastx	Align DNA query sequences against a protein reference database  # same function as native blastx
view	View DIAMOND alignment archive (DAA) formatted file
help	Produce help message
version	Display version information
getseq	Retrieve sequences from a DIAMOND database file
dbinfo	Print information about a DIAMOND database file
test	Run regression tests
makeidx	Make database index

General options:
--threads (-p)           number of CPU threads  # number of threads to run; set it as high as the machine allows
--db (-d)                database file  # the DIAMOND-format database produced by diamond makedb
--out (-o)               output file  # name of the alignment output file
--outfmt (-f)            output format  # usually 6, the tabular format, same as native BLAST
	0   = BLAST pairwise
	5   = BLAST XML
	6   = BLAST tabular
	100 = DIAMOND alignment archive (DAA)
	101 = SAM

	Value 6 may be followed by a space-separated list of these keywords:

	qseqid means Query Seq - id
	qlen means Query sequence length
	sseqid means Subject Seq - id
	sallseqid means All subject Seq - id(s), separated by a ';'
	slen means Subject sequence length
	qstart means Start of alignment in query
	qend means End of alignment in query
	sstart means Start of alignment in subject
	send means End of alignment in subject
	qseq means Aligned part of query sequence
	qseq_translated means Aligned part of query sequence (translated)
	full_qseq means Query sequence
	full_qseq_mate means Query sequence of the mate
	sseq means Aligned part of subject sequence
	full_sseq means Subject sequence
	evalue means Expect value
	bitscore means Bit score
	score means Raw score
	length means Alignment length
	pident means Percentage of identical matches
	nident means Number of identical matches
	mismatch means Number of mismatches
	positive means Number of positive - scoring matches
	gapopen means Number of gap openings
	gaps means Total number of gaps
	ppos means Percentage of positive - scoring matches
	qframe means Query frame
	btop means Blast traceback operations(BTOP)
	cigar means CIGAR string
	staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)
	sscinames means unique Subject Scientific Name(s), separated by a ';'
	sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
	skingdoms means unique Subject Kingdom(s), separated by a ';'
	sphylums means unique Subject Phylum(s), separated by a ';'
	stitle means Subject Title
	salltitles means All Subject Title(s), separated by a '<>'
	qcovhsp means Query Coverage Per HSP
	scovhsp means Subject Coverage Per HSP
	qtitle means Query title
	qqual means Query quality values for the aligned part of the query
	full_qqual means Query quality values
	qstrand means Query strand

	Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
--verbose (-v)           verbose console output
--log                    enable debug log
--quiet                  disable console output
--header                 Write header lines to blast tabular format.

Makedb options:
--in                     input reference file in FASTA format
--taxonmap               protein accession to taxid mapping file
--taxonnodes             taxonomy nodes.dmp from NCBI
--taxonnames             taxonomy names.dmp from NCBI

Aligner options:
--query (-q)             input query file
--strand                 query strands to search (both/minus/plus)
--un                     file for unaligned queries
--al                     file or aligned queries
--unfmt                  format of unaligned query file (fasta/fastq)
--alfmt                  format of aligned query file (fasta/fastq)
--unal                   report unaligned queries (0=no, 1=yes)
--max-target-seqs (-k)   maximum number of target sequences to report alignments for (default=25)
--top                    report alignments within this percentage range of top alignment score (overrides --max-target-seqs)
--max-hsps               maximum number of HSPs per target sequence to report for each query (default=1)
--range-culling          restrict hit culling to overlapping query ranges
--compress               compression for output files (0=none, 1=gzip, zstd)
--evalue (-e)            maximum e-value to report alignments (default=0.001)
--min-score              minimum bit score to report alignments (overrides e-value setting)
--id                     minimum identity% to report an alignment
--query-cover            minimum query cover% to report an alignment
--subject-cover          minimum subject cover% to report an alignment
--fast                   enable fast mode
--mid-sensitive          enable mid-sensitive mode
--sensitive              enable sensitive mode)
--more-sensitive         enable more sensitive mode
--very-sensitive         enable very sensitive mode
--ultra-sensitive        enable ultra sensitive mode
--iterate                iterated search with increasing sensitivity
--global-ranking (-g)    number of targets for global ranking
--block-size (-b)        sequence block size in billions of letters (default=2.0)
--index-chunks (-c)      number of chunks for index processing (default=4)
--tmpdir (-t)            directory for temporary files
--parallel-tmpdir        directory for temporary files used by multiprocessing
--gapopen                gap open penalty
--gapextend              gap extension penalty
--frameshift (-F)        frame shift penalty (default=disabled)
--long-reads             short for --range-culling --top 10 -F 15
--matrix                 score matrix for protein alignment (default=BLOSUM62)
--custom-matrix          file containing custom scoring matrix
--comp-based-stats       composition based statistics mode (0-4)
--masking                enable tantan masking of repeat regions (0/1=default)
--query-gencode          genetic code to use to translate query (see user manual)
--salltitles             include full subject titles in DAA file
--sallseqid              include all subject ids in DAA file
--no-self-hits           suppress reporting of identical self hits
--taxonlist              restrict search to list of taxon ids (comma-separated)
--taxon-exclude          exclude list of taxon ids (comma-separated)
--seqidlist              filter the database by list of accessions
--skip-missing-seqids    ignore accessions missing in the database

Advanced options:
--algo                   Seed search algorithm (0=double-indexed/1=query-indexed/ctg=contiguous-seed)
--bin                    number of query bins for seed search
--min-orf (-l)           ignore translated sequences without an open reading frame of at least this length
--freq-sd                number of standard deviations for ignoring frequent seeds
--id2                    minimum number of identities for stage 1 hit
--xdrop (-x)             xdrop for ungapped alignment
--gapped-filter-evalue   E-value threshold for gapped filter (auto)
--band                   band for dynamic programming computation
--shapes (-s)            number of seed shapes (default=all available)
--shape-mask             seed shapes
--multiprocessing        enable distributed-memory parallel processing
--mp-init                initialize multiprocessing run
--mp-recover             enable continuation of interrupted multiprocessing run
--mp-query-chunk         process only a single query chunk as specified
--ext-chunk-size         chunk size for adaptive ranking (default=auto)
--no-ranking             disable ranking heuristic
--ext                    Extension mode (banded-fast/banded-slow/full)
--culling-overlap        minimum range overlap with higher scoring hit to delete a hit (default=50%)
--taxon-k                maximum number of targets to report per species
--range-cover            percentage of query range to be covered for range culling (default=50%)
--dbsize                 effective database size (in letters)
--no-auto-append         disable auto appending of DAA and DMND file extensions
--xml-blord-format       Use gnl|BL_ORD_ID| style format in XML output
--stop-match-score       Set the match score of stop codons against each other.
--tantan-minMaskProb     minimum repeat probability for masking (default=0.9)
--file-buffer-size       file buffer size in bytes (default=67108864)
--memory-limit (-M)      Memory limit for extension stage in GB
--no-unlink              Do not unlink temporary files.
--target-indexed         Enable target-indexed mode
--ignore-warnings        Ignore warnings

View options:
--daa (-a)               DIAMOND alignment archive (DAA) file
--forwardonly            only show alignments of forward strand

Getseq options:
--seq                    Sequence numbers to display.

Online documentation at http://www.diamondsearch.org

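Help output of the newer version:

The newer help is more concise: options that belong to a specific command are no longer shown at the top level, which avoids confusion.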

diamond --help
diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

Syntax: diamond COMMAND [OPTIONS]

Commands:
makedb                   Build DIAMOND database from a FASTA file
prepdb                   Prepare BLAST database for use with Diamond
blastp                   Align amino acid query sequences against a protein reference database
blastx                   Align DNA query sequences against a protein reference database
cluster                  Cluster protein sequences
linclust                 Cluster protein sequences in linear time
realign                  Realign clustered sequences against their centroids
recluster                Recompute clustering to fix errors
reassign                 Reassign clustered sequences to the closest centroid
view                     View DIAMOND alignment archive (DAA) formatted file
merge-daa                Merge DAA files
help                     Produce help message
version                  Display version information
getseq                   Retrieve sequences from a DIAMOND database file
dbinfo                   Print information about a DIAMOND database file
test                     Run regression tests
makeidx                  Make database index
greedy-vertex-cover      Compute greedy vertex cover

Possible [OPTIONS] for COMMAND can be seen with syntax: diamond COMMAND

Online documentation at http://www.diamondsearch.org

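To see the options of a specific command, run diamond followed by that command name and press Enter. The details are easiest to check on your own system: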

diamond makedb
diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

Options:
--threads                number of CPU threads
--verbose                verbose console output
--log                    enable debug log
--quiet                  disable console output
--tmpdir                 directory for temporary files
--db                     database file
--in                     input reference file in FASTA format/input DAA files for merge-daa
--taxonmap               protein accession to taxid mapping file
--taxonnodes             taxonomy nodes.dmp from NCBI
--taxonnames             taxonomy names.dmp from NCBI
--file-buffer-size       file buffer size in bytes (default=67108864)
--no-unlink              Do not unlink temporary files.
--ignore-warnings        Ignore warnings
--no-parse-seqids        Print raw seqids without parsing

Error: Missing parameter: database file (--db/-d)

6. References

Benjamin Buchfink, Chao Xie, and Daniel H. Huson. Fast and sensitive protein alignment using diamond. Nature methods, 12(1):59–60, Jan 2015.
