Using the EggNOG Functional Annotation Database Online and Locally

2021-02-15 Metagenomics

Introduction to COG

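COG (Clusters of Orthologous Groups of proteins): the proteins that make up each COG are assumed to derive from a single ancestral protein, and are therefore either orthologs or paralogs.
COGs are defined by comparing the proteins encoded in all complete genomes against one another, one by one. For a protein from a given genome, this comparison yields the most similar protein in each of the other genomes (which is why complete genomes are needed to define COGs), and each of those genes is then considered in turn. If a reciprocal best-match relationship is found among these proteins (or a subset of them), those reciprocal best matches form a COG. As a result, each member of a COG is more similar to the other members of that COG than to any other protein in the compared genomes.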
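The reciprocal best-match idea can be sketched in a few lines of Python. This is only an illustration of the principle, not the actual COG pipeline: the protein names, the genome labels and the shape of the `best_hit` table are made up for the example, and real COG construction involves further steps beyond pairwise matching.

```python
def reciprocal_best_hits(best_hit, genome_of):
    """Return sorted pairs of proteins that are each other's best match.

    best_hit[p][g] is the protein in genome g most similar to protein p
    (hypothetical input format); genome_of[p] is the genome that encodes p.
    """
    pairs = set()
    for query, hits in best_hit.items():
        for target in hits.values():
            # Look up the target's best match back in the query's genome.
            back = best_hit.get(target, {}).get(genome_of[query])
            if back == query:
                pairs.add(tuple(sorted((query, target))))
    return sorted(pairs)


def group_into_clusters(pairs):
    """Merge reciprocal pairs into connected groups using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(group) for group in groups.values())


# Toy data: A2 is a paralog of A1 that is nobody's best match.
genome_of = {"A1": "gA", "A2": "gA", "B1": "gB", "C1": "gC"}
best_hit = {
    "A1": {"gB": "B1", "gC": "C1"},
    "A2": {"gB": "B1", "gC": "C1"},
    "B1": {"gA": "A1", "gC": "C1"},
    "C1": {"gA": "A1", "gB": "B1"},
}
clusters = group_into_clusters(reciprocal_best_hits(best_hit, genome_of))
print(clusters)
```

In this toy input A1, B1 and C1 end up in one group, while A2 is left out because no protein picks it as its best match.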

Home page: https://www.ncbi.nlm.nih.gov/COG/

COG one-letter codes; for details see http://www.sbg.bio.ic.ac.uk/~phunkee/html/old/COG_classes.html

COG one letter code descriptions

INFORMATION STORAGE AND PROCESSING

[J] Translation, ribosomal structure and biogenesis

[A] RNA processing and modification

[K] Transcription

[L] Replication, recombination and repair

[B] Chromatin structure and dynamics

CELLULAR PROCESSES AND SIGNALING

[D] Cell cycle control, cell division, chromosome partitioning

[Y] Nuclear structure

[V] Defense mechanisms

[T] Signal transduction mechanisms

[M] Cell wall/membrane/envelope biogenesis

[N] Cell motility

[Z] Cytoskeleton

[W] Extracellular structures

[U] Intracellular trafficking, secretion, and vesicular transport

[O] Posttranslational modification, protein turnover, chaperones

METABOLISM

[C] Energy production and conversion

[G] Carbohydrate transport and metabolism

[E] Amino acid transport and metabolism

[F] Nucleotide transport and metabolism

[H] Coenzyme transport and metabolism

[I] Lipid transport and metabolism

[P] Inorganic ion transport and metabolism

[Q] Secondary metabolites biosynthesis, transport and catabolism

POORLY CHARACTERIZED

[R] General function prediction only

[S] Function unknown
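When working with annotation tables it is handy to turn these letters back into descriptions. A minimal sketch follows; the helper name `describe_cog` is illustrative, and the R and S entries are taken from the standard COG scheme for the POORLY CHARACTERIZED group. A gene may carry several letters at once (for example `KT`), so the helper decodes each letter separately.

```python
COG_CATEGORIES = {
    # Information storage and processing
    "J": "Translation, ribosomal structure and biogenesis",
    "A": "RNA processing and modification",
    "K": "Transcription",
    "L": "Replication, recombination and repair",
    "B": "Chromatin structure and dynamics",
    # Cellular processes and signaling
    "D": "Cell cycle control, cell division, chromosome partitioning",
    "Y": "Nuclear structure",
    "V": "Defense mechanisms",
    "T": "Signal transduction mechanisms",
    "M": "Cell wall/membrane/envelope biogenesis",
    "N": "Cell motility",
    "Z": "Cytoskeleton",
    "W": "Extracellular structures",
    "U": "Intracellular trafficking, secretion, and vesicular transport",
    "O": "Posttranslational modification, protein turnover, chaperones",
    # Metabolism
    "C": "Energy production and conversion",
    "G": "Carbohydrate transport and metabolism",
    "E": "Amino acid transport and metabolism",
    "F": "Nucleotide transport and metabolism",
    "H": "Coenzyme transport and metabolism",
    "I": "Lipid transport and metabolism",
    "P": "Inorganic ion transport and metabolism",
    "Q": "Secondary metabolites biosynthesis, transport and catabolism",
    # Poorly characterized (standard COG scheme)
    "R": "General function prediction only",
    "S": "Function unknown",
}


def describe_cog(letters):
    """Decode a string of COG letters into a list of descriptions."""
    return [COG_CATEGORIES.get(letter, "Unknown category") for letter in letters]


print(describe_cog("KT"))
```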

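Introduction to eggNOG

Principles and interpretation of eggNOG annotation

Unknown sequences are functionally annotated using known proteins;

From the number of proteins assigned to a given eggNOG ID, and from its presence or absence, one can infer whether a particular metabolic pathway is present;

Each eggNOG ID represents one class of proteins. A multiple sequence alignment of the query sequence with the proteins of the matched eggNOG ID can identify conserved sites and support analysis of their evolutionary relationships.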
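The presence/absence reasoning in the second point can be illustrated with a small sketch. The IDs below are placeholders rather than a real pathway definition, and the function name is hypothetical; which orthologous groups make up a pathway has to come from an external resource.

```python
def pathway_completeness(observed_ids, pathway_ids):
    """Return (fraction of pathway IDs observed, sorted list of missing IDs)."""
    required = set(pathway_ids)
    if not required:
        raise ValueError("pathway_ids must not be empty")
    present = required & set(observed_ids)
    missing = sorted(required - present)
    return len(present) / len(required), missing


# Placeholder IDs, for illustration only.
observed = {"COG0001", "COG0002", "COG9999"}
pathway = ["COG0001", "COG0002", "COG0003", "COG0004"]
fraction, missing = pathway_completeness(observed, pathway)
print(fraction, missing)
```

A low fraction only suggests that the pathway may be incomplete; missing entries can also reflect unannotated or unassembled genes.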

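eggNOG-mapper online version

eggNOG-mapper is the dedicated tool for searching sequences against the eggNOG database and annotating them.

Online analysis with eggNOG-mapper takes just three steps of mouse clicks.

1. Open the online tool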

http://eggnogdb.embl.de/#/app/emapper

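2. Set the parameters

The main settings are the protein sequence file and an email address. The other options can usually stay at their defaults.

Note the choice of method: diamond is relatively slow for a small number of sequences but relatively fast for many. HMMER has a higher success rate for distantly related sequences, but takes a long time on large datasets, and the online version accepts at most 5000 sequences per job.

3. Submit the job

Click the Run button to submit the job. A confirmation window then appears.

Next comes a page with the job status and a list of citations. Note that online analysis is subject both to the sequence limit and to a queue; when many people are using the service, the wait can be long.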
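Before uploading, it can help to check whether a file stays within the 5000-sequence limit mentioned above. A minimal sketch, assuming a plain FASTA file in which every record starts with `>`; the function names are illustrative and not part of eggNOG-mapper.

```python
import io

ONLINE_LIMIT = 5000  # per-job limit of the online HMMER mode, as stated above


def count_fasta_records(handle):
    """Count FASTA records by counting header lines."""
    return sum(1 for line in handle if line.startswith(">"))


def fits_online_limit(handle, limit=ONLINE_LIMIT):
    """Return True if the FASTA stream holds at most `limit` records."""
    return count_fasta_records(handle) <= limit


example = io.StringIO(">p1\nMKV\nLLA\n>p2\nMST\n")
print(fits_online_limit(example))
```

For a file on disk, pass an open file object instead of the `io.StringIO` example.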

eggNOG-mapper local version

Installing with conda is recommended, since it takes care of dependencies and environment variables

conda install eggnog-mapper

Manual download and installation

cd ~/software
wget https://github.com/jhcepas/eggnog-mapper/archive/1.0.3.tar.gz
tar xvzf 1.0.3.tar.gz
cd eggnog-mapper-1.0.3

About the software

less README.md

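eggNOG-mapper uses the eggNOG database to functionally annotate novel gene and protein sequences. It is commonly applied to the gene sets of new genomes, transcriptomes and metagenomes. Function prediction based on orthology is considered more accurate than a traditional homology search, because it avoids transferring annotations directly from paralogs (gene duplication carries a high chance of functional divergence).

Documentation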

https://github.com/jhcepas/eggnog-mapper/wiki

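Installation notes

Software dependencies: python2.7, wget, hmmer3, diamond.

Disk space requirements:

Memory requirements:

HMMER3 annotation is very fast when plenty of memory is available. The memory needed is:

Eukaryote database euk: ~90GB

Bacteria database bact: ~32GB

Archaea database arch: ~10GB

Software installation

Besides the conda and wget installation shown above, git is also an option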

git clone https://github.com/jhcepas/eggnog-mapper.git

Database download

Show the program help

python eggnog-mapper/download_eggnog_data.py -h

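Download the four commonly used databases and save them in the eggnog directory.
--data_dir sets the download directory, -y agrees automatically, and -f forces the download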

mkdir -p eggnog
python eggnog-mapper/download_eggnog_data.py --data_dir eggnog -y -f euk bact arch viruses

Basic usage

cd eggnog-mapper

HMMER method

Search the bacterial database locally
Disk based searches on the optimized bacterial database
-i input file, --output output file prefix, -d database to search, --data_dir database location

python emapper.py -i test/polb.fa --output polb_bact -d bact --data_dir ~/data/db/eggnog

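diamond method

-m selects the diamond method; the default is hmmer. diamond only shows its speed advantage with more than about a thousand sequences, and feels very slow with just a few. Its results are also less accurate than those of hmmer, especially when annotating distant homologs.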

python emapper.py -i test/polb.fa --output diamond_bact_ -d bact --data_dir ~/data/db/eggnog -m diamond

This takes a long time, more than an hour

Interpreting the results

https://github.com/jhcepas/eggnog-mapper/wiki/Results-Interpretation

There are three result files

polb_bact.emapper.annotations
polb_bact.emapper.hmm_hits
polb_bact.emapper.seed_orthologs

Focus mainly on the annotations file, which contains the GO, KEGG and COG descriptions for each gene

[project_name].emapper.hmm_hits file: list of HMM search hits

For each query sequence, a list of significant hits to eggNOG Orthologous Groups (OGs) is reported. Each line in the file represents a hit, where evalue, bit-score, query-coverage and the sequence coordinates of the match are reported. If multiple hits exist for a given query, results are sorted by e-value.

[project_name].emapper.seed_orthologs file: list of best matches

Each line in the file provides the best match of each query within the best Orthologous Group (OG) reported in the [project].hmm_hits file, obtained running PHMMER against all sequences within the best OG. The seed ortholog is used to fetch fine-grained orthology relationships from eggNOG. If using the diamond search mode, seed orthologs are directly obtained from the best matching sequences by running DIAMOND against the whole eggNOG protein space.

[project_name].emapper.annotations file: the consolidated results, and the file that matters most.
This file provides final annotations of each query. It is tab-delimited with 13 columns:

query_name: query sequence name

seed_eggNOG_ortholog: best protein match in eggNOG

seed_ortholog_evalue: best protein match (e-value)

seed_ortholog_score: best protein match (bit-score)

predicted_gene_name: Predicted gene name for query sequences

GO_terms: Comma delimited list of predicted Gene Ontology terms

KEGG_KO: Comma delimited list of predicted KEGG KOs

BiGG_Reactions: Comma delimited list of predicted BiGG metabolic reactions

Annotation_tax_scope: The taxonomic scope used to annotate this query sequence

Matching_OGs: Comma delimited list of matching eggNOG Orthologous Groups

best_OG|evalue|score: Best matching Orthologous Groups (only in HMM mode)

COG functional categories: COG functional category inferred from best matching OG

eggNOG_HMM_model_annotation: eggNOG functional description inferred from best matching OG
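The 13-column layout above can be read with a short parser. The sketch below is not part of eggNOG-mapper: the column keys simply mirror the names listed above (with spaces replaced by underscores), the sample line is invented for illustration, and it assumes that header or comment lines start with `#`.

```python
import io

COLUMNS = [
    "query_name", "seed_eggNOG_ortholog", "seed_ortholog_evalue",
    "seed_ortholog_score", "predicted_gene_name", "GO_terms", "KEGG_KO",
    "BiGG_Reactions", "Annotation_tax_scope", "Matching_OGs",
    "best_OG|evalue|score", "COG_functional_categories",
    "eggNOG_HMM_model_annotation",
]
LIST_COLUMNS = ("GO_terms", "KEGG_KO", "BiGG_Reactions", "Matching_OGs")


def read_annotations(handle):
    """Yield one dict per query from a tab-delimited annotations stream."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.rstrip("\n").split("\t")
        record = dict(zip(COLUMNS, fields))
        for key in LIST_COLUMNS:
            # Comma-delimited columns become Python lists; empty stays empty.
            record[key] = [v for v in record.get(key, "").split(",") if v]
        yield record


# Invented example line, for illustration only.
sample = io.StringIO(
    "# a comment line\n"
    "q1\t0000.protA\t1e-50\t200.5\tpolB\tGO:0003677,GO:0006260\tK02336\t\t"
    "bactNOG\tOG1@bactNOG,OG2@NOG\tNA|NA|NA\tL\tDNA polymerase\n"
)
for record in read_annotations(sample):
    print(record["query_name"], record["COG_functional_categories"], record["GO_terms"])
```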

Advanced usage

https://github.com/jhcepas/eggnog-mapper/wiki/Advanced-usage-and-tips

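Speeding up with large memory and multiple threads

--usemem loads the whole database into memory so that searches run against preloaded data, --cpu sets the number of threads, and --override forces existing results to be overwritten; without it the run aborts when result files already exist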

python emapper.py -i test/polb.fa --output polb_bact --database bact --data_dir ~/data/db/eggnog --usemem --cpu 10 --override
# Total time: 11.8659 secs

Server shared-memory mode

First load the bacterial database. Only one class of database can be specified

python emapper.py --database bact --data_dir ~/data/db/eggnog --cpu 10 --servermode

Loading the data takes some time

Waiting for server to become ready... localhost 51500

Wait until it shows:

Server ready listening at localhost:51500 and using 10 CPU cores
Use `emapper.py -d bact:localhost:51500 (...)` to search against this server

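Then start the analysis command

--usemem loads the whole database into memory so that searches run against preloaded data, and --cpu sets the number of threads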

python emapper.py -i test/polb.fa --output polb_bact --database bact:localhost:51500 --data_dir ~/data/db/eggnog --usemem --cpu 10 --override
# Total time: 9.77332 secs

Large-scale mode for metagenomes

https://github.com/jhcepas/eggnog-mapper/wiki/Setting-up-large-scale-analyses

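Annotation of large genomes and metagenomic datasets (>100M proteins).

The analysis has two main steps: homology search, which is compute-intensive, and functional annotation, which is I/O-intensive. Splitting the data improves efficiency.

Homology search

1. Split the sequences

Prepare the file and convert it to single-line FASTA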

cp /mnt/bai/yongxin/test/meta1809/temp/23prokka_all/mg.faa input_file0.faa
format_fasta_1line.pl -i input_file0.faa -o input_file.faa

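Split into files of 2 million lines each, that is, 1 million sequences. For this test, 10000 lines (5000 sequences) are used.

# -l splits by line count; -a sets the suffix width to 3 (default 2); -d uses numeric suffixes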
split -l 10000 -a 3 -d input_file.faa input_file.chunk_
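Splitting by line count only works because the FASTA was first converted to one line per sequence. As an alternative sketch, the Python below splits by record count and therefore also handles wrapped sequences; the function names are illustrative, and the numbered three-digit suffix merely imitates the `split -a 3 -d` naming used above.

```python
def split_fasta(handle, records_per_chunk):
    """Yield lists of lines, each holding at most records_per_chunk records."""
    chunk, count = [], 0
    for line in handle:
        if line.startswith(">"):
            if count == records_per_chunk:
                yield chunk  # current chunk is full, start a new one
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk


def write_chunks(fasta_path, prefix, records_per_chunk):
    """Write chunks to prefix000, prefix001, ... and return the paths."""
    paths = []
    with open(fasta_path) as handle:
        for index, lines in enumerate(split_fasta(handle, records_per_chunk)):
            path = "%s%03d" % (prefix, index)
            with open(path, "w") as out:
                out.writelines(lines)
            paths.append(path)
    return paths
```

For example, `write_chunks("input_file.faa", "input_file.chunk_", 5000)` would produce files named like those in the `split` command above.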

2. Run the searches in parallel

Method 1. Generate commands for a cluster

for f in *.chunk_*; do
echo ./emapper.py -m diamond --no_annot --no_file_comments --cpu 16 -i $f -o $f;
done

Method 2. Run in parallel locally

time parallel -j 3 --xapply \
'python emapper.py -m diamond --no_annot --no_file_comments --data_dir ~/data/db/eggnog --cpu 16 -i {1} -o {1}' \
::: input_file.chunk*

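Elapsed time: real 14m45.579s

Functional annotation

This step is disk-intensive; it is recommended to store eggnog.db on an SSD or in the /dev/shm in-memory directory

3. Merge the search results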

cat *.chunk_*.emapper.seed_orthologs > input_file.emapper.seed_orthologs
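The same merge can be done in Python when a little more control is wanted. This sketch (the function name is hypothetical) concatenates the chunk outputs in sorted order, drops blank lines and lines starting with `#`, and makes sure every record ends with a newline so that the last line of one chunk cannot fuse with the first line of the next.

```python
import glob
import os


def merge_seed_orthologs(pattern, merged_path):
    """Concatenate all files matching pattern into merged_path.

    Returns the list of input paths that were merged, in sorted order.
    """
    target = os.path.abspath(merged_path)
    # Collect inputs first, and never read the output file itself.
    paths = sorted(p for p in glob.glob(pattern) if os.path.abspath(p) != target)
    with open(merged_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    if line.startswith("#") or not line.strip():
                        continue
                    out.write(line if line.endswith("\n") else line + "\n")
    return paths
```

A call such as `merge_seed_orthologs("*.chunk_*.emapper.seed_orthologs", "input_file.emapper.seed_orthologs")` mirrors the `cat` command above.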

4. Annotate

To speed things up, copy the database into memory (about 21 s)

cp ~/data/db/eggnog/eggnog.db /dev/shm

time emapper.py --annotate_hits_table input_file.emapper.seed_orthologs --no_file_comments -o output_file --cpu 20 --data_dir /dev/shm --override

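With the database in memory, processing 10,000 sequences takes about 15 s

We now have the annotation list for all genes. Combined with a gene abundance matrix, it can be used for all kinds of summaries, differential comparisons and functional descriptions.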
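As one possible way to combine the annotations with an abundance matrix, the sketch below sums gene abundances per COG category and per sample. The input structures and names are assumptions made for the example: `gene_to_cog` maps a gene to its COG letters, `gene_abundance` maps a gene to per-sample values, a gene with several letters (such as `KT`) is counted once in each category, and genes without a category are collected under `unannotated`.

```python
from collections import defaultdict


def abundance_by_cog_category(gene_to_cog, gene_abundance):
    """Sum per-sample gene abundances for each COG category letter."""
    totals = defaultdict(lambda: defaultdict(float))
    for gene, samples in gene_abundance.items():
        letters = gene_to_cog.get(gene, "")
        categories = list(letters) if letters else ["unannotated"]
        for category in categories:
            for sample, value in samples.items():
                totals[category][sample] += value
    return {category: dict(samples) for category, samples in totals.items()}


# Toy data, for illustration only.
gene_to_cog = {"g1": "L", "g2": "KT"}
gene_abundance = {
    "g1": {"s1": 2.0, "s2": 1.0},
    "g2": {"s1": 3.0, "s2": 0.0},
    "g3": {"s1": 1.0, "s2": 4.0},
}
print(abundance_by_cog_category(gene_to_cog, gene_abundance))
```

Counting multi-letter genes in every category inflates column totals slightly; splitting the abundance evenly across letters is an equally defensible choice.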

Appendix 1. emapper.py parameters in detail

python emapper.py -h
usage: emapper.py [-h] [--guessdb] [--database] [--dbtype {hmmdb,seqdb}]
[--data_dir] [--qtype {hmm,seq}] [--tax_scope]
[--target_orthologs {one2one,many2one,one2many,many2many,all}]
[--excluded_taxa]
[--go_evidence {experimental,non-electronic}]
[--hmm_maxhits] [--hmm_evalue] [--hmm_score]
[--hmm_maxseqlen] [--hmm_qcov] [--Z] [--dmnd_db DMND_DB]
[--matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}]
[--gapopen GAPOPEN] [--gapextend GAPEXTEND]
[--seed_ortholog_evalue] [--seed_ortholog_score] [--output]
[--resume] [--override] [--no_refine] [--no_annot]
[--no_search] [--report_orthologs] [--scratch_dir]
[--output_dir] [--temp_dir] [--no_file_comments]
[--keep_mapping_files] [-m {hmmer,diamond}] [-i]
[--translate] [--servermode] [--usemem] [--cpu]
[--annotate_hits_table] [--version]

optional arguments:
-h, --help show this help message and exit
--version show the version number

Target HMM Database Options:
--guessdb guess eggnog db based on the provided taxid
--database , -d specify the target database for sequence searches (only one class of database can be given). Choose among: euk,bact,arch, host:port, or a local hmmpressed database
--dbtype {hmmdb,seqdb} database type
--data_dir Directory to use for DATA_PATH.
--qtype {hmm,seq} method selection: hmm for few sequences, seq for many

Annotation Options:
--tax_scope Fix the taxonomic scope used for annotation, so only orthologs from a particular clade are used for functional transfer. By default, this is automatically adjusted for every query sequence.
--target_orthologs {one2one,many2one,one2many,many2many,all}
defines what type of orthologs should be used for functional transfer
--excluded_taxa (for debugging and benchmark purposes)
--go_evidence {experimental,non-electronic}
Defines what type of GO terms should be used for
annotation: experimental = Use only terms inferred from
experimental evidence; non-electronic = Use only non-
electronically curated terms

HMM search_options:
--hmm_maxhits Max number of hits to report. Default=1
--hmm_evalue E-value threshold. Default=0.001
--hmm_score Bit score threshold. Default=20
--hmm_maxseqlen Ignore query sequences larger than `maxseqlen`.
Default=5000
--hmm_qcov min query coverage (from 0 to 1). Default=(disabled)
--Z Fixed database size used in phmmer/hmmscan (allows
comparing e-values among databases).
Default=40,000,000

diamond search_options:
--dmnd_db DMND_DB Path to DIAMOND-compatible database
--matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}
Scoring matrix
--gapopen GAPOPEN Gap open penalty
--gapextend GAPEXTEND
Gap extend penalty

Seed ortholog search option:
--seed_ortholog_evalue
Min E-value expected when searching for seed eggNOG
ortholog. Applies to phmmer/diamond searches. Queries
not having a significant seed orthologs will not be
annotated. Default=0.001
--seed_ortholog_score
Min bit score expected when searching for seed eggNOG
ortholog. Applies to phmmer/diamond searches. Queries
not having a significant seed orthologs will not be
annotated. Default=60

Output options:
--output , -o base name for output files
--resume Resumes a previous execution skipping reported hits in
the output file.
--override Overwrites output files if they exist.
--no_refine Skip hit refinement, reporting only HMM hits.
--no_annot Skip functional annotation, reporting only hits
--no_search Skip HMM search mapping. Use existing hits file
--report_orthologs The list of orthologs used for functional transferred
are dumped into a separate file
--scratch_dir Write output files in a temporary scratch dir, move
them to final the final output dir when finished.
Speed up large computations using network file
systems.
--output_dir Where output files should be written
--temp_dir Where temporary files are created. Better if this is a
local disk.
--no_file_comments No header lines nor stats are included in the output
files
--keep_mapping_files Do not delete temporary mapping files used for
annotation (i.e. HMMER and DIAMOND search outputs)

Execution options:
-m {hmmer,diamond} search mode. Default:hmmer
-i Input FASTA file containing query sequences
--translate Assume sequences are genes instead of proteins (nucleotide input is translated to protein)
--servermode Loads target database in memory and keeps running in
server mode, so another instance of eggnog-mapper can
connect to this sever. Auto turns on the --usemem flag
--usemem If a local hmmpressed database is provided as target
using --db, this flag will allocate the whole database
in memory using hmmpgmd. Database will be unloaded
after execution.
--cpu number of threads
--annotate_hits_table
Annotatate TSV formatted table of query->hits. 4
fields required: query, hit, evalue, score. Implies
--no_search and --no_refine.
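The help text above says that `--annotate_hits_table` expects four fields (query, hit, evalue, score) and lists seed ortholog defaults of 0.001 for the e-value and 60 for the bit score. The sketch below pre-filters such a table in Python. It is not eggNOG-mapper code: it assumes the thresholds act as "e-value at most" and "score at least", and the sample rows are invented.

```python
def filter_hits(lines, max_evalue=0.001, min_score=60.0):
    """Keep (query, hit, evalue, score) rows that pass both thresholds."""
    kept = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        query, hit, evalue, score = line.rstrip("\n").split("\t")[:4]
        evalue, score = float(evalue), float(score)
        if evalue <= max_evalue and score >= min_score:
            kept.append((query, hit, evalue, score))
    return kept


# Invented rows, for illustration only.
rows = [
    "q1\tp1\t1e-30\t150.2\n",
    "q2\tp2\t0.5\t25.0\n",
    "# comment\n",
    "q3\tp3\t1e-5\t59.9\n",
]
print(filter_hits(rows))
```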

Reference

https://github.com/jhcepas/eggnog-mapper/wiki

[1] Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Jaime Huerta-Cepas, Kristoffer Forslund, Luis Pedro Coelho, Damian Szklarczyk, Lars Juhl Jensen, Christian von Mering and Peer Bork. Mol Biol Evol (2017). doi:10.1093/molbev/msx148

[2] eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Jaime Huerta-Cepas, Damian Szklarczyk, Kristoffer Forslund, Helen Cook, Davide Heller, Mathias C. Walter, Thomas Rattei, Daniel R. Mende, Shinichi Sunagawa, Michael Kuhn, Lars Juhl Jensen, Christian von Mering and Peer Bork. Nucl. Acids Res. (04 January 2016) 44 (D1): D286-D293. doi:10.1093/nar/gkv1248
