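COG (Clusters of Orthologous Groups of proteins)
The proteins that make up each COG are assumed to derive from a single ancestral protein, and are therefore either orthologs or paralogs.
COGs are identified by comparing the proteins encoded in all complete genomes against each other, one by one. For a protein from a given genome, this comparison yields the most similar protein in each of the other genomes (which is why complete genomes are needed to define COGs), and each of those proteins is then considered in turn. If a reciprocal best-hit relationship is found among these proteins (or a subset of them), the reciprocal best hits form a COG. As a result, each member of a COG is more similar to the other members of that COG than to any other protein in the compared genomes.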
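The reciprocal best-hit idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual COG construction procedure: it assumes protein names of the form `genome|protein`, takes a precomputed table of best hits (the `EXAMPLE_BEST_HITS` data is made up), and simply merges all reciprocal pairs into clusters.

```python
from collections import defaultdict

def genome_of(protein):
    # Assumption of this sketch: names look like "genome|protein".
    return protein.split("|", 1)[0]

def reciprocal_pairs(best_hit):
    """best_hit maps protein -> {other_genome: best-matching protein there}."""
    pairs = set()
    for query, hits in best_hit.items():
        for target in hits.values():
            # The pair is reciprocal if the target's best hit in the
            # query's genome is the query itself.
            if best_hit.get(target, {}).get(genome_of(query)) == query:
                pairs.add(frozenset((query, target)))
    return pairs

def cluster(pairs):
    """Merge reciprocal pairs into groups with a small union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pair in pairs:
        a, b = tuple(pair)
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for protein in parent:
        groups[find(protein)].add(protein)
    return sorted(sorted(group) for group in groups.values())

# Made-up example: a1, b1 and c1 are mutual best hits; a2 only hits b1 one way.
EXAMPLE_BEST_HITS = {
    "A|a1": {"B": "B|b1", "C": "C|c1"},
    "B|b1": {"A": "A|a1", "C": "C|c1"},
    "C|c1": {"A": "A|a1", "B": "B|b1"},
    "A|a2": {"B": "B|b1"},
}
clusters = cluster(reciprocal_pairs(EXAMPLE_BEST_HITS))
```

In this toy data `A|a2` is left out of the cluster because `B|b1` does not point back to it.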
Home page: https://www.ncbi.nlm.nih.gov/COG/
COG one-letter codes; for details see http://www.sbg.bio.ic.ac.uk/~phunkee/html/old/COG_classes.html
COG one letter code descriptions
INFORMATION STORAGE AND PROCESSING
[J] Translation, ribosomal structure and biogenesis
[A] RNA processing and modification
[K] Transcription
[L] Replication, recombination and repair
[B] Chromatin structure and dynamics
CELLULAR PROCESSES AND SIGNALING
[D] Cell cycle control, cell division, chromosome partitioning
[Y] Nuclear structure
[V] Defense mechanisms
[T] Signal transduction mechanisms
[M] Cell wall/membrane/envelope biogenesis
[N] Cell motility
[Z] Cytoskeleton
[W] Extracellular structures
[U] Intracellular trafficking, secretion, and vesicular transport
[O] Posttranslational modification, protein turnover, chaperones
METABOLISM
[C] Energy production and conversion
[G] Carbohydrate transport and metabolism
[E] Amino acid transport and metabolism
[F] Nucleotide transport and metabolism
[H] Coenzyme transport and metabolism
[I] Lipid transport and metabolism
[P] Inorganic ion transport and metabolism
[Q] Secondary metabolites biosynthesis, transport and catabolism
POORLY CHARACTERIZED
[R] General function prediction only
[S] Function unknown
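For scripting it can be handy to have the table above as a lookup. The following is a minimal sketch; the names `COG_GROUPS`, `LETTER_INFO` and `describe` are my own, and the R and S entries are the two standard members of the "poorly characterized" group in the COG scheme.

```python
# One-letter COG codes grouped by their top-level class.
COG_GROUPS = {
    "INFORMATION STORAGE AND PROCESSING": {
        "J": "Translation, ribosomal structure and biogenesis",
        "A": "RNA processing and modification",
        "K": "Transcription",
        "L": "Replication, recombination and repair",
        "B": "Chromatin structure and dynamics",
    },
    "CELLULAR PROCESSES AND SIGNALING": {
        "D": "Cell cycle control, cell division, chromosome partitioning",
        "Y": "Nuclear structure",
        "V": "Defense mechanisms",
        "T": "Signal transduction mechanisms",
        "M": "Cell wall/membrane/envelope biogenesis",
        "N": "Cell motility",
        "Z": "Cytoskeleton",
        "W": "Extracellular structures",
        "U": "Intracellular trafficking, secretion, and vesicular transport",
        "O": "Posttranslational modification, protein turnover, chaperones",
    },
    "METABOLISM": {
        "C": "Energy production and conversion",
        "G": "Carbohydrate transport and metabolism",
        "E": "Amino acid transport and metabolism",
        "F": "Nucleotide transport and metabolism",
        "H": "Coenzyme transport and metabolism",
        "I": "Lipid transport and metabolism",
        "P": "Inorganic ion transport and metabolism",
        "Q": "Secondary metabolites biosynthesis, transport and catabolism",
    },
    "POORLY CHARACTERIZED": {
        "R": "General function prediction only",
        "S": "Function unknown",
    },
}

# Flatten to letter -> (group, description) for quick lookups.
LETTER_INFO = {
    letter: (group, description)
    for group, categories in COG_GROUPS.items()
    for letter, description in categories.items()
}

def describe(categories):
    """Expand a category string such as 'KL' into (letter, group, description)."""
    return [
        (letter,) + LETTER_INFO.get(letter, ("UNKNOWN", "unrecognised letter"))
        for letter in categories
        if letter.isalpha()
    ]
```

A gene can carry more than one letter, so `describe("KL")` returns one tuple for K and one for L.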
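Introduction to eggNOG
Principles and interpretation of eggNOG annotation
Unknown sequences are functionally annotated using known proteins;
By checking how many proteins map to a given eggNOG ID, and whether they are present or absent, one can infer whether a particular metabolic pathway exists;
Each eggNOG ID represents one class of proteins. A multiple sequence alignment of the query with the proteins of the matched eggNOG ID identifies conserved sites and allows the evolutionary relationships to be analysed.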
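The presence/absence reasoning above can be illustrated with a small sketch. It assumes you already have the set of eggNOG IDs found in your gene set and a list of IDs that a pathway requires; the function name and the `OG_*` identifiers in the usage note are hypothetical placeholders, not real eggNOG IDs.

```python
def pathway_completeness(annotated_ids, required_ids):
    """Return (fraction of required IDs that were found, sorted missing IDs)."""
    required = set(required_ids)
    present = required & set(annotated_ids)
    missing = sorted(required - present)
    fraction = len(present) / len(required) if required else 0.0
    return fraction, missing
```

For example, `pathway_completeness({"OG_A", "OG_B", "OG_X"}, ["OG_A", "OG_B", "OG_C", "OG_D"])` reports that half of the required groups are present and lists `OG_C` and `OG_D` as missing. Whether a partially covered pathway counts as "present" is a judgment call this sketch does not make.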
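eggNOG-mapper online version
eggNOG-mapper is the dedicated tool for searching against, and annotating with, the eggNOG database.
Online analysis with eggNOG-mapper takes just three steps.
1. Open the online tool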
http://eggnogdb.embl.de/#/app/emapper
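2. Set the parameters
The main things to do are to choose the protein sequence file and enter an email address. The other settings can usually be left at their defaults.
Note the choice of method: diamond is relatively slow for a small number of sequences but relatively fast for a large number. The HMMER method has a higher success rate for distantly related sequences, but takes a long time on large data sets; the online service accepts at most 5000 sequences per submission.
3. Submit the job
Click the Run button to submit the job.
A page with the job status and a list of citations then appears. Note that the online service both limits the number of sequences and queues jobs, so when many people are using it the wait can be long.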
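Before uploading, it can help to check a FASTA file against the 5000-sequence online limit mentioned in step 2. A minimal sketch follows; the function names are my own, and the limit is passed as a parameter in case the service changes it.

```python
def count_fasta_records(lines):
    """Count sequences by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))

def submissions_needed(n_sequences, limit=5000):
    """Number of separate online submissions needed (ceiling division)."""
    return -(-n_sequences // limit)
```

With a real file this would be called as `count_fasta_records(open("proteins.faa"))`, where `proteins.faa` stands for your own input. Anything needing more than one submission is probably better run locally.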
eggNOG-mapper local version
Installing with conda is recommended; it takes care of the dependencies and environment variables for you
conda install eggnog-mapper
Manual download and installation
cd ~/software
wget https://github.com/jhcepas/eggnog-mapper/archive/1.0.3.tar.gz
tar xvzf 1.0.3.tar.gz
cd eggnog-mapper-1.0.3
Software description
less README.md
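eggNOG-mapper uses the eggNOG database to functionally annotate novel gene and protein sequences. It is commonly applied to gene sets from new genomes, transcriptomes and metagenomes. Function prediction based on orthology is considered more accurate than a traditional homology search, because it avoids borrowing annotations directly from paralogs (gene duplication is very likely to lead to functional divergence).
Documentation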
https://github.com/jhcepas/eggnog-mapper/wiki
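Installation notes
Software dependencies: Python 2.7, wget, HMMER3, DIAMOND
Disk space requirements:
Memory requirements:
HMMER3 annotation is very fast when the database can be held in memory; the memory needed is:
Eukaryotic database euk: ~90GB
Bacterial database bact: ~32GB
Archaeal database arch: ~10GB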
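The memory figures above can be turned into a small guard that decides whether loading a database into memory is realistic on a given machine. This is only a sketch: `USEMEM_GB` and `extra_flags` are hypothetical names, the figures are the approximate ones listed above, and the amount of available memory has to be supplied by the caller.

```python
# Approximate RAM needed to hold each HMM database in memory, in GB.
USEMEM_GB = {"euk": 90, "bact": 32, "arch": 10}

def extra_flags(database, available_gb):
    """Return ['--usemem'] if the database should fit in memory, else []."""
    needed = USEMEM_GB.get(database)
    if needed is not None and available_gb >= needed:
        return ["--usemem"]
    return []
```

On a 64 GB machine this suggests `--usemem` for `bact` and `arch` but not for `euk`.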
Software installation
Besides the conda and wget routes above, the software can also be obtained with git
git clone https://github.com/jhcepas/eggnog-mapper.git
Database download
Show the program's help
python eggnog-mapper/download_eggnog_data.py -h
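Download four commonly used databases (euk, bact, arch, viruses) into the directory given by --data_dir; -y answers yes to all prompts and -f forces the download.
The search commands further down read the databases from ~/data/db/eggnog, so point --data_dir there to wherever you actually downloaded them.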
mkdir -p eggnog
python eggnog-mapper/download_eggnog_data.py --data_dir eggnog -y -f euk bact arch viruses
cd eggnog-mapper
Searching the bacterial database locally with the HMMER method
Disk based searches on the optimized bacterial database
-i is the input file, --output the prefix of the output files, -d the database to search, --data_dir the location of the databases
python emapper.py -i test/polb.fa --output polb_bact -d bact --data_dir ~/data/db/eggnog
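diamond method
-m selects the diamond method; the default is hmmer. diamond only shows its speed advantage with more than about a thousand sequences. With just a few sequences it feels very slow, and its results are also less accurate than those of hmmer, especially when annotating distant homologs.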
python emapper.py -i test/polb.fa --output diamond_bact_ -d bact --data_dir ~/data/db/eggnog -m diamond
This takes a long time, more than an hour
Interpreting the results
https://github.com/jhcepas/eggnog-mapper/wiki/Results-Interpretation
There are three result files
polb_bact.emapper.annotations
polb_bact.emapper.hmm_hits
polb_bact.emapper.seed_orthologs
The annotations file is the one to focus on; it contains the GO, KEGG and COG descriptions for each gene
[project_name].emapper.hmm_hits file: list of HMM search hits
For each query sequence, a list of significant hits to eggNOG Orthologous Groups (OGs) is reported. Each line in the file represents a hit, where evalue, bit-score, query-coverage and the sequence coordinates of the match are reported. If multiple hits exist for a given query, results are sorted by e-value.
[project_name].emapper.seed_orthologs file: list of best matches
Each line in the file provides the best match of each query within the best Orthologous Group (OG) reported in the [project].hmm_hits file, obtained running PHMMER against all sequences within the best OG. The seed ortholog is used to fetch fine-grained orthology relationships from eggNOG. If using the diamond search mode, seed orthologs are directly obtained from the best matching sequences by running DIAMOND against the whole eggNOG protein space.
[project_name].emapper.annotations file: the collated annotation results. This is the important one.
This file provides final annotations of each query. Tab-delimited columns in the file are:
There are 13 columns:
query_name: query sequence name
seed_eggNOG_ortholog: best protein match in eggNOG
seed_ortholog_evalue: best protein match (e-value)
seed_ortholog_score: best protein match (bit-score)
predicted_gene_name: Predicted gene name for query sequences
GO_terms: Comma delimited list of predicted Gene Ontology terms
KEGG_KO: Comma delimited list of predicted KEGG KOs
BiGG_Reactions: Comma delimited list of predicted BiGG metabolic reactions
Annotation_tax_scope: The taxonomic scope used to annotate this query sequence
Matching_OGs: Comma delimited list of matching eggNOG Orthologous Groups
best_OG|evalue|score: Best matching Orthologous Groups (only in HMM mode)
COG functional categories: COG functional category inferred from best matching OG
eggNOG_HMM_model_annotation: eggNOG functional description inferred from best matching OG
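To work with the annotations table in a script, the 13 columns above can be read into dictionaries. The sketch below assumes the column order listed above and that comment lines start with `#`; the example row is entirely made up (placeholder IDs, not real eggNOG output).

```python
import io

COLUMNS = [
    "query_name", "seed_eggNOG_ortholog", "seed_ortholog_evalue",
    "seed_ortholog_score", "predicted_gene_name", "GO_terms", "KEGG_KO",
    "BiGG_Reactions", "Annotation_tax_scope", "Matching_OGs",
    "best_OG|evalue|score", "COG functional categories",
    "eggNOG_HMM_model_annotation",
]

def read_annotations(handle):
    """Parse an .emapper.annotations file into a list of dicts."""
    rows = []
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header/statistics comments and blank lines
        fields = line.rstrip("\n").split("\t")
        # Pad short rows so every column name gets a value.
        fields += [""] * (len(COLUMNS) - len(fields))
        rows.append(dict(zip(COLUMNS, fields)))
    return rows

# Made-up row for illustration only.
EXAMPLE = "\t".join([
    "q1", "0000.placeholder_protein", "1e-50", "200.0", "geneX",
    "GO:0000000,GO:0000001", "K00000", "", "placeholder_scope",
    "OG_A,OG_B", "OG_A|1e-50|200.0", "L", "placeholder description",
]) + "\n"

rows = read_annotations(io.StringIO("# a comment line\n" + EXAMPLE))
```

With a real file, pass an open file handle instead of the `io.StringIO` wrapper.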
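Advanced usage
https://github.com/jhcepas/eggnog-mapper/wiki/Advanced-usage-and-tips
Speeding up with large memory and multiple threads
--usemem loads the whole database into memory so that searches run against preloaded data, --cpu sets the number of threads, and --override forces existing results to be overwritten (otherwise the run aborts if result files already exist)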
python emapper.py -i test/polb.fa --output polb_bact --database bact --data_dir ~/data/db/eggnog --usemem --cpu 10 --override
# Total time: 11.8659 secs
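First load the bacterial database in server mode; --database accepts only one database at a time
python emapper.py --database bact --data_dir ~/data/db/eggnog --cpu 10 --servermode
Loading the data takes some time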
Waiting for server to become ready... localhost 51500
until this is displayed:
Server ready listening at localhost:51500 and using 10 CPU cores
Use `emapper.py -d bact:localhost:51500 (...)` to search against this server
Then start the analysis command
--usemem loads the whole database into memory, --cpu sets the number of threads
python emapper.py -i test/polb.fa --output polb_bact --database bact:localhost:51500 --data_dir ~/data/db/eggnog --usemem --cpu 10 --override
# Total time: 9.77332 secs
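When the server and the analysis are launched from the same script, the analysis has to wait until the server is listening. A minimal sketch, assuming that an accepted TCP connection on the port is a good enough proxy for the "Server ready" message (I have not verified how the server reacts to such a probe); `wait_for_server` is a hypothetical helper, not part of eggnog-mapper.

```python
import socket
import time

def wait_for_server(host="localhost", port=51500, timeout=600.0, interval=5.0):
    """Poll host:port until a TCP connection succeeds or timeout (s) expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable): retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

The default port 51500 is the one shown in the log above; if the server reports a different port, pass it explicitly.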
https://github.com/jhcepas/eggnog-mapper/wiki/Setting-up-large-scale-analyses
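Annotation of large genomes and metagenomic data (>100M proteins).
The analysis has two main steps: homology search, which is compute-intensive, and functional annotation, which is I/O-intensive. Splitting the data improves efficiency.
Homology search
1. Split the sequences
Prepare the file and convert it to single-line FASTA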
cp /mnt/bai/yongxin/test/meta1809/temp/23prokka_all/mg.faa input_file0.faa
format_fasta_1line.pl -i input_file0.faa -o input_file.faa
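Split into files of 2 million lines each, i.e. 1 million sequences. For this test, 10000 lines (5000 sequences) are used.
# -l splits by line count; -a 3 sets the suffix width to 3 (default 2); -d uses numeric suffixes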
split -l 10000 -a 3 -d input_file.faa input_file.chunk_
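Splitting by line count only works once every sequence sits on a single line. As an alternative, the split can be done per record in Python, which also handles multi-line FASTA. This is a sketch with my own function names; the output names mimic the `input_file.chunk_000` pattern produced by `split -a 3 -d`.

```python
def split_fasta(lines, records_per_chunk):
    """Yield lists of lines, each holding at most records_per_chunk records."""
    chunk, n_records = [], 0
    for line in lines:
        if line.startswith(">"):
            if n_records == records_per_chunk:
                yield chunk
                chunk, n_records = [], 0
            n_records += 1
        chunk.append(line)
    if chunk:
        yield chunk

def write_chunks(in_path, prefix, records_per_chunk=5000):
    """Write chunks to prefix000, prefix001, ... and return the paths."""
    paths = []
    with open(in_path) as handle:
        for i, chunk in enumerate(split_fasta(handle, records_per_chunk)):
            path = "%s%03d" % (prefix, i)
            with open(path, "w") as out:
                out.writelines(chunk)
            paths.append(path)
    return paths
```

A call such as `write_chunks("input_file0.faa", "input_file.chunk_")` would then replace both the reformatting and the `split` step.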
2. Parallel search
Method 1. Generate commands for a cluster
for f in *.chunk_*; do
echo ./emapper.py -m diamond --no_annot --no_file_comments --cpu 16 -i $f -o $f;
done
Method 2. Run in parallel
time parallel -j 3 --xapply \
'python emapper.py -m diamond --no_annot --no_file_comments --data_dir ~/data/db/eggnog --cpu 16 -i {1} -o {1}' \
::: input_file.chunk*
Elapsed time: real 14m45.579s
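If GNU parallel is not available, the same fan-out can be sketched with the Python standard library. The flags are the ones used in the commands above; `search_command` and `run_all` are hypothetical helper names, and the `runner` argument exists only so the logic can be tried without actually launching emapper.py.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def search_command(chunk, data_dir, cpu=16):
    """Build the diamond search command for one chunk file."""
    return [
        "python", "emapper.py", "-m", "diamond",
        "--no_annot", "--no_file_comments",
        # No shell is involved, so "~" has to be expanded here.
        "--data_dir", os.path.expanduser(data_dir),
        "--cpu", str(cpu), "-i", chunk, "-o", chunk,
    ]

def run_all(chunks, data_dir, jobs=3, cpu=16, runner=subprocess.call):
    """Run at most `jobs` searches at a time; return one result per chunk."""
    commands = [search_command(chunk, data_dir, cpu) for chunk in chunks]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(runner, commands))
```

Threads are sufficient here because each worker only waits on an external process. With the default runner, the returned list holds the exit codes in chunk order.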
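Functional annotation
This step is disk-intensive; it is recommended to store eggnog.db on an SSD or in the /dev/shm in-memory directory
3. Merge the search results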
cat *.chunk_*.emapper.seed_orthologs > input_file.emapper.seed_orthologs
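Plain concatenation is fine when every chunk was written with --no_file_comments and ends with a newline. If that is not guaranteed, a slightly more defensive merge can be sketched in Python; `merge_hits` is a hypothetical helper that assumes comment lines start with `#`.

```python
import io

def merge_hits(chunk_handles, out):
    """Copy hit lines from each chunk to `out`; return how many were written."""
    written = 0
    for handle in chunk_handles:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # drop comment and blank lines
            # Make sure a missing final newline does not glue two hits together.
            out.write(line if line.endswith("\n") else line + "\n")
            written += 1
    return written

# Tiny made-up example with a comment line and a missing trailing newline.
merged = io.StringIO()
n_hits = merge_hits(
    [io.StringIO("# header\nq1\th1\t1e-5\t80"), io.StringIO("q2\th2\t1e-9\t95\n")],
    merged,
)
```

For real files, open each `*.emapper.seed_orthologs` chunk and the merged output file instead of the `io.StringIO` objects.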
4. Annotate
To speed things up, copy the database into memory (takes 21s)
cp ~/data/db/eggnog/eggnog.db /dev/shm
time emapper.py --annotate_hits_table input_file.emapper.seed_orthologs --no_file_comments -o output_file --cpu 20 --data_dir /dev/shm --override
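With the database in memory, processing 10,000 sequences takes about 15s
We now have the annotation list for all genes. Combined with a gene abundance matrix, it can be used for all kinds of summaries, differential comparisons and functional descriptions.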
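As one example of such a summary, gene abundances can be added up per COG category and sample. This sketch uses plain dictionaries and my own function name; it also makes one assumption you may want to change: a gene with several category letters contributes its full abundance to each of them.

```python
from collections import defaultdict

def cog_category_abundance(gene_to_cog, gene_abundance):
    """Sum abundances per COG letter.

    gene_to_cog:    gene -> category string, e.g. "K" or "KL"
    gene_abundance: gene -> {sample: abundance}
    """
    totals = defaultdict(lambda: defaultdict(float))
    for gene, categories in gene_to_cog.items():
        for sample, value in gene_abundance.get(gene, {}).items():
            for letter in categories:
                if letter.isalpha():  # ignore separators such as commas
                    totals[letter][sample] += value
    return {letter: dict(samples) for letter, samples in totals.items()}
```

Genes without a category, or without an entry in the abundance table, are simply skipped.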
Appendix 1. emapper.py parameters in detail
python emapper.py -h
usage: emapper.py [-h] [--guessdb] [--database] [--dbtype {hmmdb,seqdb}]
[--data_dir] [--qtype {hmm,seq}] [--tax_scope]
[--target_orthologs {one2one,many2one,one2many,many2many,all}]
[--excluded_taxa]
[--go_evidence {experimental,non-electronic}]
[--hmm_maxhits] [--hmm_evalue] [--hmm_score]
[--hmm_maxseqlen] [--hmm_qcov] [--Z] [--dmnd_db DMND_DB]
[--matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}]
[--gapopen GAPOPEN] [--gapextend GAPEXTEND]
[--seed_ortholog_evalue] [--seed_ortholog_score] [--output]
[--resume] [--override] [--no_refine] [--no_annot]
[--no_search] [--report_orthologs] [--scratch_dir]
[--output_dir] [--temp_dir] [--no_file_comments]
[--keep_mapping_files] [-m {hmmer,diamond}] [-i]
[--translate] [--servermode] [--usemem] [--cpu]
[--annotate_hits_table] [--version]
optional arguments:
-h, --help show this help message and exit
--version show the version number
Target HMM Database Options:
--guessdb guess eggnog db based on the provided taxid
--database , -d specify the target database for sequence searches (only one database can be given). Choose among: euk,bact,arch, host:port, or a local hmmpressed database
--dbtype {hmmdb,seqdb} database type
--data_dir Directory to use for DATA_PATH.
--qtype {hmm,seq} query type: HMM profile (hmm) or sequence (seq)
Annotation Options:
--tax_scope Fix the taxonomic scope used for annotation, so only orthologs from a particular clade are used for functional transfer. By default, this is automatically adjusted for every query sequence.
--target_orthologs {one2one,many2one,one2many,many2many,all}
defines what type of orthologs should be used for functional transfer
--excluded_taxa (for debugging and benchmark purposes)
--go_evidence {experimental,non-electronic}
Defines what type of GO terms should be used for
annotation: experimental = Use only terms inferred from
experimental evidence; non-electronic = Use only
non-electronically curated terms
HMM search_options:
--hmm_maxhits Max number of hits to report. Default=1
--hmm_evalue E-value threshold. Default=0.001
--hmm_score Bit score threshold. Default=20
--hmm_maxseqlen Ignore query sequences larger than `maxseqlen`.
Default=5000
--hmm_qcov min query coverage (from 0 to 1). Default=(disabled)
--Z Fixed database size used in phmmer/hmmscan (allows
comparing e-values among databases).
Default=40,000,000
diamond search_options:
--dmnd_db DMND_DB Path to DIAMOND-compatible database
--matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}
Scoring matrix
--gapopen GAPOPEN Gap open penalty
--gapextend GAPEXTEND
Gap extend penalty
Seed ortholog search option:
--seed_ortholog_evalue
Min E-value expected when searching for seed eggNOG
ortholog. Applies to phmmer/diamond searches. Queries
not having a significant seed orthologs will not be
annotated. Default=0.001
--seed_ortholog_score
Min bit score expected when searching for seed eggNOG
ortholog. Applies to phmmer/diamond searches. Queries
not having a significant seed orthologs will not be
annotated. Default=60
Output options:
--output , -o base name for output files
--resume Resumes a previous execution skipping reported hits in
the output file.
--override Overwrites output files if they exist.
--no_refine Skip hit refinement, reporting only HMM hits.
--no_annot Skip functional annotation, reporting only hits
--no_search Skip HMM search mapping. Use existing hits file
--report_orthologs The list of orthologs used for functional transferred
are dumped into a separate file
--scratch_dir Write output files in a temporary scratch dir, move
them to final the final output dir when finished.
Speed up large computations using network file
systems.
--output_dir Where output files should be written
--temp_dir Where temporary files are created. Better if this is a
local disk.
--no_file_comments No header lines nor stats are included in the output
files
--keep_mapping_files Do not delete temporary mapping files used for
annotation (i.e. HMMER and DIAMOND search outputs)
Execution options:
-m {hmmer,diamond} Search mode. Default: hmmer
-i Input FASTA file containing query sequences
--translate Assume sequences are genes instead of proteins (nucleotide input is translated to protein)
--servermode Loads target database in memory and keeps running in
server mode, so another instance of eggnog-mapper can
connect to this sever. Auto turns on the --usemem flag
--usemem If a local hmmpressed database is provided as target
using --db, this flag will allocate the whole database
in memory using hmmpgmd. Database will be unloaded
after execution.
--cpu number of threads
--annotate_hits_table
Annotate TSV formatted table of query->hits. 4
fields required: query, hit, evalue, score. Implies
--no_search and --no_refine.
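The hits table expected by --annotate_hits_table has four fields (query, hit, evalue, score), and the seed ortholog options above give default cut-offs of 0.001 for the e-value and 60 for the bit score. If you want to pre-filter such a table yourself, a sketch could look like this; `filter_hits` is a hypothetical helper, and treating both thresholds as inclusive is my assumption, not something the help text states.

```python
def filter_hits(rows, max_evalue=0.001, min_score=60.0):
    """Keep (query, hit, evalue, score) rows that pass both thresholds."""
    kept = []
    for query, hit, evalue, score in rows:
        evalue, score = float(evalue), float(score)
        if evalue <= max_evalue and score >= min_score:
            kept.append((query, hit, evalue, score))
    return kept
```

Rows can be given as strings straight from a split TSV line; the values are converted to floats before comparison.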
https://github.com/jhcepas/eggnog-mapper/wiki
[1] Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Jaime Huerta-Cepas, Kristoffer Forslund, Luis Pedro Coelho, Damian Szklarczyk, Lars Juhl Jensen, Christian von Mering and Peer Bork. Mol Biol Evol (2017). doi: 10.1093/molbev/msx148
[2] eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Jaime Huerta-Cepas, Damian Szklarczyk, Kristoffer Forslund, Helen Cook, Davide Heller, Mathias C. Walter, Thomas Rattei, Daniel R. Mende, Shinichi Sunagawa, Michael Kuhn, Lars Juhl Jensen, Christian von Mering, and Peer Bork. Nucl. Acids Res. (04 January 2016) 44 (D1): D286-D293. doi: 10.1093/nar/gkv1248