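This article is reposted from Xiao Bin's ScienceNet blog.
Early strategies for analyzing 16S rRNA include FISH (fluorescent in situ hybridization), DGGE (denaturing gradient gel electrophoresis), PCR cloning, and T-RFLP (terminal restriction fragment length polymorphism). With the development of NGS (next-generation sequencing), this post focuses on the application of NGS to 16S rRNA analysis.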
The main tools for 16S rRNA NGS data analysis referred to in this post are QIIME and mothur.
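The analysis of 16S rRNA NGS data has three major steps:
Raw data preprocessing: adapter removal, data filtering, denoising, chimera checking, and data normalization;
Microbial diversity analysis: defining OTUs and their representative sequences, including picking OTUs and representative sequences, taxonomic assignment, and phylogenetic tree analysis;
In-depth analysis and visualization: alpha and beta diversity analysis, clustering and correlation analysis, and data visualization.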
The steps of the whole workflow are described in detail below.
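16S amplicons are usually sequenced as pooled libraries, so the raw output must first be split into individual samples according to the barcode sequences. In QIIME, "split_libraries.py" and "split_libraries_fastq.py" perform both demultiplexing and data filtering; mothur uses "trim.seqs". Neither QIIME nor mothur can process sff files (the output format of 454 sequencers) directly, but "process_sff.py" (QIIME) and "sffinfo" (mothur) convert sff files into FASTA and QUAL files.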
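To make the idea of barcode-based splitting concrete, here is a minimal sketch in Python. It is not how QIIME or mothur implement it; the function name `demultiplex`, the `max_mismatches` parameter, and the assumption that every barcode has the same length and sits at the start of the read are all illustrative.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, max_mismatches=0):
    """Assign each read to a sample by its leading barcode and strip the barcode.

    Assumes (for illustration) that all barcodes have the same length.
    Reads matching zero or several barcodes go to "unassigned".
    """
    barcode_len = len(next(iter(barcode_to_sample)))
    by_sample = defaultdict(list)
    for read in reads:
        prefix, insert = read[:barcode_len], read[barcode_len:]
        hits = []
        if len(prefix) == barcode_len:
            # Collect every sample whose barcode is within the mismatch budget.
            hits = [
                sample
                for barcode, sample in barcode_to_sample.items()
                if sum(a != b for a, b in zip(prefix, barcode)) <= max_mismatches
            ]
        key = hits[0] if len(hits) == 1 else "unassigned"
        by_sample[key].append(insert)
    return dict(by_sample)


barcodes = {"ACGT": "soil", "TGCA": "gut"}
reads = ["ACGTGGGG", "TGCACCCC", "AAAATTTT", "ACGAGGTT"]
strict = demultiplex(reads, barcodes)
relaxed = demultiplex(reads, barcodes, max_mismatches=1)
```

With one mismatch allowed, the read starting with "ACGA" is rescued into the "soil" sample, while the strict run leaves it unassigned.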
Parameters to consider in data filtering include the minimum average quality score allowed in a read, the maximum number of ambiguous bases allowed, the minimum and maximum sequence length, the maximum homopolymer length allowed, the maximum number of mismatches allowed in the primer or barcode, and whether to truncate the reverse primer.
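Several of these filtering parameters can be sketched as a single read-level check. The function `passes_filter` and all default thresholds below are hypothetical, chosen only for illustration; ambiguous bases are simplified to "N", and primer or barcode mismatches are left out.

```python
import re


def passes_filter(seq, quals, min_mean_q=25, max_ambiguous=0,
                  min_len=200, max_len=1000, max_homopolymer=6):
    """Return True if a read passes length, quality, ambiguity and homopolymer checks."""
    if not seq or not (min_len <= len(seq) <= max_len):
        return False
    # Minimum average quality score allowed in a read.
    if sum(quals) / len(quals) < min_mean_q:
        return False
    # Maximum number of ambiguous bases (only "N" is counted in this sketch).
    if seq.upper().count("N") > max_ambiguous:
        return False
    # Longest run of one repeated base.
    longest_run = max(len(m.group(0)) for m in re.finditer(r"(.)\1*", seq))
    return longest_run <= max_homopolymer


ok = passes_filter("ACGTACGT", [30] * 8, min_len=4, max_len=10)
```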
Both the PCR step of 16S library construction and the sequencing step introduce errors into the sequences, and these errors need to be removed effectively during analysis. Commonly used tools for correcting sequencing errors include Denoiser (implemented in QIIME), AmpliconNoise, Acacia, and pre.cluster (implemented in mothur). Tools for chimera detection include ChimeraSlayer, UCHIME, Perseus, and DECIPHER; ChimeraSlayer, UCHIME, and Perseus can all be called from mothur. These tools still leave room for improvement: the different methods often disagree with one another on the list of identified chimeras, probably because of their different mechanisms or algorithms. More effort is needed to evaluate these methods and reconcile their inconsistencies in chimera identification.
One point about archaeal sequences deserves attention: a very small proportion of archaeal sequences may be generated in 16S rRNA gene amplicon datasets amplified with bacteria-specific primers. These unexpected sequences should be identified after denoising and chimera removal, and it is advisable to discard them before subsequent data normalization.
Insufficient or uneven sequencing depth affects both alpha and beta diversity: it can distort diversity estimates within a single sample (alpha diversity) as well as comparisons across samples (beta diversity), so data normalization is required. Two strategies are commonly used, relative abundance and random sampling (rarefaction); in addition, z-scores are also used for normalization. Each method has its own drawbacks.
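The two main normalization strategies can be sketched in a few lines of Python. The function names, the dictionary-based OTU table, and the fixed random seed are assumptions made for this illustration; they do not correspond to any particular QIIME or mothur command.

```python
import random


def relative_abundance(counts):
    """Convert raw OTU counts of one sample into proportions that sum to 1."""
    total = sum(counts.values())
    return {otu: n / total for otu, n in counts.items()}


def rarefy(counts, depth, seed=0):
    """Randomly subsample one sample, without replacement, to a fixed depth."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    subsample = {otu: 0 for otu in counts}
    for otu in rng.sample(pool, depth):
        subsample[otu] += 1
    return subsample


sample = {"otu1": 50, "otu2": 150}
proportions = relative_abundance(sample)
even_depth = rarefy(sample, depth=100)
```

Relative abundance keeps every read but makes samples compositional, whereas rarefaction discards reads so that every sample ends up with the same depth.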
OTUs are defined mainly by sequence identity, and different identity cutoffs of the 16S rRNA gene have been used for different taxonomic ranks. For example, the cutoffs recommended by MEGAN are 99% for species, 97% for genus, 95% for family, and 90% for order. The choice of OTU picking strategy and algorithm has a significant effect on downstream data interpretation.
Depending on whether a reference database is used, OTU picking strategies fall into three groups: de novo, closed reference, and open reference. Many clustering or alignment tools are available for OTU picking, such as Uclust, cd-hit, BLAST, mothur, usearch, and prefix/suffix, all of which are implemented in QIIME. Among them, the mothur method offers three clustering algorithms for picking de novo OTUs: nearest neighbor, furthest neighbor, and average neighbor.
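The difference between the three strategies can be illustrated with a toy sketch. It assumes sequences that are already aligned to equal length and uses a simple greedy, first-member-as-centroid clustering; this is not the algorithm of Uclust, cd-hit, or mothur, and all function names are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(seqs, references, threshold=0.97):
    """Assign reads to their best-matching reference; the rest stay unassigned."""
    assigned = {name: [] for name in references}
    unassigned = []
    for seq in seqs:
        best = max(references, key=lambda name: identity(seq, references[name]))
        if identity(seq, references[best]) >= threshold:
            assigned[best].append(seq)
        else:
            unassigned.append(seq)
    return assigned, unassigned


def de_novo(seqs, threshold=0.97):
    """Greedy clustering without a database; the first member acts as the centroid."""
    clusters = []
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


def open_reference(seqs, references, threshold=0.97):
    """Closed-reference picking first, then de novo clustering of what is left."""
    assigned, unassigned = closed_reference(seqs, references, threshold)
    return assigned, de_novo(unassigned, threshold)


seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "CCCCCCCCCG"]
refs = {"ref1": "AAAAAAAAAA"}
ref_otus, new_otus = open_reference(seqs, refs, threshold=0.9)
```

In a closed-reference run the two C-rich reads would simply be discarded; the open-reference run keeps them as a new de novo OTU.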
Once sequences have been clustered, each cluster represents one OTU, and the next step is to choose a representative sequence for it. The representative can be a random sequence, the longest sequence, the most abundant sequence (the default in QIIME), or the first sequence in the OTU cluster. Alternatively, the distance method in mothur picks the sequence with the smallest maximum distance to the other sequences in the OTU.
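These selection rules are simple enough to write down directly. The sketch below is only an illustration: the function names are made up, and the Hamming distance on equal-length sequences stands in for whatever distance a real pipeline would compute.

```python
from collections import Counter


def most_abundant(cluster):
    """The sequence that occurs most often in the cluster."""
    return Counter(cluster).most_common(1)[0][0]


def longest(cluster):
    """The longest sequence in the cluster."""
    return max(cluster, key=len)


def first(cluster):
    """The first sequence that was placed in the cluster."""
    return cluster[0]


def smallest_max_distance(cluster, distance):
    """The sequence whose largest distance to any other member is smallest."""
    return min(cluster, key=lambda s: max(distance(s, t) for t in cluster))


def hamming(a, b):
    """Number of differing positions (assumes equal-length sequences)."""
    return sum(x != y for x, y in zip(a, b))


cluster = ["AAAA", "AAAT", "AAAT", "AATT"]
rep_abundant = most_abundant(cluster)
rep_central = smallest_max_distance(cluster, hamming)
```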
Strategies for taxonomic assignment include:
word match, e.g. RDP Classifier;
best hit;
lowest common ancestor (LCA), e.g. MEGAN and the SINA Alignment Service.
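The lowest common ancestor idea can be shown with a small sketch: given the lineages of all database hits for one read, keep only the ranks on which every hit agrees. The function name and the list-of-ranks representation are assumptions for illustration; MEGAN and SINA apply additional scoring and thresholds that are not modeled here.

```python
def lowest_common_ancestor(lineages):
    """Return the shared leading ranks of several lineages (root first)."""
    shared = []
    for ranks in zip(*lineages):
        # Stop at the first rank where the hits disagree.
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared


hits = [
    ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales"],
    ["Bacteria", "Firmicutes", "Bacilli", "Bacillales"],
    ["Bacteria", "Firmicutes", "Clostridia", "Clostridiales"],
]
assignment = lowest_common_ancestor(hits)
```

Here the read is assigned only down to the phylum, because the hits disagree at the class level; a best-hit strategy would instead report the full lineage of the single top hit.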
Phylogenetic relationships are usually represented as trees, which are built from multiple sequence alignments. Alignment tools include ClustalW, MUSCLE, Clustal Omega, Kalign, T-COFFEE, and COBALT. Tree construction and viewing are covered by tools commonly used with 16S rRNA NGS data, such as FastTree, MEGA, RAxML, MrBayes, PhyML, TreeView, Clearcut, and FigTree. RAxML and PhyML are the most widely used programs for maximum-likelihood phylogenetic analysis, probably because they are specifically designed and optimized for that purpose.
Alpha diversity can be described by many indices. mothur provides Shannon, Berger-Parker, Simpson, and the Q statistic, along with observed richness, Chao1, ACE, and jackknife. QIIME provides phylogenetic diversity (PD whole tree), Chao1, and observed species.
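A few of these indices can be computed from a single sample's OTU counts as follows. This is a sketch under stated assumptions: Simpson is given in the Gini-Simpson form (1 minus the sum of squared proportions), which is only one of several conventions, and Chao1 falls back to the bias-corrected form when there are no doubletons. The function names are illustrative and are not the mothur or QIIME calculators.

```python
import math


def observed_richness(counts):
    """Number of OTUs with at least one read."""
    return sum(1 for n in counts if n > 0)


def shannon(counts):
    """Shannon index using the natural logarithm."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def simpson(counts):
    """Gini-Simpson index: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1 - sum((n / total) ** 2 for n in counts)


def berger_parker(counts):
    """Proportion of reads that belong to the most abundant OTU."""
    return max(counts) / sum(counts)


def chao1(counts):
    """Chao1 richness estimate from singleton and doubleton counts."""
    s_obs = observed_richness(counts)
    f1 = sum(1 for n in counts if n == 1)
    f2 = sum(1 for n in counts if n == 2)
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    return s_obs + f1 * (f1 - 1) / 2  # bias-corrected form when f2 == 0


even = [10, 10, 10, 10]
skewed = [5, 1, 1, 2]
```

For the perfectly even sample, Shannon equals ln(4) and Chao1 equals the observed richness; for the skewed sample, the two singletons and one doubleton push the Chao1 estimate above the four observed OTUs.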
Rarefaction curves are another way to compare species richness. QIIME mainly uses "single_rarefaction.py" and "multiple_rarefactions.py"; mothur mainly uses "rarefaction.single" and "rarefaction.shared".
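A rarefaction curve plots the expected number of OTUs against the number of reads sampled. The sketch below uses the analytical expectation for sampling without replacement rather than the repeated random subsampling that pipeline scripts typically perform; the function names and the `step` parameter are illustrative assumptions.

```python
from math import comb


def expected_richness(counts, n):
    """Expected number of OTUs seen when n reads are drawn without replacement."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total read count")
    denom = comb(total, n)
    # For each OTU: 1 - P(none of its reads are among the n drawn).
    return sum(1 - comb(total - c, n) / denom for c in counts if c > 0)


def rarefaction_curve(counts, step):
    """(depth, expected richness) pairs at regular depths up to the full sample."""
    total = sum(counts)
    depths = list(range(step, total, step)) + [total]
    return [(d, expected_richness(counts, d)) for d in depths]


curve = rarefaction_curve([5, 3, 2], step=5)
```

A curve that flattens out before the full depth suggests that further sequencing would reveal few additional OTUs.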
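Beta diversity reflects how much samples differ from one another. Several distance metrics are often employed, such as UniFrac, Bray-Curtis, Euclidean, Jaccard index, Yue & Clayton, and Morisita-Horn. Depending on whether the relative abundance of OTUs is taken into account, beta diversity measures are divided into quantitative and qualitative indices.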
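The quantitative versus qualitative distinction is easy to see by comparing Bray-Curtis (abundance-based) with the Jaccard distance (presence/absence). In this sketch each sample is a list of counts over the same ordered set of OTUs; the function names are illustrative.

```python
def bray_curtis(u, v):
    """Quantitative dissimilarity based on OTU abundances."""
    denom = sum(u) + sum(v)
    if denom == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / denom


def jaccard_distance(u, v):
    """Qualitative dissimilarity based only on which OTUs are present."""
    present_u = {i for i, a in enumerate(u) if a > 0}
    present_v = {i for i, b in enumerate(v) if b > 0}
    union = present_u | present_v
    if not union:
        return 0.0
    return 1 - len(present_u & present_v) / len(union)


sample_a = [1, 1]
sample_b = [100, 1]
bc = bray_curtis(sample_a, sample_b)
jd = jaccard_distance(sample_a, sample_b)
```

The two samples contain exactly the same OTUs, so the Jaccard distance is 0, while Bray-Curtis is close to 1 because the abundances differ strongly.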
For two-sample or two-group comparisons, the t-test is the usual choice. Note that the independent two-sample t-test requires the two populations to be independent and to have equal variances (which can be tested with the F-test, Levene's test, etc.). When the data are not normally distributed, nonparametric two-sample tests that are robust to non-normality, such as the Wilcoxon signed-rank test (for paired samples) and the Mann-Whitney U test (for independent samples), can be used to test the significance of the difference between group medians.
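As an illustration of the rank-based approach, the sketch below computes the Mann-Whitney U statistics for two independent groups, using average ranks for ties. It stops at the statistic and does not compute a p-value; in practice a statistics package would be used for that. The function name is an assumption.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two independent samples, with average ranks for ties."""
    pooled = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Find the block of tied values starting at position i.
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


separated = mann_whitney_u([1, 2, 3], [4, 5, 6])
tied = mann_whitney_u([1, 2], [2, 3])
```

A U value of 0 for one group means every value in that group is smaller than every value in the other group, which is the most extreme separation possible.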
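For multiple-sample or multiple-group tests, ANOVA is used.
Clustering reveals how closely samples are related to one another, while classification strategies are used to assign samples to categories.
Once similarities and distances between samples have been calculated, the relationships among samples can be explored with principal component analysis (PCA), principal coordinates analysis (PCoA, also known as metric multidimensional scaling), nonmetric multidimensional scaling (NMDS), canonical correspondence analysis (CCA), linear discriminant analysis (LDA), and redundancy analysis (RDA).
The data can also be analyzed together with gene expression, metabolite, and protein data to build co-expression networks or identify co-expression patterns (co-occurrence and co-exclusion patterns).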
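One common way to look for co-occurrence and co-exclusion is to correlate abundance profiles across samples. The sketch below uses the Spearman rank correlation as an assumed choice; the post does not say which measure is used, and the OTU profiles are made-up numbers. A strongly positive value suggests co-occurrence and a strongly negative value suggests co-exclusion.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2
        i = j
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical abundances of three OTUs across four samples.
otu_a = [1, 5, 10, 50]
otu_b = [2, 3, 20, 21]
otu_c = [40, 30, 2, 1]
co_occurrence = spearman(otu_a, otu_b)
co_exclusion = spearman(otu_a, otu_c)
```

Pairs whose correlation passes a chosen threshold can then be drawn as edges of a network, with the sign of the correlation distinguishing the two kinds of pattern.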
Reference: Ju F, Zhang T. 16S rRNA gene high throughput sequencing data mining of microbiota diversity and interactions. Appl Microbiol Biotechnol. 2015;99(10):4119-4129.
The original post is well written, so it is shared here; if you are interested, copy the address below to read it.
http://blog.sciencenet.cn/blog-306699-933182.html