Eleven methods for differential expression analysis of RNA-seq data were compared. The main results are as follows:
DESeq
- Conservative with default settings. Becomes more conservative when outliers are introduced.
- Generally low TPR.
- Poor FDR control with 2 samples/condition, good FDR control for larger sample sizes, also with outliers.
- Medium computational time requirement, increases slightly with sample size.
edgeR
- Slightly liberal for small sample sizes with default settings. Becomes more liberal when outliers are introduced.
- Generally high TPR.
- Poor FDR control in many cases, worse with outliers.
- Medium computational time requirement, largely independent of sample size.
NBPSeq
- Liberal for all sample sizes. Becomes more liberal when outliers are introduced.
- Medium TPR.
- Poor FDR control, worse with outliers. Often truly non-DE genes are among those with smallest p-values.
- Medium computational time requirement, increases slightly with sample size.
TSPM
- Overall highly sample-size dependent performance.
- Liberal for small sample sizes, largely unaffected by outliers.
- Very poor FDR control for small sample sizes, improves rapidly with increasing sample size. Largely unaffected by outliers.
- When all genes are overdispersed, many truly non-DE genes are among the ones with smallest p-values. Remedied when the counts for some genes are Poisson distributed.
- Medium computational time requirement, largely independent of sample size.
voom / vst
- Good type I error control. Becomes more conservative when outliers are introduced.
- Low power for small sample sizes. Medium TPR for larger sample sizes.
- Good FDR control except for simulation study B04000. Largely unaffected by introduction of outliers.
- Computationally fast.
baySeq
- Highly variable results when all DE genes are regulated in the same direction. Less variability when the DE genes are regulated in different directions.
- Low TPR. Largely unaffected by outliers.
- Poor FDR control with 2 samples/condition, good for larger sample sizes in the absence of outliers. Poor FDR control in the presence of outliers.
- Computationally slow, but allows parallelization.
EBSeq
- TPR relatively independent of sample size and presence of outliers.
- Poor FDR control in most situations, relatively unaffected by outliers.
- Medium computational time requirement, increases slightly with sample size.
NOISeq
- Not clear how to set the threshold for qNOISeq to correspond to a given FDR threshold.
- Performs well, in terms of false discovery curves, when the dispersion is different between the conditions (see supplementary material).
- Computational time requirement highly dependent on sample size.
SAMseq
- Low power for small sample sizes. High TPR for large enough sample sizes.
- Performs well also for simulation study B04000.
- Largely unaffected by introduction of outliers.
- Computational time requirement highly dependent on sample size.
ShrinkSeq
- Often poor FDR control, but allows the user to also apply a fold change threshold in the inference procedure.
- High TPR.
- Computationally slow, but allows parallelization.
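The summaries above lean on two quantities: the true positive rate (TPR), the fraction of truly DE genes that a method calls, and the false discovery rate (FDR), the fraction of called genes that are not truly DE. As a minimal sketch of how these could be computed in a simulation where the truth is known, the Python below applies a Benjamini-Hochberg adjustment to a list of p-values and compares the resulting calls with the truth. The function names and the toy p-values are made up for illustration and are not taken from the paper; individual packages may adjust p-values differently.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def observed_tpr_fdr(adjusted, is_de, threshold=0.05):
    """Observed TPR and FDR when calling genes with adjusted p <= threshold."""
    called = [a <= threshold for a in adjusted]
    tp = sum(c and d for c, d in zip(called, is_de))
    fp = sum(c and not d for c, d in zip(called, is_de))
    n_de = sum(is_de)
    tpr = tp / n_de if n_de else 0.0
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return tpr, fdr


# Toy example (hypothetical numbers): six genes, the first three truly DE.
pvals = [0.001, 0.004, 0.30, 0.010, 0.50, 0.90]
truth = [True, True, True, False, False, False]
adj = benjamini_hochberg(pvals)
tpr, fdr = observed_tpr_fdr(adj, truth)
print(f"TPR = {tpr:.2f}, observed FDR = {fdr:.2f}")
```

In this vocabulary, "good FDR control" means the observed FDR stays at or below the chosen threshold, and "poor FDR control" means it exceeds it.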
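No single method is optimal in all situations; the choice of method in a given situation depends on the experimental conditions. Among the methods evaluated here, those combining a variance-stabilizing transformation with limma performed well in many cases, were unaffected by outliers, and were computationally fast, but they required at least 3 samples per condition to provide sufficient power. They also performed worse when the dispersion differed between the two conditions. The non-parametric method SAMseq was the best-performing method for large sample sizes, but required at least 4-5 samples per condition to provide sufficient power. For highly expressed genes, the fold change SAMseq required for statistical significance was lower than for many other methods, which may compromise the biological significance of some statistically significant DEGs. The same holds for ShrinkSeq, although it has an option to impose a fold change requirement in the inference procedure.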
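To make the fold change point concrete, here is a small sketch of a post-hoc filter that keeps a gene only if it is both statistically significant and changes by at least a chosen fold. This is an assumption-laden illustration, not how ShrinkSeq works: ShrinkSeq builds the fold change requirement into the inference itself, whereas this simply filters a finished result list. The function names, the pseudocount, the thresholds and the gene values are all hypothetical.

```python
import math


def log2_fold_change(mean_a, mean_b, pseudocount=1.0):
    """log2 ratio of condition B over condition A; the pseudocount avoids log(0)."""
    return math.log2((mean_b + pseudocount) / (mean_a + pseudocount))


def filter_by_fold_change(genes, alpha=0.05, min_abs_log2fc=1.0):
    """Keep genes with adjusted p <= alpha AND |log2 fold change| >= the minimum."""
    kept = []
    for name, mean_a, mean_b, adj_p in genes:
        lfc = log2_fold_change(mean_a, mean_b)
        if adj_p <= alpha and abs(lfc) >= min_abs_log2fc:
            kept.append(name)
    return kept


# Hypothetical results: (gene, mean count in A, mean count in B, adjusted p-value)
genes = [
    ("geneA", 10000.0, 11000.0, 0.001),  # highly expressed, significant, tiny change
    ("geneB", 50.0, 200.0, 0.010),       # significant and roughly four-fold
    ("geneC", 20.0, 80.0, 0.200),        # large change but not significant
]
print(filter_by_fold_change(genes))
```

A highly expressed gene such as the hypothetical geneA can reach a very small p-value with only a 10% change, which is the situation the paragraph above warns about.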
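Small sample sizes caused the false discovery rates of some methods to far exceed the FDR threshold. For the parametric methods, this may be because the mean and variance estimates are imprecise. TSPM was the method most affected by sample size, possibly because it relies on asymptotic estimates. Although the trend is towards larger sample sizes, and barcoding and multiplexing make it possible to analyze more samples at a fixed cost, RNA-seq experiments are so far still too expensive to allow extensive replication. The results of this study strongly suggest that differentially expressed genes found in small-sample experiments should be interpreted with caution, since the true FDR may exceed the selected FDR threshold several-fold.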
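The imprecision of variance estimates at small sample sizes is easy to see in a toy simulation. The sketch below draws values with a true variance of 1 for many hypothetical genes and reports how widely the per-gene sample variance scatters for different numbers of replicates. It uses Gaussian values (think of log-scale expression) purely for simplicity; this is an assumption of the example, not the count model used in the paper, and the function name and defaults are made up.

```python
import random
import statistics


def variance_estimate_spread(n_samples, n_genes=2000, seed=1):
    """Return the 5% and 95% quantiles of per-gene sample variances.

    Every simulated gene has true variance 1, so a wide range means
    the estimate from n_samples replicates is unreliable.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_genes):
        values = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        estimates.append(statistics.variance(values))
    cut_points = statistics.quantiles(estimates, n=20)  # 5%, 10%, ..., 95%
    return cut_points[0], cut_points[-1]


for n in (2, 3, 5, 10):
    low, high = variance_estimate_spread(n)
    print(f"n={n:2d}  5%-95% range of variance estimates: {low:.2f} to {high:.2f}")
```

With only 2 replicates the estimates span several orders of magnitude around the true value, which gives some intuition for why tests built on them can be badly calibrated.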
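DESeq, edgeR and NBPSeq are based on similar principles, so they rank genes with very similar accuracy. However, the sets of DEGs they select at the same threshold differ considerably, because they estimate the dispersion parameter in different ways. With default settings and reasonably large sample sizes, DESeq is often overly conservative, while edgeR and NBPSeq are often too liberal and call a large number of false DEGs. The analysis shows that parameter choices have a large effect, and that the recommended default parameters are in fact well chosen and usually give the best results.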
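For readers unfamiliar with the dispersion parameter: under the common negative binomial parameterization, the variance of a gene's counts is mean + dispersion × mean², so a dispersion of 0 reduces to the Poisson case. The sketch below shows the simplest possible per-gene estimate, a method-of-moments calculation. It is explicitly not the estimator used by DESeq, edgeR or NBPSeq, which share information across genes in different ways (and that difference is exactly why their DEG lists diverge); the function name and the counts are hypothetical.

```python
import statistics


def moment_dispersion(counts):
    """Method-of-moments dispersion: solve var = mean + phi * mean**2 for phi.

    Negative estimates (variance below the Poisson level) are truncated to 0.
    """
    mean = statistics.mean(counts)
    var = statistics.variance(counts)
    if mean == 0:
        return 0.0
    return max(0.0, (var - mean) / mean ** 2)


# Hypothetical normalized counts for one gene across four replicates.
overdispersed = [10, 20, 30, 40]
constant = [5, 5, 5, 5]
print(moment_dispersion(overdispersed))
print(moment_dispersion(constant))
```

With a handful of replicates this per-gene estimate is very noisy, which is the motivation for the shrinkage and trend-fitting strategies the packages use instead.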
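EBSeq, baySeq and ShrinkSeq use different inferential approaches to estimate, for each gene, the posterior probability of differential expression. baySeq performed well under some conditions, but its results were highly variable, especially when all DE genes were up-regulated or all were down-regulated. In the presence of outliers, EBSeq produced fewer false positives than baySeq for large sample sizes, whereas baySeq produced fewer false positives than EBSeq for small sample sizes.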
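Methods that report posterior probabilities do not produce p-values, so gene lists are cut differently. One generic way to do it, sketched below, is to rank genes by posterior probability of DE and keep the largest top-ranked set whose estimated FDR, taken as the average of (1 − posterior probability) over the selected genes, stays under a target. This is a general illustration and is not claimed to be the exact rule implemented in EBSeq, baySeq or ShrinkSeq; the function name and the probabilities are made up.

```python
def call_by_posterior(post_probs, target_fdr=0.05):
    """Return indices of genes called DE at the given estimated-FDR target.

    Estimated FDR of a gene set = mean of (1 - posterior probability of DE)
    over the genes in the set.
    """
    order = sorted(range(len(post_probs)), key=lambda i: post_probs[i], reverse=True)
    cumulative_error = 0.0
    best = 0
    for k, i in enumerate(order, start=1):
        cumulative_error += 1.0 - post_probs[i]
        if cumulative_error / k <= target_fdr:
            best = k  # the top-k set still meets the target
    return sorted(order[:best])


# Hypothetical posterior probabilities of DE for five genes.
probs = [0.99, 0.98, 0.90, 0.60, 0.10]
print(call_by_posterior(probs))
```

Note that the estimate is only as good as the posterior probabilities themselves: if a method's probabilities are overconfident, the true FDR of the list can be well above the target.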
Original post: http://blog.sina.com.cn/s/blog_3eaf29360101n5lv.html