DNA 甲基化(BS-seq)分析軟體之 HOME

2021-02-21 獅山生信

HOME: a histogram based machine learning approach for effective identification of differentially methylated regions[1]

雖然軟體發的期刊影響因子不高，但是是 Ryan lister 實驗室成員發表的。此實驗室 [2] 發表過不少高水平期刊文章。Genome Research、Nature Communications、Genome Biology、Current Opinion in Plant Biology、elife 等等。其中此文章的一作 Akanksha Srivastava 發表期刊如下：

Eichten SR, Stuart T, Srivastava A, Lister R, Borevitz JO (2016) *DNA methylation profiles of diverse Brachypodium distachyon aligns with underlying genetic diversity. *Genome Research 10.1101/gr.205468.116. [pdf](https://genome.cshlp.org/content/26/11/1520.full.pdf+html[3]), pubmed[4]

Narsai R#, Gouil Q, Secco D, Srivastava A, Karpievitch YV, Liew LC, Lister R, Lewsey MG#, Whelan J (2017) *Extensive transcriptomic and epigenomic remodelling occurs during Arabidopsis thaliana germination. *Genome Biology 10.1186/s13059-017-1302-3[5], pdf[6]

Srivastava A, Karpievitch YV, Eichten SR, Borevitz JO, Lister R (2019) *HOME: A histogram based machine learning approach for effective identification of differentially methylated regions. *BMC Bioinformatics 10.1186/s12859-019-2845-y[7], pdf[8]

不難看出此一作主要是從事分析 DNA 甲基化這一塊工作的，常年與其他人合作（感覺有點慘，分析的大都就是這樣與人合作基本拿不到排名第一的一作）。此工具就是將其之前發表的兩篇文章中的內容整個起來的，包括 16 年 Genome Research 中的 DNA 甲基化的時間序列分析。

把文章看了一遍，我覺得有必要把這一段內容先貼出來 HOME

HOME（histogram of methylation）是用於鑑別差異 DNA 甲基化區域（DMR）鑑定的 python包。該方法使用甲基化特徵的直方圖和線性支持向量機（SVM）來從全基因組亞硫酸氫鹽測序（WGBS）數據中鑑定 DMR。HOME 可以識別成對和時間序列 DMR(有或沒有重複都可以)。

安裝

HOME 是為 python 2.7 編寫的，並在 Linux 系統上進行了測試。建議在安裝 HOME 包之前先為 python 2.7 設置虛擬環境。作者這裡提供了怎麼創建虛擬環境，哈哈，不是通過 conda 而是 **virtualenv[9]**(virtualenv is a tool to create isolated Python environments)

virtualenv -p <path_to_python2.7> <env_name>

source <env_name>/bin/activate

pip install git+https://github.com/ListerLab/HOME.git

or

git clone https://github.com/ListerLab/HOME.git
cd ./HOME
pip install -r requirements.txt
python setup.py install

git clone https://github.com/ListerLab/HOME.git
cd ./HOME
conda env create *assuming the conda environment is activated and R is already installed in it*
source activate HOMEenv
python setup.py install

輸入文件

# 染色體名位點鏈方向 DNA甲基化類型甲基化的C的數目總的胞嘧啶C的數目
# Chromosome number, position, strand, type (CG/CHG/CHH) where H is anything but G, methylated reads and total number of reads.
chr1 15814 + CG 12 14
chr1 15815 - CG 15 21
chr1 15816 - CHG 1 9
chr1 15821 - CHH 7 22
chr1 15823 - CHH 0 2
chr1 15825 - CHH 11 19

需要注意，作者的說明書有點不用心啊。是 sample_file_CG.txt 而不是 csv 結尾，並且裡面作者提供的是如下，我們需要將其改為全路徑。sed 's/\.\//\/public\/home\/qliu\/biosoft\/HOME\//g' sample_file_CG.txt

sample1 ./testcase/CG/sample1_r1.txt    ./testcase/CG/sample1_r2.txt
sample2 ./testcase/CG/sample2_r1.txt    ./testcase/CG/sample2_r2.txt
sample3 ./testcase/CG/sample3_r1.txt    ./testcase/CG/sample3_r2.txt

sample1 /public/home/qliu/biosoft/HOME/testcase/CG/sample1_r1.txt       /public/home/qliu/biosoft/HOME/testcase/CG/sample1_r2.txt
sample2 /public/home/qliu/biosoft/HOME/testcase/CG/sample2_r1.txt       /public/home/qliu/biosoft/HOME/testcase/CG/sample2_r2.txt
sample3 /public/home/qliu/biosoft/HOME/testcase/CG/sample3_r1.txt       /public/home/qliu/biosoft/HOME/testcase/CG/sample3_r2.txt

尋找差異 DMR

HOME-pairwise -t [CG/CHG/CHH/CHN/CNN] -i [sample_file_fullpath] -o [output_directorypath]

運行路徑一定要在HOME文件目錄下即:/public/home/qliu/biosoft/HOME/，否則會報錯 Fatal error: cannot open file './scripts/HOME_R.R': No such file or directory, 不知道是我沒掌握精髓還是這個太。。

HOME-pairwise -t CG -i /public/home/qliu/biosoft/HOME/testcase/sample_file_CG.txt -o ./pairwise_CG_outputpath
# 運行結果
Preparing the DMRs from HOME
GOOD LUCK !
DMRs for sample1_VS_sample2_10 done
DMRs for sample1_VS_sample2_13 done
DMRs for sample1_VS_sample2_12 done
DMRs for sample1_VS_sample3_10 done
DMRs for sample1_VS_sample3_12 done
DMRs for sample1_VS_sample3_13 done
DMRs for sample2_VS_sample3_12 done
DMRs for sample2_VS_sample3_10 done
DMRs for sample2_VS_sample3_13 done

HOME 的默認參數相對寬鬆。要以更嚴格的設置運行 HOME，請將默認參數更改為以下或更高：

HOME-pairwise -t CG -i /public/home/qliu/biosoft/HOME/testcase/sample_file_CG.txt -o ./pairwise_stringent_CG_outputpath --delta 0.2 --minc 5
# 運行結果
Preparing the DMRs from HOME
GOOD LUCK !
DMRs for sample1_VS_sample2_13 done
DMRs for sample1_VS_sample2_10 done
DMRs for sample1_VS_sample2_12 done
DMRs for sample1_VS_sample3_13 done
DMRs for sample1_VS_sample3_10 done
DMRs for sample1_VS_sample3_12 done
DMRs for sample2_VS_sample3_13 done
DMRs for sample2_VS_sample3_10 done
DMRs for sample2_VS_sample3_12 done
Congratulations the DMRs are ready

結果比較，好吧這裡沒區別，應該是示例數據問題。但是這裡很好的是運行過程中是分染色體分別進行運行，加快了運行進程。

# 由於示例文件只提供了三條染色體的甲基化信息所以後面只有三條染色體的結果：
[CG]cut -f 1 sample1_r1.txt |sort |uniq -c
   2998 10
   2999 12
   2999 13

[sample1_VS_sample2]$ wc -l ./*
  3 ./HOME_DMRs_10.txt
  3 ./HOME_DMRs_12.txt
  3 ./HOME_DMRs_13.txt
  9 total

[sample1_VS_sample2]$ wc -l ./*
  3 ./HOME_DMRs_10.txt
  3 ./HOME_DMRs_12.txt
  3 ./HOME_DMRs_13.txt
  9 total

# pairwise_stringent_CG_outputpath：HOME_DMRs_12.txt (END)
chr     start   end     status  numC    mean_Meth1      mean_Meth2      delta   avg_coverage1   avg_coverage2   len
12      122818  123202  hyper   14      0.9421879271112584      0.1152926299222629      0.8268952971889956      60      60      384
12      179697  180225  hyper   9       0.9698766175534637      0.12380942497017339     0.8460671925832903      58      62      528

# HOME_pairwise_DMRs：HOME_DMRs_12.txt (END)
chr     start   end     status  numC    mean_Meth1      mean_Meth2      delta   avg_coverage1   avg_coverage2   len
12      122818  123202  hyper   14      0.9421879271112584      0.1152926299222629      0.8268952971889956      60      60      384
12      179697  180225  hyper   9       0.9698766175534637      0.12380942497017339     0.8460671925832903      58      62      528

還可以進行時間序列 DNA 甲基化分析

這裡，Max_delta 是比較樣品中的最大平均甲基化差異。置信度得分考慮了 C的長度，數量和 SVM得分。值越高表示 DMR越自信，也就是可信度越高。Comb1-n 表示每種樣品組合的成對比較。它報告 start：end：state：每個成對比較的 delta。

[HOME]$ HOME-timeseries  -t CG -i /public/home/qliu/biosoft/HOME/testcase/sample_file_CG.txt -o ./time_CG_outputpath
Preparing the DMRs from HOME
GOOD LUCK !
DMRs for 13 done
DMRs for 10 done
DMRs for 12 done
Congratulations the DMRs are ready

# 可以看到此文件中包含了兩兩相互比較的三種情況得到的結果。
# Timeseries_HOME_DMR_10.txt (END)
chr     start   end     numC    len     max_delta       confidence_scores       sample1_VS_sample2      sample1_VS_sample3      sample2_VS_sample3
10      122795  123264  17      469     0.6694115959991053      0.3984762292738861      122795:123264:hypo:0.669        122795:123264:hypo:0.353        122795:123264:hyper:0.318
10      179697  180278  11      581     0.7046148506166738      0.3052977823501635      179697:180278:hypo:0.705        179697:180278:hypo:0.348        179697:180278:hyper:0.36

參數翻譯 - -

HOME-pairwise

參數默認功能-sc --scorecutoff0.1每個胞嘧啶 C 位點的分類打分-p --pruncutoff0.1從兩端檢查連續 Cs 的 SVM 分數以細化邊界-npp -–numprocess8運行的線程數-ml --minlength50輸出的 DMR 的最小長度-ncb --numcb5輸出的 DMR 分離時所需要包含的少胞嘧啶 C 的數目-md -–mergedist500如果兩個 DMR 之間的舉例小於 500 就合併為一個 DMR-prn --prunningC3number of consecutives Cs to be considered for pruning for boundary refinement2（翻譯拿不準）-ns --numsamplesall被用來計算 DMR 的樣本，默認是所有-sp --startposition1st position在進行 timeseries DMR 計算時候放置為第一個的樣本開始位置-BSSeeker2 --BSSeeker2Falseinput CGmap file from BSSeeker2-mc --minc3一個 DMR 中最少包含的胞嘧啶 C 的數目-sin --singlechromFalse單染色體的並行計算；npp 將用於每條染色體的並行運行-d --delta0.1得到 DMR 的最小要求的平均 DNA 甲基化差異-wrt --withrespecttoall用於 DMR 的樣品要求與特定樣品成對比較-Keepall --KeepallFalse保持所有胞嘧啶位置存在於至少一個重複中看到BSSeeker2我是不咋喜歡的，要不是參數說明太詳細了，畢竟我是bismark死忠粉。

HOME-timeseries

參數默認功能-sc --scorecutoff0.5每個胞嘧啶 C 位點的分類打分-npp -–numprocess5運行的線程數-ml --minlength50輸出的 DMR 的最小長度-ns --numsamplesall被用來計算 DMR 的樣本，默認是所有-sp --startposition1st position在進行 timeseries DMR 計算時候放置為第一個的樣本開始位置-BSSeeker2 --BSSeeker2Falseinput CGmap file from BSSeeker2-mc --minc4一個 DMR 中最少包含的胞嘧啶 C 的數目-d --delta0.1得到 DMR 的最小要求的平均 DNA 甲基化差異-sin --singlechromFalse單染色體的並行計算；npp 將用於每條染色體的並行運行-Keepall --KeepallFalse保持所有胞嘧啶位置存在於至少一個重複中從參數來看一般默認即可。參考資料[1]

HOME: a histogram based machine learning approach for effective identification of differentially methylated regions: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2845-y

[2]

此實驗室: http://listerlab.org/publications.html

[3]

pdf: https://genome.cshlp.org/content/26/11/1520.full.pdf+html

[4]

pubmed: http://www.ncbi.nlm.nih.gov/pubmed/?term=DNA+methylation+profiles+of+diverse+Brachypodium+distachyon+aligns+with+underlying+genetic+diversity

[5]

10.1186/s13059-017-1302-3: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1302-3

[6]

pdf: https://genomebiology.biomedcentral.com/track/pdf/10.1186/s13059-017-1302-3?site=genomebiology.biomedcentral.com

[7]

10.1186/s12859-019-2845-y: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2845-y

[8]

pdf: https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-019-2845-y

[9]

virtualenv: https://virtualenv.pypa.io/en/latest/

DNA 甲基化(BS-seq)分析軟體之 HOME

相關焦點

DNA及RNA甲基化數據分析、課題設計與基金寫作學習班學習班 2020年...

DNA甲基化研究方法(測序法)

12月在線資料庫匯總|資料庫|甲基化|DNA|標誌物|預測|分析|-健康界

基因研究新突破:科學家揭開DNA甲基化之謎

中藥調控DNA甲基化修飾的研究思路

Nature:基於DNA甲基化的早期癌症檢測

DNA甲基化——甲基化DNA PCR & NGS 分析試劑盒

【表觀遺傳學基礎】BS技術生物信息分析流程及原理

朱冰課題組揭示DNA甲基化維持的動態過程和調控等機制

6mA甲基化-DNA甲基化研究新熱點

生物物理所揭示DNA甲基化維持的動態過程和調控等機制

科學網—新單細胞DNA甲基化測序技術可應用於不同物種

m6A(RNA甲基化修飾)課題思路介紹及熱點方向分析

北京基因組所發現RNA m6A選擇性甲基化機制

開發出cfMeDIP–seq技術,最低僅需1ng血漿cfDNA就可...

10種DNA甲基化測序技術,你知道幾個?

生物物理所揭示DNA甲基化維持的動態過程和調控機制及其對衰老和...

Alpha助力DNA甲基化表型調控新發現

基於納米孔測序開發同步檢測CpG甲基化和染色質可及性的新方法

RNA m6A選擇性甲基化機制獲進展