systemPipeR : 什麼?在 R 裡面就能做全套的 ChIP-seq ?

2021-03-02 高通量測序技術

R包

systemPipeR 1.19.0

內容目錄（我自己加的）

背景：什麼是 ChIP-seq1 簡單介紹1.1 背景和目標1.2 實驗設計2 工作環境2.1 生成工作流程環境2.2 運行工作流程2.2.1 在計算節點運行 R3 Read 預處理3.1 targets 文件提供的實驗定義3.2 Read 質量篩選與修剪3.3 FASTQ 質量報告4 比對4.1 使用 Bowtie2 進行 Reads 的比對4.2 Reads 和比對信息統計4.3 創建用於在 IGV 中查看 BAM 文件的連結5 coverage 數據的實用程序( Utilities )5.1 Rle 屬性中儲存的 coverage 信息5.2 調整已比對上的 reads5.3 Naive Peak calling5.4 繪製定義區間的 coverage 信息6 使用 MACS2 進行 Peak calling6.1 在進行 Peak calling 之前將每個樣本的重複文件 BAM 合併成一個 BAM 文件。6.2 沒有 input/reference 的 Peak calling6.3 有 input/reference 的 Peak calling6.4 鑑定共同的 Peaks7 用基因組背景注 Peaks7.1 使用 ChIPpeakAnno 包進行注釋7.2 使用 ChIPseeker 包來注釋8 交集的 Peaks 的 reads 計數9 差異結合分析10 GO富集分析11 Motif 分析11.1 從基因組中獲得 Peaks 區間的DNA序列11.2 使用 BCRANK 發現 Motif12 版本信息13 基金參考文獻

背景：什麼是 ChIP-seq

根據維基百科解釋, ChIP-seq 全稱為 Chromatin immunoprecipitation followed by sequencing , 中譯為染色質免疫沉澱-測序，是用來分析蛋白質與DNA的交互作用。該技術將染色質免疫沉澱（ChIP）與大規模的 DNA 測序結合起來以鑑定在全基因組上與 DNA 相關蛋白的結合部位。可以被用於精確繪製任意蛋白（當然有的蛋白的 ChIP 實驗由於抗體或者實驗材料原因並不好做）在全基因組上的結合位點。

我們知道了 ChIP-seq 是什麼以及用來做什麼了，那麼第一步的 ChIP-seq 實驗到建庫是怎樣的？

ChIP-seq 實驗包括染色質免疫沉澱實驗和高通量測序兩個部分。

首先將染色質結合的蛋白質與DNA通過甲醛交聯，從而可以得到蛋白質 - DNA 複合物。隨後，通過超聲波將其隨機打斷成一定長度範圍內的染色質小片段（一般為 200bp 左右）。然後通過免疫沉澱出目標蛋白質 - DNA 複合體，從而特異性地富集與目標蛋白結合的 DNA 片段。通過對目標蛋白的純化與檢測，最後進行建庫測序。

注意：ChIP-seq 也可以在沒有交聯的情況下進行，這通常被稱為 native ChIP。在這種情況下，染色質被微球菌核酸酶（MNase）消化以達到所需的DNA片段大小。完全消化後得到147bp的片段，相當於單個核小體周圍的DNA長度。native ChIP 通常用於組蛋白修飾，但當交聯將掩蓋抗體識別的表位時候也用於其他因素。然而，對於與染色質鬆散結合的因子，不建議使用 native ChIP ，因為它們與DNA的接觸可能在流程過程中丟失。

一張圖說明

A左：中文版 A右: ChIP-seq 的實驗原理圖 B：ChIP-seq數據分析流程1 簡單介紹 1.1 背景和目標1.2 實驗設計2 工作環境 2.1 生成工作流程環境

1library(systemPipeRdata)
2genWorkenvir(workflow = "chipseq")
3setwd("chipseq")

1Rscript -e "systemPipeRdata::genWorkenvir(workflow='chipseq')"

在 genWorkenvir 函數生成的工作流環境中，所有數據輸入都存儲在 data/ 目錄中，所有分析結果將寫入單獨的 result/ 目錄，而 systemPipeChIPseq.Rmd 腳本和 targets 文件應位於父目錄中。R 從此父目錄運行。其他參數文件存儲在 param / 下。

為了處理真實數據，用戶希望類似地組織自己的數據，並將所有測試數據替換為自己的數據。要在新數據上重新運行已建立的工作流，初始 targets 文件以及相應的 FASTQ 文件通常是用戶需要提供的唯一輸入。

2.2 運行工作流程2.2.1 在計算節點運行 R

1q("no")

1srun --x11 --partition=short --mem=2gb --cpus-per-task 4 --ntasks 1 --time 2:00:00 --pty bash -l
2module load R/3.4.2
3R

1system("hostname")
2getwd()
3dir()

1library(systemPipeR)

1source("systemPipeChIPseq_Fct.R")

3 Read 預處理 3.1 targets 文件提供的實驗定義

1targetspath <- system.file("extdata", "targets_chip.txt", package = "systemPipeR")
2targets <- read.delim(targetspath, comment.char = "#")
3targets[1:4, -c(5, 6)]

1                  FileName SampleName Factor SampleLong SampleReference
21 ./data/SRR446027_1.fastq        M1A     M1  Mock.1h.A
32 ./data/SRR446028_1.fastq        M1B     M1  Mock.1h.B
43 ./data/SRR446029_1.fastq        A1A     A1   Avr.1h.A             M1A
54 ./data/SRR446030_1.fastq        A1B     A1   Avr.1h.B             M1B

3.2 Read 質量篩選與修剪

1args <- systemArgs(sysma = "param/trim.param", mytargets = "targets_chip.txt")
2filterFct <- function(fq, cutoff = 20, Nexceptions = 0) {
3    qcount <- rowSums(as(quality(fq), "matrix") <= cutoff, na.rm = TRUE)
4    fq[qcount <= Nexceptions]
5
6
7}
8preprocessReads(args = args, Fct = "filterFct(fq, cutoff=20, Nexceptions=0)",
9    batchsize = 1e+05)
10writeTargetsout(x = args, file = "targets_chip_trim.txt", overwrite = TRUE)

3.3 FASTQ 質量報告

1args <- systemArgs(sysma = "param/tophat.param", mytargets = "targets_chip.txt")
2library(BiocParallel)
3library(batchtools)
4f <- function(x) {
5    library(systemPipeR)
6    args <- systemArgs(sysma = "param/tophat.param", mytargets = "targets_chip.txt")
7    seeFastq(fastq = infile1(args)[x], batchsize = 1e+05, klength = 8)
8}
9moduleload(modules(args))
10resources <- list(walltime = 120, ntasks = 1, ncpus = cores(args),
11    memory = 1024)
12param <- BatchtoolsParam(workers = 4, cluster = "slurm", template = "batchtools.slurm.tmpl",
13    resources = resources)
14fqlist <- bplapply(seq(along = args), f, BPPARAM = param)
15
16pdf("./results/fastqReport.pdf", height = 18, width = 4 * length(fqlist))
17seeFastqPlot(unlist(fqlist, recursive = FALSE))
18dev.off()

Figure 1：18 個樣本的 FASTQ 質量報告4 比對 4.1 使用 Bowtie2 進行 Reads 的比對

1rgs <- systemArgs(sysma = "param/bowtieSE.param", mytargets = "targets_chip_trim.txt")
2
3sysargs(args)[1]  # 第一個 FASTQ 文件的命令行參數
4
5moduleload(modules(args))  # 如果不需要 module 則可以跳過
6
7system("bowtie2-build ./data/tair10.fasta ./data/tair10.fasta")
8# 對基因組文件建立索引
9
10resources <- list(walltime = 120, ntasks = 1, ncpus = cores(args),
11    memory = 1024)
12
13reg <- clusterRun(args, conffile = ".batchtools.conf.R", Njobs = 18,
14    template = "batchtools.slurm.tmpl", runid = "01", resourceList = resources)
15
16getStatus(reg = reg)
17
18waitForJobs(reg = reg)
19
20writeTargetsout(x = args, file = "targets_bam.txt", overwrite = TRUE)
21

1runCommandline(args)

1file.exists(outpaths(args))

4.2 Reads 和比對信息統計

1read_statsDF <- alignStats(args = args)
2
3write.table(read_statsDF, "results/alignStats.xls", row.names = FALSE,
4 quote = FALSE, sep = "\t")
5
6read.delim("results/alignStats.xls")

4.3 創建用於在 IGV 中查看 BAM 文件的連結

1symLink2bam(sysargs = args, htmldir = c("~/.html/", "somedir/"),
2 urlbase = "http://biocluster.ucr.edu/~tgirke/", urlfile = "./results/IGVurl.txt")

5 coverage 數據的實用程序( Utilities ) 5.1 Rle 屬性中儲存的 coverage 信息

1library(rtracklayer)
2library(GenomicRanges)
3library(Rsamtools)
4library(GenomicAlignments)
5
6aligns <- readGAlignments(outpaths(args)[1])
7cov <- coverage(aligns)
8cov

5.2 調整已比對上的 reads

1trim(resize(as(aligns, "GRanges"), width = 200))

5.3 Naive Peak calling

1islands <- slice(cov, lower = 15)
2islands[[1]]

5.4 繪製定義區間的 coverage 信息

1library(ggbio)
2myloc <- c("Chr1", 1, 1e+05)
3ga <- readGAlignments(outpaths(args)[1], use.names = TRUE, param = ScanBamParam(which = GRanges(myloc[1],
4 IRanges(as.numeric(myloc[2]), as.numeric(myloc[3])))))
5autoplot(ga, aes(color = strand, fill = strand), facets = strand ~
6 seqnames, stat = "coverage")
7

6 使用 MACS2 進行 Peak calling 6.1 在進行 Peak calling 之前將每個樣本的重複文件 BAM 合併成一個 BAM 文件。

1args <- systemArgs(sysma = NULL, mytargets = "targets_bam.txt")
2
3args_merge <- mergeBamByFactor(args, overwrite = TRUE)
4
5writeTargetsout(x = args_merge, file = "targets_mergeBamByFactor.txt",
6 overwrite = TRUE)
7

6.2 沒有 input/reference 的 Peak calling

1args <- systemArgs(sysma = "param/macs2_noinput.param", mytargets = "targets_mergeBamByFactor.txt")
2
3sysargs(args)[1] # 第一個 FASTQ 文件的命令行參數
4
5runCommandline(args)
6file.exists(outpaths(args))
7
8writeTargetsout(x = args, file = "targets_macs.txt", overwrite = TRUE)
9

6.3 有 input/reference 的 Peak calling

1writeTargetsRef(infile = "targets_mergeBamByFactor.txt", outfile = "targets_bam_ref.txt",
2    silent = FALSE, overwrite = TRUE)
3
4args_input <- systemArgs(sysma = "param/macs2.param", mytargets = "targets_bam_ref.txt")
5
6sysargs(args_input)[1]  # 第一個 FASTQ 文件的命令行參數
7
8runCommandline(args_input)
9file.exists(outpaths(args_input))
10
11writeTargetsout(x = args_input, file = "targets_macs_input.txt",
12    overwrite = TRUE)

6.4 鑑定共同的 Peaks

1source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/rangeoverlapper.R")
2
3peak_M1A <- outpaths(args)["M1A"]
4
5peak_M1A <- as(read.delim(peak_M1A, comment = "#")[, 1:3], "GRanges")
6
7peak_A1A <- outpaths(args)["A1A"]
8
9peak_A1A <- as(read.delim(peak_A1A, comment = "#")[, 1:3], "GRanges")
10
11(myol1 <- subsetByOverlaps(peak_M1A, peak_A1A, minoverlap = 1))
12
13
14myol2 <- olRanges(query = peak_M1A, subject = peak_A1A, output = "gr")
15
16
17myol2[values(myol2)["OLpercQ"][, 1] >= 50]
18

7 用基因組背景注 Peaks 7.1 使用 ChIPpeakAnno 包進行注釋

1library(ChIPpeakAnno)
2library(GenomicFeatures)
3
4args <- systemArgs(sysma = "param/annotate_peaks.param", mytargets = "targets_macs.txt")
5
6
7txdb <- makeTxDbFromGFF(file = "data/tair10.gff", format = "gff",
8    dataSource = "TAIR", organism = "Arabidopsis thaliana")
9
10ge <- genes(txdb, columns = c("tx_name", "gene_id", "tx_type"))
11
12for (i in seq(along = args)) {
13    peaksGR <- as(read.delim(infile1(args)[i], comment = "#"),
14        "GRanges")
15    annotatedPeak <- annotatePeakInBatch(peaksGR, AnnotationData = genes(txdb))
16    df <- data.frame(as.data.frame(annotatedPeak), as.data.frame(values(ge[values(annotatedPeak)$feature,
17        ])))
18    write.table(df, outpaths(args[i]), quote = FALSE, row.names = FALSE,
19        sep = "\t")
20}
21
22writeTargetsout(x = args, file = "targets_peakanno.txt", overwrite = TRUE)

7.2 使用 ChIPseeker 包來注釋

1library(ChIPseeker)
2
3for (i in seq(along = args)) {
4    peakAnno <- annotatePeak(infile1(args)[i], TxDb = txdb, verbose = FALSE)
5    df <- as.data.frame(peakAnno)
6    write.table(df, outpaths(args[i]), quote = FALSE, row.names = FALSE,
7        sep = "\t")
8}
9
10writeTargetsout(x = args, file = "targets_peakanno.txt", overwrite = TRUE)

1peak <- readPeakFile(infile1(args)[1])
2
3covplot(peak, weightCol = "X.log10.pvalue.")
4peakHeatmap(outpaths(args)[1], TxDb = txdb, upstream = 1000,
5 downstream = 1000, color = "red")
6
7plotAvgProf2(outpaths(args)[1], TxDb = txdb, upstream = 1000,
8 downstream = 1000, xlab = "Genomic Region (5'->3')", ylab = "Read Count Frequency")
9

8 交集的 Peaks 的 reads 計數

1library(GenomicRanges)
2
3args <- systemArgs(sysma = "param/count_rangesets.param", mytargets = "targets_macs.txt")
4
5args_bam <- systemArgs(sysma = NULL, mytargets = "targets_bam.txt")
6
7bfl <- BamFileList(outpaths(args_bam), yieldSize = 50000, index = character())
8
9countDFnames <- countRangeset(bfl, args, mode = "Union", ignore.strand = TRUE)
10
11writeTargetsout(x = args, file = "targets_countDF.txt", overwrite = TRUE)
12

9 差異結合分析

runDiff 函數使用 edgeR 或 DESeq2 以批處理模式對多個計數表執行差異結合分析（ Robinson，McCarthy, Smyth 2010; Love，Huber, Anders 2014）。在內部，它調用函數 run_edgeR 和 run_DESeq2 。它還使用 dbrfilter 參數下提供的 Fold change 和 FDR 閾值 cutoff 返回下遊 filterDEGs 函數的過濾結果和圖。

1args_diff <- systemArgs(sysma = "param/rundiff.param", mytargets = "targets_countDF.txt")
2
3cmp <- readComp(file = args_bam, format = "matrix")
4
5dbrlist <- runDiff(args = args_diff, diffFct = run_edgeR, targets = targetsin(args_bam),
6    cmp = cmp[[1]], independent = TRUE, dbrfilter = c(Fold = 2,
7        FDR = 1))
8
9writeTargetsout(x = args_diff, file = "targets_rundiff.txt",
10    overwrite = TRUE)
11

10 GO富集分析

1args <- systemArgs(sysma = "param/macs2.param", mytargets = "targets_bam_ref.txt")
2
3args_anno <- systemArgs(sysma = "param/annotate_peaks.param",
4    mytargets = "targets_macs.txt")
5
6annofiles <- outpaths(args_anno)
7
8gene_ids <- sapply(names(annofiles), function(x) unique(as.character(read.delim(annofiles[x])[,
9    "gene_id"])), simplify = FALSE)
10
11load("data/GO/catdb.RData")
12
13BatchResult <- GOCluster_Report(catdb = catdb, setlist = gene_ids,
14    method = "all", id_type = "gene", CLSZ = 2, cutoff = 0.9,
15    gocats = c("MF", "BP", "CC"), recordSpecGO = NULL)
16

11 Motif 分析 11.1 從基因組中獲得 Peaks 區間的DNA序列

1library(Biostrings)
2library(seqLogo)
3library(BCRANK)
4
5args <- systemArgs(sysma = "param/annotate_peaks.param", mytargets = "targets_macs.txt")
6
7rangefiles <- infile1(args)
8
9for (i in seq(along = rangefiles)) {
10    df <- read.delim(rangefiles[i], comment = "#")
11    peaks <- as(df, "GRanges")
12
13    names(peaks) <- paste0(as.character(seqnames(peaks)), "_",
14        start(peaks), "-", end(peaks))
15
16    peaks <- peaks[order(values(peaks)$X.log10.pvalue., decreasing = TRUE)]
17
18    pseq <- getSeq(FaFile("./data/tair10.fasta"), peaks)
19    names(pseq) <- names(peaks)
20    writeXStringSet(pseq, paste0(rangefiles[i], ".fasta"))
21}

11.2 使用 BCRANK 發現 Motif

1set.seed(0)
2
3BCRANKout <- bcrank(paste0(rangefiles[1], ".fasta"), restarts = 25,
4 use.P1 = TRUE, use.P2 = TRUE)
5
6toptable(BCRANKout)
7
8topMotif <- toptable(BCRANKout, 1)
9
10weightMatrix <- pwm(topMotif, normalize = FALSE)
11
12weightMatrixNormalized <- pwm(topMotif, normalize = TRUE)
13
14pdf("results/seqlogo.pdf")
15seqLogo(weightMatrixNormalized)
16dev.off()

Figure 2: 被 BCRANK 鑑定的其中一個 motif12 版本信息

1sessionInfo()

1#
2#
3#
4#
5#
6#
7#
8#
9#
10#
11#
12#
13#
14#
15#
16#
17#
18#
19#
20#
21#
22#
23#
24#
25#
26#
27#
28#
29#
30#
31#
32#
33#
34#
35#
36#
37#
38#
39#
40#
41#
42#
43#
44#
45#
46#
47#
48#
49#
50#
51#
52#
53#
54#
55#
56#
57#
58#
59#
60#
61#
62#
63#
64#
65#
66#
67#
68#
69#
70#
71#
72#
73#
74#
75#
76#

13 基金參考文獻

梁芳,徐柯,龔朝建,李俏,馬健,熊煒,曾朝陽,李桂源.染色質免疫沉澱-測序:全基因組範圍研究蛋白質-DNA相互作用的新技術[J].生物化學與生物物理進展,2013,40(03):216-227. http://kns.cnki.net//KXReader/Detail?TIMESTAMP=636956914509048750&DBCODE=CJFQ&TABLEName=CJFD2013&FileName=SHSW201303004&RESULT=1&SIGN=2RGYhJSTVwsvrs6a%2fR6iTqtx38A%3d

Borbala Mifsud, Kathi Zarnack, Anaïs F Bardet. 2019. 「Practical Guide to ChIP-seq Data Analysis.」 International Standard Book Number-13: 978-1-138-59652-8. https://www.taylorfrancis.com/books/e/9780429487590

H Backman, Tyler W, and Thomas Girke. 2016. 「systemPipeR: NGS workflow and report generation environment.」 BMC Bioinformatics 17 (1):388. https://doi.org/10.1186/s12859-016-1241-0.

Langmead, Ben, and Steven L Salzberg. 2012. 「Fast Gapped-Read Alignment with Bowtie 2.」 Nat. Methods 9 (4). Nature Publishing Group:357–59. https://doi.org/10.1038/nmeth.1923.

Love, Michael, Wolfgang Huber, and Simon Anders. 2014. 「Moderated Estimation of Fold Change and Dispersion for RNA-seq Data with DESeq2.」 Genome Biol. 15 (12):550. https://doi.org/10.1186/s13059-014-0550-8.

Robinson, M D, D J McCarthy, and G K Smyth. 2010. 「EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.」 Bioinformatics 26 (1):139–40. https://doi.org/10.1093/bioinformatics/btp616.

Yu, Guangchuang, Li-Gen Wang, and Qing-Yu He. 2015. 「ChIPseeker: An R/Bioconductor Package for ChIP Peak Annotation, Comparison and Visualization.」 Bioinformatics 31 (14):2382–3. https://doi.org/10.1093/bioinformatics/btv145.

Zhang, Y, T Liu, C A Meyer, J Eeckhoute, D S Johnson, B E Bernstein, C Nussbaum, et al. 2008. 「Model-Based Analysis of ChIP-Seq (MACS).」 Genome Biol. 9 (9). https://doi.org/10.1186/gb-2008-9-9-r137.

Zhu, Lihua J, Claude Gazin, Nathan D Lawson, Hervé Pagès, Simon M Lin, David S Lapointe, and Michael R Green. 2010. 「ChIPpeakAnno: A Bioconductor Package to Annotate ChIP-seq and ChIP-chip Data.」 BMC Bioinformatics 11 (11~may):237. https://doi.org/10.1186/1471-2105-11-237.

systemPipeR : 什麼?在 R 裡面就能做全套的 ChIP-seq ?

相關焦點

一篇文章學會ChIP-seq分析(上)

ChIP-Seq數據中包含了spike-in怎麼分析

一個ChIP-seq實戰

ChIP-seq經典文章(上)植物篇

Chip-seq分析:HOMER尋找motif

你知道a chip off the old block是什麼意思嗎?

【The Plant Cell 】玉米轉錄因子的RNA-seq和CHIP-seq聯合分析

R包ggseqlogo 繪製seq logo圖

英語chip in 的用法

九月學徒ChIP-seq學習成果展(6萬字總結)(下篇)

你知道chip off是什麼意思嗎?

Abstractive Text Summarization using Seq2Seq RNNs and Beyond

chip用法,耿耿於懷用英語怎麼說?a chip on your shoulder

RNA-seq從入門到自閉(Kallisto和Salmon)

ATAC-Seq Motif 富集分析

...engineers develop tiny, sound-powered chip to serve as...

人類血液樣本RNA-seq研究現狀

seq2seq 的 keras 實現

Words and their stories: A chip on your shoulder

ATAC-Seq 分析流程