systemPipeR 1.19.0
背景:什麼是 ChIP-seq1 簡單介紹1.1 背景和目標1.2 實驗設計2 工作環境2.1 生成工作流程環境2.2 運行工作流程2.2.1 在計算節點運行 R3 Read 預處理3.1 targets 文件提供的實驗定義3.2 Read 質量篩選與修剪3.3 FASTQ 質量報告4 比對4.1 使用 Bowtie2 進行 Reads 的比對4.2 Reads 和比對信息統計4.3 創建用於在 IGV 中查看 BAM 文件的連結5 coverage 數據的實用程序( Utilities )5.1 Rle 屬性中儲存的 coverage 信息5.2 調整已比對上的 reads5.3 Naive Peak calling5.4 繪製定義區間的 coverage 信息6 使用 MACS2 進行 Peak calling6.1 在進行 Peak calling 之前將每個樣本的重複文件 BAM 合併成一個 BAM 文件。6.2 沒有 input/reference 的 Peak calling6.3 有 input/reference 的 Peak calling6.4 鑑定共同的 Peaks7 用基因組背景注 Peaks7.1 使用 ChIPpeakAnno 包進行注釋7.2 使用 ChIPseeker 包來注釋8 交集的 Peaks 的 reads 計數9 差異結合分析10 GO富集分析11 Motif 分析11.1 從基因組中獲得 Peaks 區間的DNA序列11.2 使用 BCRANK 發現 Motif12 版本信息13 基金參考文獻
背景:什麼是 ChIP-seq根據維基百科解釋, ChIP-seq 全稱為 Chromatin immunoprecipitation followed by sequencing , 中譯為染色質免疫沉澱-測序,是用來分析蛋白質與DNA的交互作用。該技術將染色質免疫沉澱(ChIP)與大規模的 DNA 測序結合起來以鑑定在全基因組上與 DNA 相關蛋白的結合部位。可以被用於精確繪製任意蛋白(當然有的蛋白的 ChIP 實驗由於抗體或者實驗材料原因並不好做)在全基因組上的結合位點。
我們知道了 ChIP-seq 是什麼以及用來做什麼了,那麼第一步的 ChIP-seq 實驗到建庫是怎樣的?
ChIP-seq 實驗包括染色質免疫沉澱實驗和高通量測序兩個部分。
首先將染色質結合的蛋白質與DNA通過甲醛交聯,從而可以得到蛋白質 - DNA 複合物。隨後,通過超聲波將其隨機打斷成一定長度範圍內的染色質小片段(一般為 200bp 左右)。然後通過免疫沉澱出目標蛋白質 - DNA 複合體,從而特異性地富集與目標蛋白結合的 DNA 片段。通過對目標蛋白的純化與檢測,最後進行建庫測序。
注意:ChIP-seq 也可以在沒有交聯的情況下進行,這通常被稱為 native ChIP。在這種情況下,染色質被微球菌核酸酶 (MNase)消化以達到所需的DNA片段大小。完全消化後得到147bp的片段,相當於單個核小體周圍的DNA長度。native ChIP 通常用於組蛋白修飾,但當交聯將掩蓋抗體識別的表位時候也用於其他因素。然而,對於與染色質鬆散結合的因子,不建議使用 native ChIP ,因為它們與DNA的接觸可能在流程過程中丟失。
一張圖說明
1library(systemPipeRdata)
2genWorkenvir(workflow = "chipseq")
3setwd("chipseq")
1Rscript -e "systemPipeRdata::genWorkenvir(workflow='chipseq')"
在 genWorkenvir 函數生成的工作流環境中,所有數據輸入都存儲在 data/ 目錄中,所有分析結果將寫入單獨的 result/ 目錄,而 systemPipeChIPseq.Rmd 腳本和 targets 文件應位於父目錄中。R 從此父目錄運行。其他參數文件存儲在 param / 下。
為了處理真實數據,用戶希望類似地組織自己的數據,並將所有測試數據替換為自己的數據。要在新數據上重新運行已建立的工作流,初始 targets 文件以及相應的 FASTQ 文件通常是用戶需要提供的唯一輸入。
2.2 運行工作流程2.2.1 在計算節點運行 R1q("no")
1srun --x11 --partition=short --mem=2gb --cpus-per-task 4 --ntasks 1 --time 2:00:00 --pty bash -l
2module load R/3.4.2
3R
1system("hostname")
2getwd()
3dir()
1library(systemPipeR)
1source("systemPipeChIPseq_Fct.R")
1targetspath <- system.file("extdata", "targets_chip.txt", package = "systemPipeR")
2targets <- read.delim(targetspath, comment.char = "#")
3targets[1:4, -c(5, 6)]
1 FileName SampleName Factor SampleLong SampleReference
21 ./data/SRR446027_1.fastq M1A M1 Mock.1h.A
32 ./data/SRR446028_1.fastq M1B M1 Mock.1h.B
43 ./data/SRR446029_1.fastq A1A A1 Avr.1h.A M1A
54 ./data/SRR446030_1.fastq A1B A1 Avr.1h.B M1B
1args <- systemArgs(sysma = "param/trim.param", mytargets = "targets_chip.txt")
2filterFct <- function(fq, cutoff = 20, Nexceptions = 0) {
3 qcount <- rowSums(as(quality(fq), "matrix") <= cutoff, na.rm = TRUE)
4 fq[qcount <= Nexceptions]
5
6
7}
8preprocessReads(args = args, Fct = "filterFct(fq, cutoff=20, Nexceptions=0)",
9 batchsize = 1e+05)
10writeTargetsout(x = args, file = "targets_chip_trim.txt", overwrite = TRUE)
1args <- systemArgs(sysma = "param/tophat.param", mytargets = "targets_chip.txt")
2library(BiocParallel)
3library(batchtools)
4f <- function(x) {
5 library(systemPipeR)
6 args <- systemArgs(sysma = "param/tophat.param", mytargets = "targets_chip.txt")
7 seeFastq(fastq = infile1(args)[x], batchsize = 1e+05, klength = 8)
8}
9moduleload(modules(args))
10resources <- list(walltime = 120, ntasks = 1, ncpus = cores(args),
11 memory = 1024)
12param <- BatchtoolsParam(workers = 4, cluster = "slurm", template = "batchtools.slurm.tmpl",
13 resources = resources)
14fqlist <- bplapply(seq(along = args), f, BPPARAM = param)
15
16pdf("./results/fastqReport.pdf", height = 18, width = 4 * length(fqlist))
17seeFastqPlot(unlist(fqlist, recursive = FALSE))
18dev.off()
1rgs <- systemArgs(sysma = "param/bowtieSE.param", mytargets = "targets_chip_trim.txt")
2
3sysargs(args)[1] # 第一個 FASTQ 文件的命令行參數
4
5moduleload(modules(args)) # 如果不需要 module 則可以跳過
6
7system("bowtie2-build ./data/tair10.fasta ./data/tair10.fasta")
8# 對基因組文件建立索引
9
10resources <- list(walltime = 120, ntasks = 1, ncpus = cores(args),
11 memory = 1024)
12
13reg <- clusterRun(args, conffile = ".batchtools.conf.R", Njobs = 18,
14 template = "batchtools.slurm.tmpl", runid = "01", resourceList = resources)
15
16getStatus(reg = reg)
17
18waitForJobs(reg = reg)
19
20writeTargetsout(x = args, file = "targets_bam.txt", overwrite = TRUE)
21
1runCommandline(args)
1file.exists(outpaths(args))
1read_statsDF <- alignStats(args = args)
2
3write.table(read_statsDF, "results/alignStats.xls", row.names = FALSE,
4 quote = FALSE, sep = "\t")
5
6read.delim("results/alignStats.xls")
1symLink2bam(sysargs = args, htmldir = c("~/.html/", "somedir/"),
2 urlbase = "http://biocluster.ucr.edu/~tgirke/", urlfile = "./results/IGVurl.txt")
1library(rtracklayer)
2library(GenomicRanges)
3library(Rsamtools)
4library(GenomicAlignments)
5
6aligns <- readGAlignments(outpaths(args)[1])
7cov <- coverage(aligns)
8cov
1trim(resize(as(aligns, "GRanges"), width = 200))
1islands <- slice(cov, lower = 15)
2islands[[1]]
1library(ggbio)
2myloc <- c("Chr1", 1, 1e+05)
3ga <- readGAlignments(outpaths(args)[1], use.names = TRUE, param = ScanBamParam(which = GRanges(myloc[1],
4 IRanges(as.numeric(myloc[2]), as.numeric(myloc[3])))))
5autoplot(ga, aes(color = strand, fill = strand), facets = strand ~
6 seqnames, stat = "coverage")
7
1args <- systemArgs(sysma = NULL, mytargets = "targets_bam.txt")
2
3args_merge <- mergeBamByFactor(args, overwrite = TRUE)
4
5writeTargetsout(x = args_merge, file = "targets_mergeBamByFactor.txt",
6 overwrite = TRUE)
7
1args <- systemArgs(sysma = "param/macs2_noinput.param", mytargets = "targets_mergeBamByFactor.txt")
2
3sysargs(args)[1] # 第一個 FASTQ 文件的命令行參數
4
5runCommandline(args)
6file.exists(outpaths(args))
7
8writeTargetsout(x = args, file = "targets_macs.txt", overwrite = TRUE)
9
1writeTargetsRef(infile = "targets_mergeBamByFactor.txt", outfile = "targets_bam_ref.txt",
2 silent = FALSE, overwrite = TRUE)
3
4args_input <- systemArgs(sysma = "param/macs2.param", mytargets = "targets_bam_ref.txt")
5
6sysargs(args_input)[1] # 第一個 FASTQ 文件的命令行參數
7
8runCommandline(args_input)
9file.exists(outpaths(args_input))
10
11writeTargetsout(x = args_input, file = "targets_macs_input.txt",
12 overwrite = TRUE)
1source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/rangeoverlapper.R")
2
3peak_M1A <- outpaths(args)["M1A"]
4
5peak_M1A <- as(read.delim(peak_M1A, comment = "#")[, 1:3], "GRanges")
6
7peak_A1A <- outpaths(args)["A1A"]
8
9peak_A1A <- as(read.delim(peak_A1A, comment = "#")[, 1:3], "GRanges")
10
11(myol1 <- subsetByOverlaps(peak_M1A, peak_A1A, minoverlap = 1))
12
13
14myol2 <- olRanges(query = peak_M1A, subject = peak_A1A, output = "gr")
15
16
17myol2[values(myol2)["OLpercQ"][, 1] >= 50]
18
1library(ChIPpeakAnno)
2library(GenomicFeatures)
3
4args <- systemArgs(sysma = "param/annotate_peaks.param", mytargets = "targets_macs.txt")
5
6
7txdb <- makeTxDbFromGFF(file = "data/tair10.gff", format = "gff",
8 dataSource = "TAIR", organism = "Arabidopsis thaliana")
9
10ge <- genes(txdb, columns = c("tx_name", "gene_id", "tx_type"))
11
12for (i in seq(along = args)) {
13 peaksGR <- as(read.delim(infile1(args)[i], comment = "#"),
14 "GRanges")
15 annotatedPeak <- annotatePeakInBatch(peaksGR, AnnotationData = genes(txdb))
16 df <- data.frame(as.data.frame(annotatedPeak), as.data.frame(values(ge[values(annotatedPeak)$feature,
17 ])))
18 write.table(df, outpaths(args[i]), quote = FALSE, row.names = FALSE,
19 sep = "\t")
20}
21
22writeTargetsout(x = args, file = "targets_peakanno.txt", overwrite = TRUE)
1library(ChIPseeker)
2
3for (i in seq(along = args)) {
4 peakAnno <- annotatePeak(infile1(args)[i], TxDb = txdb, verbose = FALSE)
5 df <- as.data.frame(peakAnno)
6 write.table(df, outpaths(args[i]), quote = FALSE, row.names = FALSE,
7 sep = "\t")
8}
9
10writeTargetsout(x = args, file = "targets_peakanno.txt", overwrite = TRUE)
1peak <- readPeakFile(infile1(args)[1])
2
3covplot(peak, weightCol = "X.log10.pvalue.")
4peakHeatmap(outpaths(args)[1], TxDb = txdb, upstream = 1000,
5 downstream = 1000, color = "red")
6
7plotAvgProf2(outpaths(args)[1], TxDb = txdb, upstream = 1000,
8 downstream = 1000, xlab = "Genomic Region (5'->3')", ylab = "Read Count Frequency")
9
1library(GenomicRanges)
2
3args <- systemArgs(sysma = "param/count_rangesets.param", mytargets = "targets_macs.txt")
4
5args_bam <- systemArgs(sysma = NULL, mytargets = "targets_bam.txt")
6
7bfl <- BamFileList(outpaths(args_bam), yieldSize = 50000, index = character())
8
9countDFnames <- countRangeset(bfl, args, mode = "Union", ignore.strand = TRUE)
10
11writeTargetsout(x = args, file = "targets_countDF.txt", overwrite = TRUE)
12
runDiff 函數使用 edgeR 或 DESeq2 以批處理模式對多個計數表執行差異結合分析( Robinson,McCarthy, Smyth 2010; Love,Huber, Anders 2014)。在內部,它調用函數 run_edgeR 和 run_DESeq2 。它還使用 dbrfilter 參數下提供的 Fold change 和 FDR 閾值 cutoff 返回下遊 filterDEGs 函數的過濾結果和圖。
1args_diff <- systemArgs(sysma = "param/rundiff.param", mytargets = "targets_countDF.txt")
2
3cmp <- readComp(file = args_bam, format = "matrix")
4
5dbrlist <- runDiff(args = args_diff, diffFct = run_edgeR, targets = targetsin(args_bam),
6 cmp = cmp[[1]], independent = TRUE, dbrfilter = c(Fold = 2,
7 FDR = 1))
8
9writeTargetsout(x = args_diff, file = "targets_rundiff.txt",
10 overwrite = TRUE)
11
1args <- systemArgs(sysma = "param/macs2.param", mytargets = "targets_bam_ref.txt")
2
3args_anno <- systemArgs(sysma = "param/annotate_peaks.param",
4 mytargets = "targets_macs.txt")
5
6annofiles <- outpaths(args_anno)
7
8gene_ids <- sapply(names(annofiles), function(x) unique(as.character(read.delim(annofiles[x])[,
9 "gene_id"])), simplify = FALSE)
10
11load("data/GO/catdb.RData")
12
13BatchResult <- GOCluster_Report(catdb = catdb, setlist = gene_ids,
14 method = "all", id_type = "gene", CLSZ = 2, cutoff = 0.9,
15 gocats = c("MF", "BP", "CC"), recordSpecGO = NULL)
16
1library(Biostrings)
2library(seqLogo)
3library(BCRANK)
4
5args <- systemArgs(sysma = "param/annotate_peaks.param", mytargets = "targets_macs.txt")
6
7rangefiles <- infile1(args)
8
9for (i in seq(along = rangefiles)) {
10 df <- read.delim(rangefiles[i], comment = "#")
11 peaks <- as(df, "GRanges")
12
13 names(peaks) <- paste0(as.character(seqnames(peaks)), "_",
14 start(peaks), "-", end(peaks))
15
16 peaks <- peaks[order(values(peaks)$X.log10.pvalue., decreasing = TRUE)]
17
18 pseq <- getSeq(FaFile("./data/tair10.fasta"), peaks)
19 names(pseq) <- names(peaks)
20 writeXStringSet(pseq, paste0(rangefiles[i], ".fasta"))
21}
1set.seed(0)
2
3BCRANKout <- bcrank(paste0(rangefiles[1], ".fasta"), restarts = 25,
4 use.P1 = TRUE, use.P2 = TRUE)
5
6toptable(BCRANKout)
7
8topMotif <- toptable(BCRANKout, 1)
9
10weightMatrix <- pwm(topMotif, normalize = FALSE)
11
12weightMatrixNormalized <- pwm(topMotif, normalize = TRUE)
13
14pdf("results/seqlogo.pdf")
15seqLogo(weightMatrixNormalized)
16dev.off()
1sessionInfo()
1#
2#
3#
4#
5#
6#
7#
8#
9#
10#
11#
12#
13#
14#
15#
16#
17#
18#
19#
20#
21#
22#
23#
24#
25#
26#
27#
28#
29#
30#
31#
32#
33#
34#
35#
36#
37#
38#
39#
40#
41#
42#
43#
44#
45#
46#
47#
48#
49#
50#
51#
52#
53#
54#
55#
56#
57#
58#
59#
60#
61#
62#
63#
64#
65#
66#
67#
68#
69#
70#
71#
72#
73#
74#
75#
76#
梁芳,徐柯,龔朝建,李俏,馬健,熊煒,曾朝陽,李桂源.染色質免疫沉澱-測序:全基因組範圍研究蛋白質-DNA相互作用的新技術[J].生物化學與生物物理進展,2013,40(03):216-227. http://kns.cnki.net//KXReader/Detail?TIMESTAMP=636956914509048750&DBCODE=CJFQ&TABLEName=CJFD2013&FileName=SHSW201303004&RESULT=1&SIGN=2RGYhJSTVwsvrs6a%2fR6iTqtx38A%3d
Borbala Mifsud, Kathi Zarnack, Anaïs F Bardet. 2019. 「Practical Guide to ChIP-seq Data Analysis.」 International Standard Book Number-13: 978-1-138-59652-8. https://www.taylorfrancis.com/books/e/9780429487590
H Backman, Tyler W, and Thomas Girke. 2016. 「systemPipeR: NGS workflow and report generation environment.」 BMC Bioinformatics 17 (1):388. https://doi.org/10.1186/s12859-016-1241-0.
Langmead, Ben, and Steven L Salzberg. 2012. 「Fast Gapped-Read Alignment with Bowtie 2.」 Nat. Methods 9 (4). Nature Publishing Group:357–59. https://doi.org/10.1038/nmeth.1923.
Love, Michael, Wolfgang Huber, and Simon Anders. 2014. 「Moderated Estimation of Fold Change and Dispersion for RNA-seq Data with DESeq2.」 Genome Biol. 15 (12):550. https://doi.org/10.1186/s13059-014-0550-8.
Robinson, M D, D J McCarthy, and G K Smyth. 2010. 「EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.」 Bioinformatics 26 (1):139–40. https://doi.org/10.1093/bioinformatics/btp616.
Yu, Guangchuang, Li-Gen Wang, and Qing-Yu He. 2015. 「ChIPseeker: An R/Bioconductor Package for ChIP Peak Annotation, Comparison and Visualization.」 Bioinformatics 31 (14):2382–3. https://doi.org/10.1093/bioinformatics/btv145.
Zhang, Y, T Liu, C A Meyer, J Eeckhoute, D S Johnson, B E Bernstein, C Nussbaum, et al. 2008. 「Model-Based Analysis of ChIP-Seq (MACS).」 Genome Biol. 9 (9). https://doi.org/10.1186/gb-2008-9-9-r137.
Zhu, Lihua J, Claude Gazin, Nathan D Lawson, Hervé Pagès, Simon M Lin, David S Lapointe, and Michael R Green. 2010. 「ChIPpeakAnno: A Bioconductor Package to Annotate ChIP-seq and ChIP-chip Data.」 BMC Bioinformatics 11 (11~may):237. https://doi.org/10.1186/1471-2105-11-237.