超能餅乾|SnapATAC分析單細胞ATAC-seq數據(三)

2021-02-20 科研貓

簡介

在本教程中，我們將對PBMC的兩個scATAC-seq數據集(5K和10K)和一個scRNA-seq數據集進行整合分析。這三個數據集均來自10X genomics測序平臺產生的數據，可以直接在10x官網下載使用。
具體來說，我們將主要執行以下分析內容：

Cell selection for PBMC 5k and 10k scATAC;

Randomly sample 10,000 cells as landmarks;

Unsupervised clustering of landmarks;

Project the remaining (query) cells onto the landmarks;

Supervised annotation of scATAC clusters using scRNA dataset;

Downstream analysis including peak calling, differential analysis, prediction of gene-enhancer pairing.

分析步驟

Step 0. Data download

Step 1. Barcode selection

Step 2. Add cell-by-bin matrix

Step 3. Matrix binarization

Step 4. Bin filtering

Step 5. Dimensionality reduction of landmarks

Step 6. Determine significant components

Step 7. Graph-based clustering

Step 8. Visualization

Step 9. scRNA-seq based annotation

Step 10. Create psudo multiomics cells

Step 11. Remove cells of low prediction score

Step 12. Gene expression projected into UMAP

Step 13. Identify peak

Step 14. Create a cell-by-peak matrix

Step 15. Identify differentially accessible regions

Step 16. Motif variability analysis

Step 17. De novo motif discovery

Step 18. Predict gene-enhancer pairs

Step 0. Data Download

開始分析之前，我們需要下載snaptools生成的snap文件和cellranger-atac產生的singlecell.csv文件。

# 下載所需的數據集和基因注釋信息
$ wget http://renlab.sdsc.edu/r3fang//share/github/PBMC_ATAC_RNA/atac_pbmc_5k_nextgem.snap
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_5k_nextgem/atac_pbmc_5k_nextgem_singlecell.csv
$ wget http://renlab.sdsc.edu/r3fang//share/github/PBMC_ATAC_RNA/atac_pbmc_10k_nextgem.snap
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/atac_pbmc_10k_nextgem_singlecell.csv
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_pbmc_10k_nextgem/hg19.blacklist.bed.gz
$ wget http://cf.10xgenomics.com/samples/cell-atac/1.1.0/gencode.v19.annotation.gene.bed
Step 1. Barcode selection

首先，我們根據以下兩個主要的標準來選擇高質量的barcodes：
1) number of unique fragments;
2) fragments in promoter ratio;

# 加載所需R包
> library(SnapATAC);
> snap.files = c(
    "atac_pbmc_5k_nextgem.snap",
    "atac_pbmc_10k_nextgem.snap"
  );
> sample.names = c(
    "PBMC 5K",
    "PBMC 10K"
  );
> barcode.files = c(
    "atac_pbmc_5k_nextgem_singlecell.csv",
    "atac_pbmc_10k_nextgem_singlecell.csv"
  );

# 讀取snap文件
> x.sp.ls = lapply(seq(snap.files), function(i){
    createSnap(
        file=snap.files[i],
        sample=sample.names[i]
    );
  })
> names(x.sp.ls) = sample.names;

# 讀取barcode信息
> barcode.ls = lapply(seq(snap.files), function(i){
    barcodes = read.csv(
        barcode.files[i],
        head=TRUE
    );
    # remove NO BAROCDE line
    barcodes = barcodes[2:nrow(barcodes),];
    barcodes$logUMI = log10(barcodes$passed_filters + 1);
    barcodes$promoter_ratio = (barcodes$promoter_region_fragments+1) / (barcodes$passed_filters + 1);
    barcodes
  })

# 質控指標數據可視化
> plots = lapply(seq(snap.files), function(i){
    p1 = ggplot(
        barcode.ls[[i]],
        aes(x=logUMI, y=promoter_ratio)) +
        geom_point(size=0.3, col="grey") +
        theme_classic()    +
        ggtitle(sample.names[[i]]) +
        ylim(0, 1) + xlim(0, 6) +
        labs(x = "log10(UMI)", y="promoter ratio")
        p1
    })
> plots

# 查看snap對象信息
> x.sp.ls
## $`PBMC 5K`
## number of barcodes: 20000
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
##
## $`PBMC 10K`
## number of barcodes: 20000
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

# for both datasets, we identify usable barcodes using [3.5-5] for log10(UMI) and [0.4-0.8] for promoter ratio as cutoff.
# 設定質控閾值進行篩選過濾
> cutoff.logUMI.low = c(3.5, 3.5);
> cutoff.logUMI.high = c(5, 5);
> cutoff.FRIP.low = c(0.4, 0.4);
> cutoff.FRIP.high = c(0.8, 0.8);

# 根據過濾的閾值進行barcode的篩選
> barcode.ls = lapply(seq(snap.files), function(i){
    barcodes = barcode.ls[[i]];
    idx = which(
        barcodes$logUMI >= cutoff.logUMI.low[i] &
        barcodes$logUMI <= cutoff.logUMI.high[i] &
        barcodes$promoter_ratio >= cutoff.FRIP.low[i] &
        barcodes$promoter_ratio <= cutoff.FRIP.high[i]
    );
    barcodes[idx,]
  });

> x.sp.ls = lapply(seq(snap.files), function(i){
    barcodes = barcode.ls[[i]];
    x.sp = x.sp.ls[[i]];
    barcode.shared = intersect(x.sp@barcode, barcodes$barcode);
    x.sp = x.sp[match(barcode.shared, x.sp@barcode),];
    barcodes = barcodes[match(barcode.shared, barcodes$barcode),];
    x.sp@metaData = barcodes;
    x.sp
  })

> names(x.sp.ls) = sample.names;
> x.sp.ls
## $`PBMC 5K`
## number of barcodes: 4526
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
##
## $`PBMC 10K`
## number of barcodes: 9039
## number of bins: 0
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
# combine two snap object

# combine two snap object
# 使用Reduce函數將兩個snap對象進行合併
> x.sp = Reduce(snapRbind, x.sp.ls);
> x.sp@metaData["sample"] = x.sp@sample;

> x.sp
## number of barcodes: 13565
## number of bins: 0
## number of genes: 0
## number of peaks: 0

> table(x.sp@sample);
## PBMC 10K  PBMC 5K
##     9039     4526
Step 2. Add cell-by-bin matrix

接下來，我們使用addBmatToSnap函數生成5kb解析度的cell-by-bin計數矩陣，並將其添加到snap對象中。該函數將自動從兩個snap文件中讀取cell-by-bin矩陣，並將連接後的矩陣添加到snap對象的bmat屬性中。

# 使用addBmatToSnap函數計算cell-by-bin計數矩陣
> x.sp = addBmatToSnap(x.sp, bin.size=5000);
Step 3. Matrix binarization

我們使用makeBinary函數將cell-by-bin計數矩陣轉換為二進位矩陣。在count矩陣中，某些項具有異常高的覆蓋率，這可能是由比對錯誤造成的。因此，我們將刪除計數矩陣中top 0.1%的項，並將其餘非零的值轉換為1。

# 使用makeBinary函數將cell-by-bin計數矩陣轉換成二進位矩陣
> x.sp = makeBinary(x.sp, mat="bmat");
Step 4. Bin filtering

首先，我們過濾掉任何與ENCODE中blacklist區域重疊的bins，以防止潛在的artifacts。

> library(GenomicRanges);
# 讀取blacklist信息
> black_list = read.table("hg19.blacklist.bed.gz");
> black_list.gr = GRanges(
    black_list[,1],
    IRanges(black_list[,2], black_list[,3])
  );

> idy = queryHits(
    findOverlaps(x.sp@feature, black_list.gr)
  );
> if(length(idy) > 0){
    x.sp = x.sp[,-idy, mat="bmat"];
  };
> x.sp
## number of barcodes: 13565
## number of bins: 624794
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

接下來，我們移除那些不需要的染色體信息。

> chr.exclude = seqlevels(x.sp@feature)[grep("random|chrM", seqlevels(x.sp@feature))];

> idy = grep(paste(chr.exclude, collapse="|"), x.sp@feature);
> if(length(idy) > 0){
x.sp = x.sp[,-idy, mat="bmat"]
};
> x.sp
## number of barcodes: 13565
## number of bins: 624297
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

第三，bins的覆蓋率大致是服從對數正態分布的。我們將與不變特徵(如管家基因的啟動子)重疊的前5%的bins進行刪除。

> bin.cov = log10(Matrix::colSums(x.sp@bmat)+1);
> hist(
    bin.cov[bin.cov > 0],
    xlab="log10(bin cov)",
    main="log10(Bin Cov)",
    col="lightblue",
    xlim=c(0, 5)
  );
> bin.cutoff = quantile(bin.cov[bin.cov > 0], 0.95);
> idy = which(bin.cov <= bin.cutoff="" bin.cov=""> 0);
> x.sp = x.sp[, idy, mat="bmat"];
> x.sp
## number of barcodes: 13565
## number of bins: 534985
## number of genes: 0
## number of peaks: 0
## number of motifs: 0

最後，我們將進一步刪除bin覆蓋率小於1000的任何細胞，這是因為儘管有些細胞可能含有較高的unique fragments片段，但是過濾後的bin覆蓋率卻很低。此步驟是可選的，但強烈建議這樣做。

> idx = which(Matrix::rowSums(x.sp@bmat) > 1000);
> x.sp = x.sp[idx,];
> x.sp
## number of barcodes: 13434
## number of bins: 534985
## number of genes: 0
## number of peaks: 0
## number of motifs: 0
Step 5. Dimentionality reduction

SnapATAC採用diffusion maps算法進行數據降維，這是一種非線性降維的技術，它通過對數據執行random walk來發現低維的manifold，並且對噪音和擾動具有很強的魯棒性。

但是，diffusion maps算法的計算成本會隨著細胞數目的增加而呈現指數級增長的趨勢。為了克服這一局限，我們將Nyström方法(a sampling technique)和diffusion maps算法相結合，給出Nyström Landmark diffusion map來生成大規模數據集的低維嵌入。

Nyström landmark diffusion maps算法主要包括以下三個步驟：

sampling: sample a subset of K (K≪N) cells from N total cells as 「landmarks」. Instead of random sampling, here we adopted a density-based sampling approach developed in SCTransform to preserve the density distribution of the N original points;

embedding: compute a diffusion map embedding for K landmarks;

extension: project the remaining N-K cells onto the low-dimensional embedding as learned from the landmarks to create a joint embedding space for all cells.

在本示例中，我們將採樣10,000個細胞作為landmarks，並將餘下的query細胞投射到嵌入landmarks的diffusion maps上。

> row.covs.dens <- density(
    x = x.sp@metaData[,"logUMI"],
    bw = 'nrd', adjust = 1
  );
> sampling_prob <- 1 / (approx(x = row.covs.dens$x, y = row.covs.dens$y, xout = x.sp@metaData[,"logUMI"])$y + .Machine$double.eps);

> set.seed(1);
> idx.landmark.ds <- sort(sample(x = seq(nrow(x.sp)), size = 10000, prob = sampling_prob));

將x.sp對象分割為landmark(x.landmark.sp)細胞和query(x.query.sp)細胞。

> x.landmark.sp = x.sp[idx.landmark.ds,];
> x.query.sp = x.sp[-idx.landmark.ds,];

Run diffusion maps on the landmark cells.
使用landmark細胞進行diffusion maps降維處理

> x.landmark.sp = runDiffusionMaps(
    obj= x.landmark.sp,
    input.mat="bmat",
    num.eigs=50
  );
> x.landmark.sp@metaData$landmark = 1;

Porject query cells to landmarks.
將query細胞映射到landmarks上

> x.query.sp = runDiffusionMapsExtension(
    obj1=x.landmark.sp,
    obj2=x.query.sp,
    input.mat="bmat"
  );
> x.query.sp@metaData$landmark = 0;

Combine landmark and query cells.
合併landmark和query細胞
Note: To merge snap objects, all the matrix (bmat, gmat, pmat) and metaData must be of the same number of columns between snap objects.

> x.sp = snapRbind(x.landmark.sp, x.query.sp);
> x.sp = x.sp[order(x.sp@metaData[,"sample"])]; #IMPORTANT
Step 6. Determine significant components

接下來，我們將確定特徵向量（eigen-vectors）的數目，以便用於後續的分析。在下面的例子中，我們選擇前15個特徵向量。

> plotDimReductPW(
    obj=x.sp,
    eigs.dims=1:50,
    point.size=0.3,
    point.color="grey",
    point.shape=19,
    point.alpha=0.6,
    down.sample=5000,
    pdf.file.name=NULL,
    pdf.height=7,
    pdf.width=7
  );

Step 7. Graph-based clustering

接下來，我們使用所選的特徵向量成分，來構造K近鄰（KNN）聚類圖。其中，每個節點是一個細胞，根據歐氏距離來確定每個細胞的k個近鄰細胞。

> x.sp = runKNN(
    obj=x.sp,
    eigs.dims=1:20,
    k=15
  );

> library(leiden);
> x.sp=runCluster(
    obj=x.sp,
    tmp.folder=tempdir(),
    louvain.lib="leiden",
    seed.use=10,
    resolution=0.7
  );
Step 8. Visualization

SnapATAC可以使用tSNE(FI-tsne)和UMAP方法對降維聚類後的細胞進行可視化的展示和探索。在本示例中，我們使用UMAP方法進行展示。

> library(umap);
> x.sp = runViz(
    obj=x.sp,
    tmp.folder=tempdir(),
    dims=2,
    eigs.dims=1:20,
    method="umap",
    seed.use=10
  );

> par(mfrow = c(2, 2));
> plotViz(
    obj= x.sp,
    method="umap",
    main="Cluster",
    point.color=x.sp@cluster,
    point.size=0.2,
    point.shape=19,
    text.add=TRUE,
    text.size=1,
    text.color="black",
    down.sample=10000,
    legend.add=FALSE
  );
> plotFeatureSingle(
    obj=x.sp,
    feature.value=x.sp@metaData[,"logUMI"],
    method="umap",
    main="Read Depth",
    point.size=0.2,
    point.shape=19,
    down.sample=10000,
    quantiles=c(0.01, 0.99)
  );
> plotViz(
    obj= x.sp,
    method="umap",
    main="Sample",
    point.size=0.2,
    point.shape=19,
    point.color=x.sp@sample,
    text.add=FALSE,
    text.size=1.5,
    text.color="black",
    down.sample=10000,
    legend.add=TRUE
    );
> plotViz(
    obj= x.sp,
    method="umap",
    main="Landmark",
    point.size=0.2,
    point.shape=19,
    point.color=x.sp@metaData[,"landmark"],
    text.add=FALSE,
    text.size=1.5,
    text.color="black",
    down.sample=10000,
    legend.add=TRUE
  );

Step 9. scRNA-seq based annotation

在本示例中，我們將使用相應的scRNA-seq數據集來對scATAC-seq數據的細胞類群進行注釋。我們可以通過(https://www.dropbox.com/s/3f3p5nxrn5b3y4y/pbmc_10k_v3.rds?dl=1)地址下載所需的10X PBMC scRNA-seq(pbmc_10k_v3.rds)的Seurat對象。

> library(Seurat);
> pbmc.rna = readRDS("pbmc_10k_v3.rds");
> pbmc.rna$tech = "rna";
> variable.genes = VariableFeatures(object = pbmc.rna);

> genes.df = read.table("gencode.v19.annotation.gene.bed");
> genes.gr = GRanges(genes.df[,1], IRanges(genes.df[,2], genes.df[,3]), name=genes.df[,4]);
> genes.sel.gr = genes.gr[which(genes.gr$name %in% variable.genes)];

## reload the bmat, this is optional but highly recommanded
> x.sp = addBmatToSnap(x.sp);
> x.sp = createGmatFromMat(
    obj=x.sp,
    input.mat="bmat",
    genes=genes.sel.gr,
    do.par=TRUE,
    num.cores=10
  );

接下來，我們將snap對象轉換為Seurat對象，用於後續的數據整合。

# 使用snapToSeurat函數將snap對象轉換為Seurat對象
> pbmc.atac <- snapToSeurat(
    obj=x.sp,
    eigs.dims=1:20,
    norm=TRUE,
    scale=TRUE
  );

# 識別整合的Anchors信息
> transfer.anchors <- FindTransferAnchors(
    reference = pbmc.rna,
    query = pbmc.atac,
    features = variable.genes,
    reference.assay = "RNA",
    query.assay = "ACTIVITY",
    reduction = "cca"
  );

# 使用TransferData函數進行數據映射整合
> celltype.predictions <- TransferData(
    anchorset = transfer.anchors,
    refdata = pbmc.rna$celltype,
    weight.reduction = pbmc.atac[["SnapATAC"]],
    dims = 1:20
  );
> x.sp@metaData$predicted.id = celltype.predictions$predicted.id;
> x.sp@metaData$predict.max.score = apply(celltype.predictions[,-1], 1, max);
> x.sp@cluster = as.factor(x.sp@metaData$predicted.id);
Step 10. Create psudo multiomics cells

現在，snap對象x.sp中包含了的每個細胞的染色質可及性@bmat和基因表達@gmat的信息。

> refdata <- GetAssayData(
    object = pbmc.rna,
    assay = "RNA",
    slot = "data"
  );
> imputation <- TransferData(
    anchorset = transfer.anchors,
    refdata = refdata,
    weight.reduction = pbmc.atac[["SnapATAC"]],
    dims = 1:20
  );
> x.sp@gmat = t(imputation@data);
> rm(imputation); # free memory
> rm(refdata);    # free memory
> rm(pbmc.rna);   # free memory
> rm(pbmc.atac); # free memory
Step 11. Remove cells of low prediction score> hist(
    x.sp@metaData$predict.max.score,
    xlab="prediction score",
    col="lightblue",
    xlim=c(0, 1),
    main="PBMC 10X"
  );
> abline(v=0.5, col="red", lwd=2, lty=2);

> table(x.sp@metaData$predict.max.score > 0.5);
## FALSE  TRUE
##  331 13103

> x.sp = x.sp[x.sp@metaData$predict.max.score > 0.5,];
> x.sp
## number of barcodes: 13045
## number of bins: 627478
## number of genes: 19089
## number of peaks: 0

> plotViz(
    obj=x.sp,
    method="umap",
    main="PBMC 10X",
    point.color=x.sp@metaData[,"predicted.id"],
    point.size=0.5,
    point.shape=19,
    text.add=TRUE,
    text.size=1,
    text.color="black",
    down.sample=10000,
    legend.add=FALSE
  );

Step 12. Gene expression projected onto UMAP

接下來，我們使用UMAP方法對一些marker基因的表達水平進行可視化展示。

> marker.genes = c(
    "IL32", "LTB", "CD3D",
    "IL7R", "LDHB", "FCGR3A",
    "CD68", "MS4A1", "GNLY",
    "CD3E", "CD14", "CD14",
    "FCGR3A", "LYZ", "PPBP",
    "CD8A", "PPBP", "CST3",
    "NKG7", "MS4A7", "MS4A1",
    "CD8A"
  );

> par(mfrow = c(3, 3));
> for(i in 1:9){
    j = which(colnames(x.sp@gmat) == marker.genes[i])
    plotFeatureSingle(
        obj=x.sp,
        feature.value=x.sp@gmat[,j],
        method="umap",
        main=marker.genes[i],
        point.size=0.1,
        point.shape=19,
        down.sample=10000,
        quantiles=c(0.01, 0.99)
)};

Step 13. Identify peaks

接下來，我們將每個cluster群中的reads進行聚合，創建用於peak calling和可視化的集成track。執行完peak calling後，將生成一個.narrowPeak的文件，該文件中包含了識別出的peaks信息，和一個.bedgraph的文件，可以用於可視化展示。為了獲得最robust的結果，我們不建議對細胞數目小於100的cluster群執行此步驟。

> system("which snaptools");
/home/r3fang/anaconda2/bin/snaptools
> system("which macs2");
/home/r3fang/anaconda2/bin/macs2

# 使用runMACS函數調用macs2進行peak calling
> peaks = runMACS(
    obj=x.sp[which(x.sp@cluster=="CD4 Naive"),],
    output.prefix="PBMC.CD4_Naive",
    path.to.snaptools="/home/r3fang/anaconda2/bin/snaptools",
    path.to.macs="/home/r3fang/anaconda2/bin/macs2",
    gsize="hs", # mm, hs, etc
    buffer.size=500,
    num.cores=10,
    macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
    tmp.folder=tempdir()
    );

為了對所有的cluster群執行peak calling，我們提供了一個簡便的腳本來實現該步驟。

# call peaks for all cluster with more than 100 cells
# 選出那些細胞數目大於100的cluster群
> clusters.sel = names(table(x.sp@cluster))[which(table(x.sp@cluster) > 100)];

# 批量對不同的cluster進行peak calling
> peaks.ls = mclapply(seq(clusters.sel), function(i){
    print(clusters.sel[i]);
    peaks = runMACS(
        obj=x.sp[which(x.sp@cluster==clusters.sel[i]),],
        output.prefix=paste0("PBMC.", gsub(" ", "_", clusters.sel)[i]),
        path.to.snaptools="/home/r3fang/anaconda2/bin/snaptools",
        path.to.macs="/home/r3fang/anaconda2/bin/macs2",
        gsize="hs", # mm, hs, etc
        buffer.size=500,
        num.cores=1,
        macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
        tmp.folder=tempdir()
   );
    peaks
     }, mc.cores=5);

> peaks.names = system("ls | grep narrowPeak", intern=TRUE);

> peak.gr.ls = lapply(peaks.names, function(x){
    peak.df = read.table(x)
    GRanges(peak.df[,1], IRanges(peak.df[,2], peak.df[,3]))
  })

# 合併peak信息
> peak.gr = reduce(Reduce(c, peak.gr.ls));

我們可以將每個cluster群生成的bdg文件導入到IGV或其他基因組瀏覽器(如UW genome browser)進行可視化的展示，下面是來自UW genome browser的FOXJ2基因及其側翼區域的screenshot。

Step 14. Create a cell-by-peak matrix

接下來，我們使用合併後的peaks作為參考，來創建一個cell-by-peak計數矩陣，並將其添加到snap對象中。
首先，我們將合併後的peaks信息寫入到peaks.combined.bed文件中

> peaks.df = as.data.frame(peak.gr)[,1:3];
> write.table(peaks.df,file = "peaks.combined.bed",append=FALSE,
        quote= FALSE,sep="\t", eol = "\n", na = "NA", dec = ".",
        row.names = FALSE, col.names = FALSE, qmethod = c("escape", "double"),
        fileEncoding = "")

接下來，我們創建一個cell-by-peak計數矩陣，並將其添加到snap對象中。這一步可能需要一段時間。

$ snaptools snap-add-pmat \
    --snap-file atac_pbmc_10k_nextgem.snap \
    --peak-file peaks.combined.bed &
$ snaptools snap-add-pmat \
    --snap-file atac_pbmc_5k_nextgem.snap \
    --peak-file peaks.combined.bed

然後，我們將計算好的cell-by-peak計數矩陣添加到已有的snap對象中。

> x.sp = addPmatToSnap(x.sp);
Step 15. Identify differentially accessible peaks

對於一組給定的細胞Ci，我們首先在diffusion component空間中尋找鄰近的細胞Cj (|Ci|=|Cj|)作為「background」細胞進行比較。如果Ci細胞的數目佔總細胞數目的一半以上，則使用剩餘的細胞作為local background。接下來，我們將Ci和Cj細胞進行聚合，來創建兩個raw-count向量，即Vci和Vcj。然後，我們使用R包edgeR (v3.18.1)的精確檢驗（exact test）來對Vci和Vcj進行差異分析，其中，對於小鼠設置BCV=0.1，而人類設置BCV= 0.4。最後，再使用Benjamini-Hochberg多重檢驗校正方法，來調整p-value值為錯誤發現率(FDR)值。對於FDR值小於0.05的peaks，被選為顯著的DARs。

我們發現這種方法也存在一定的局限性，特別是對於那些細胞數目較少的cluster群，統計的有效性可能不夠強大。對於那些未能識別出顯著差異peaks的cluster群，我們將根據富集p-value值對元素進行排序，並挑選出最具代表性的2,000個peaks進行後續的motif分析。

> DARs = findDAR(
    obj=x.sp,
    input.mat="pmat",
    cluster.pos="CD14+ Monocytes",
    cluster.neg.method="knn",
    test.method="exactTest",
    bcv=0.4, #0.4 for human, 0.1 for mouse
    seed.use=10
  );
> DARs$FDR = p.adjust(DARs$PValue, method="BH");
> idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);

> par(mfrow = c(1, 2));
> plot(DARs$logCPM, DARs$logFC,
    pch=19, cex=0.1, col="grey",
    ylab="logFC", xlab="logCPM",
    main="CD14+ Monocytes"
  );
> points(DARs$logCPM[idy],
    DARs$logFC[idy],
    pch=19,
    cex=0.5,
    col="red"
  );
> abline(h = 0, lwd=1, lty=2);

> covs = Matrix::rowSums(x.sp@pmat);
> vals = Matrix::rowSums(x.sp@pmat[,idy]) / covs;
> vals.zscore = (vals - mean(vals)) / sd(vals);

> plotFeatureSingle(
    obj=x.sp,
    feature.value=vals.zscore,
    method="umap",
    main="CD14+ Monocytes",
    point.size=0.1,
    point.shape=19,
    down.sample=5000,
    quantiles=c(0.01, 0.99)
  );Step 16. Motif variability analysis

SnapATAC可以調用chromVAR(Schep et al)程序進行motif可變性分析。

# 加載所需的R包
> library(chromVAR);
> library(motifmatchr);
> library(SummarizedExperiment);
> library(BSgenome.Hsapiens.UCSC.hg19);

> x.sp = makeBinary(x.sp, "pmat");

# 使用runChromVAR函數調用ChromVAR進行motif可變性分析
> x.sp@mmat = runChromVAR(
    obj=x.sp,
    input.mat="pmat",
    genome=BSgenome.Hsapiens.UCSC.hg19,
    min.count=10,
    species="Homo sapiens"
  );
> x.sp;
## number of barcodes: 13103
## number of bins: 627478
## number of genes: 19089
## number of peaks: 157750
## number of motifs: 271

> motif_i = "MA0071.1_RORA";
> dat = data.frame(x=x.sp@metaData$predicted.id, y=x.sp@mmat[,motif_i]);
> p <- ggplot(dat, aes(x=x, y=y, fill=x)) +
    theme_classic() +
    geom_violin() +
    xlab("cluster") +
    ylab("motif enrichment") +
    ggtitle("MA0071.1_RORA") +
    theme(
          plot.margin = margin(5,1,5,1, "cm"),
          axis.text.x = element_text(angle = 90, hjust = 1),
          axis.ticks.x=element_blank(),
          legend.position = "none"
   );

Step 17. De novo motif discovery

SnapATAC還可以調用homer用於對識別出的差異可及性區域(DARs)進行motif的識別和富集分析.

# 查看findMotifsGenome.pl程序安裝的路徑
> system("which findMotifsGenome.pl");
/projects/ps-renlab/r3fang/public_html/softwares/homer/bin/findMotifsGenome.pl

# 使用findDAR函數鑑定DAR區域
> DARs = findDAR(
    obj=x.sp,
    input.mat="pmat",
    cluster.pos="Double negative T cell",
    cluster.neg.method="knn",
    test.method="exactTest",
    bcv=0.4, #0.4 for human, 0.1 for mouse
    seed.use=10
  );
> DARs$FDR = p.adjust(DARs$PValue, method="BH");
> idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);

# 使用runHomer函數調用homer進行de novo motif discovery
> motifs = runHomer(
    x.sp[,idy,"pmat"],
    mat = "pmat",
    path.to.homer = "/projects/ps-renlab/r3fang/public_html/softwares/homer/bin/findMotifsGenome.pl",
    result.dir = "./homer/DoubleNegativeTcell",
    num.cores=5,
    genome = 'hg19',
    motif.length = 10,
    scan.size = 300,
    optimize.count = 2,
    background = 'automatic',
    local.background = FALSE,
    only.known = TRUE,
    only.denovo = FALSE,
    fdr.num = 5,
    cache = 100,
    overwrite = TRUE,
    keep.minimal = FALSE
    );

Step 18. Predict gene-enhancer pairs

最後，我們還可以使用一種「pseudo」細胞的方法，根據單個細胞中基因的表達和遠端調控元件的染色質可及性的關係，將遠端調控元件連接到目標基因上。對於給定的marker基因，我們首先識別出marker基因側翼區域內的peak。對於每個側翼的peak，我們使用基因表達作為輸入變量進行邏輯回歸來預測binarized的染色質狀態。產生的結果估計了染色質可及性與基因表達之間聯繫的重要性。

> TSS.loci = GRanges("chr12", IRanges(8219067, 8219068));

> pairs = predictGenePeakPair(
    x.sp,
    input.mat="pmat",
    gene.name="C3AR1",
    gene.loci=resize(TSS.loci, width=500000, fix="center"),
    do.par=FALSE
    );

# convert the pair to genome browser arc plot format
> pairs.df = as.data.frame(pairs);
> pairs.df = data.frame(
    chr1=pairs.df[,"seqnames"],
    start1=pairs.df[,"start"],
    end1=pairs.df[,"end"],
    chr2="chr2",
    start2=8219067,
    end2=8219068,
    Pval=pairs.df[,"logPval"]
    );
> head(pairs.df)
##    chr1  start1    end1 chr2  start2    end2       Pval
## 1 chr12 7984734 7985229 chr2 8219067 8219068 14.6075918
## 2 chr12 7987561 7988085 chr2 8219067 8219068  5.6718381
## 3 chr12 7989776 7990567 chr2 8219067 8219068 24.2564608
## 4 chr12 7996454 7996667 chr2 8219067 8219068  0.6411017
## 5 chr12 8000059 8000667 chr2 8219067 8219068  2.0324922
## 6 chr12 8012404 8013040 chr2 8219067 8219068  0.0000000

參考來源：https://gitee.com/booew/SnapATAC/blob/master/examples/10X_PBMC_15K/README.md

超能餅乾|SnapATAC分析單細胞ATAC-seq數據(三)

相關焦點

超能餅乾|SnapATAC分析單細胞ATAC-seq數據(一)

...學院張強鋒課題組利用深度學習人工智慧算法分析單細胞ATAC-seq...

缺什麼來什麼,單細胞ATAC的數據有救了!

從數據分析到結論產生,談談scATAC-seq

【中文在線網絡研討會】單細胞ATAC-Seq入門:高解析度表觀基因組學研究

解讀單細胞RNA-seq技術

綜述科普|染色質調控區域的研究:對CHIP-seq和ATAC-seq發展的深入思考

Direct-seq:單細胞水平分析CRISPR基因編輯篩選實驗

一文讀懂表觀遺傳學研究利器——ATAC-seq技術及應用丨深度長文

Direct-seq:單細胞水平分析CRISPR基因編輯篩選實驗 | Genome...

Nature||單細胞轉錄組測序和ATAC-Seq聯合解碼人類大腦海馬體體發育與形成機制

mtscATAC-seq:單細胞線粒體DNA基因分型與染色質分析新方法

ATAC-Seq Motif 富集分析

scRNA-seq數據差異基因表達分析的有效方法有哪些?

單細胞數據分析神器——Seurat

學徒跟著B站ATAC-seq視頻5天完成流程

吳昊團隊開發檢測單細胞mRNA動態變化新技術scNT-seq

單細胞轉錄組高級分析四:scRNA數據推斷CNV

單細胞測序揭秘COVID-19,TotalSeqTM & scRNA-seq 「珠聯璧合」不能少

何愛彬研究組開發高通量單細胞ChIP-seq技術——CoBATCH並解析多...