單細胞轉錄組雖然說不太可能跟傳統的bulk轉錄組那樣對每個樣本都測定到好幾萬基因的表達量,如果是10x這樣的技術,每個細胞也就幾百個基因是有表達量的,但是架不住細胞數量多,對大量細胞來說,這個表達矩陣的基因數量就很可觀了。一般來說,gencode資料庫的gtf文件注釋到的五萬多個基因裡面,包括蛋白編碼基因和非編碼的,可以在單細胞表達矩陣裡面過濾低表達量後,還剩下2萬多個左右。
2萬個基因,近萬個細胞的表達矩陣,降維起來是一個浩大的計算量,所以有一個步驟,就是從2萬個基因裡面挑選一下 highly variable genes (HVG) 進行下遊分析,從此就假裝自己的單細胞轉錄組就僅僅是測到了HVGs,數量嘛,兩千多吧!
那麼,問題就來了, 什麼樣的標準,算法來挑選 highly variable genes (HVG) 進行下遊分析呢?
搜索最新文獻簡單谷歌搜索highly variable genes (HVG) 加上單細胞,這樣的關鍵詞就可以看到很多手把手教程:
2019六月的:Current best practices in single‐cell RNA‐seq analysis: a tutorial 地址:https://www.embopress.org/doi/10.15252/msb.20188746
bioconductor教程:Using scran to analyze single-cell RNA-seq data https://bioconductor.org/packages/devel/bioc/vignettes/scran/inst/doc/scran.html
一步步手把手流程:https://dockflow.org/workflow/simple-single-cell/
基本上經過四五個小時的模式,你就會總結到一些常見的挑選 highly variable genes (HVG) 的算法和R包,我簡單羅列5個,會對比一下:
seurat包的FindVariableFeatures函數,默認挑選2000個
monocle包的monocle_hvg函數
scran包的decomposeVar函數
M3Drop的BrenneckeGetVariableGenes
以及文獻參考::(Brennecke et al 2013 method) Extract genes with a squared coefficient of variation >2 times the fit regression
在scRNAseq數據集比較這5個方法接下來我就沒有空繼續做這些瑣碎的比較啦,老規矩,跟我們的單細胞一期和二期學習視頻一下,學習筆記我們有專員操刀,下面看公司學習者的表演:
目的:利用scRNA-seq包的表達矩陣測試幾個R包尋找HVGs,然後畫upsetR圖看看不同方法的HVG的重合情況。
rm(list = ls())
options(stringsAsFactors = F)
包中內置了Pollen et al. 2014 的數據集(https://www.nature.com/articles/nbt.2967),到19年8月為止,已經有446引用量了。只不過原文完整的數據是 23730 features, 301 samples,這個包中只選取了4種細胞類型:pluripotent stem cells 分化而成的 neural progenitor cells (NPC,神經前體細胞) ,還有 GW16(radial glia,放射狀膠質細胞) 、GW21(newborn neuron,新生兒神經元) 、GW21+3(maturing neuron,成熟神經元)
library(scRNAseq)
#
data(fluidigm)
# 得到RSEM矩陣
assay(fluidigm) <- assays(fluidigm)$rsem_counts
ct <- floor(assays(fluidigm)$rsem_counts)
ct[1:4,1:4]
#
#
#
#
#
# 樣本注釋信息
pheno_data <- as.data.frame(colData(fluidigm))
table(pheno_data$Coverage_Type)
#
#
#
table(pheno_data$Biological_Condition)
#
#
#
suppressMessages(library(Seurat))
packageVersion("Seurat")
seurat_sce <- CreateSeuratObject(counts = ct,
meta.data = pheno_data,
min.cells = 5,
min.features = 2000,
project = "seurat_sce")
seurat_sce <- NormalizeData(seurat_sce)
seurat_sce <- FindVariableFeatures(seurat_sce, selection.method = "vst", nfeatures=2000)
VariableFeaturePlot(seurat_sce)
seurat_hvg <- VariableFeatures(seurat_sce)
length(seurat_hvg)
## [1] 2000
head(seurat_hvg)
## [1] "FOS" "ERBB4" "SCD" "SGPL1" "ARID5B" "IGFBPL1"
library(monocle)
gene_ann <- data.frame(
gene_short_name = row.names(ct),
row.names = row.names(ct)
)
sample_ann <- pheno_data
# 然後轉換為AnnotatedDataFrame對象
pd <- new("AnnotatedDataFrame",
data=sample_ann)
fd <- new("AnnotatedDataFrame",
data=gene_ann)
monocle_cds <- newCellDataSet(
ct,
phenoData = pd,
featureData =fd,
expressionFamily = negbinomial.size(),
lowerDetectionLimit=1)
monocle_cds
#
#
#
#
#
#
#
#
#
#
#
#
#
#
monocle_cds <- estimateSizeFactors(monocle_cds)
monocle_cds <- estimateDispersions(monocle_cds)
disp_table <- dispersionTable(monocle_cds)
unsup_clustering_genes <- subset(disp_table, mean_expression >= 0.1)
monocle_cds <- setOrderingFilter(monocle_cds, unsup_clustering_genes$gene_id)
plot_ordering_genes(monocle_cds)
monocle_hvg <- unsup_clustering_genes[order(unsup_clustering_genes$mean_expression, decreasing=TRUE),][,1]
length(monocle_hvg)
## [1] 13986
# 也取前2000個
monocle_hvg <- monocle_hvg[1:2000]
head(monocle_hvg)
## [1] "MALAT1" "TUBA1A" "MAP1B" "SOX4" "EEF1A1" "SOX11"
library(SingleCellExperiment)
library(scran)
scran_sce <- SingleCellExperiment(list(counts=ct))
scran_sce <- computeSumFactors(scran_sce)
scran_sce <- normalize(scran_sce)
fit <- trendVar(scran_sce, parametric=TRUE,use.spikes=FALSE)
dec <- decomposeVar(scran_sce, fit)
plot(dec$mean, dec$total, xlab="Mean log-expression",
ylab="Variance of log-expression", pch=16)
curve(fit$trend(x), col="dodgerblue", add=TRUE)
scran_df <- dec
scran_df <- scran_df[order(scran_df$bio, decreasing=TRUE),]
scran_hvg <- rownames(scran_df)[scran_df$FDR<0.1]
length(scran_hvg)
## [1] 875
head(scran_hvg)
## [1] "IGFBPL1" "STMN2" "ANP32E" "EGR1" "DCX" "XIST"
library(M3DExampleData)
# 需要提供表達矩陣(expr_mat)=》normalized or raw (not log-transformed)
HVG <-M3Drop::BrenneckeGetVariableGenes(expr_mat=ct, spikes=NA, suppress.plot=FALSE, fdr=0.1, minBiolDisp=0.5, fitMeanQuantile=0.8)
M3Drop_hvg <- rownames(HVG)
length(M3Drop_hvg)
## [1] 917
head(M3Drop_hvg)
## [1] "CCL2" "EMP1" "ZNF630" "NRDE2" "PLCL1" "KCTD21"
來自:(Brennecke et al 2013 method) Extract genes with a squared coefficient of variation >2 times the fit regression
實現了:Select the highly variable genes based on the squared coefficient of variation and the mean gene expression and return the RPKM matrix the the HVG
if(T){
getMostVarGenes <- function(
data=data,
fitThr=1.5,
minMeanForFit=1
){
data_no0 <- as.matrix(
data[rowSums(data)>0,]
)
meanGeneExp <- rowMeans(data_no0)
names(meanGeneExp)<- rownames(data_no0)
varGenes <- rowVars(data_no0)
cv2 <- varGenes / meanGeneExp^2
useForFit <- meanGeneExp >= minMeanForFit
fit <- glmgam.fit( cbind( a0 = 1,
a1tilde = 1/meanGeneExp[useForFit] ),
cv2[useForFit] )
a0 <- unname( fit$coefficients["a0"] )
a1 <- unname( fit$coefficients["a1tilde"])
fit_genes <- names(meanGeneExp[useForFit])
cv2_fit_genes <- cv2[useForFit]
fitModel <- fit$fitted.values
names(fitModel) <- fit_genes
HVGenes <- fitModel[cv2_fit_genes>fitModel*fitThr]
print(length(HVGenes))
plot_meanGeneExp <- log10(meanGeneExp)
plot_cv2 <- log10(cv2)
plotData <- data.frame(
x=plot_meanGeneExp[useForFit],
y=plot_cv2[useForFit],
fit=log10(fit$fitted.values),
HVGenes=log10((fit$fitted.values*fitThr))
)
p <- ggplot(plotData, aes(x,y)) +
geom_point(size=0.1) +
geom_line(aes(y=fit), color="red") +
geom_line(aes(y=HVGenes), color="blue") +
theme_bw() +
labs(x = "Mean expression (log10)", y="CV2 (log10)")+
ggtitle(paste(length(HVGenes), " selected genes", sep="")) +
theme(
axis.text=element_text(size=16),
axis.title=element_text(size=16),
legend.text = element_text(size =16),
legend.title = element_text(size =16 ,face="bold"),
legend.position= "none",
plot.title = element_text(size=18, face="bold", hjust = 0.5),
aspect.ratio=1
)+
scale_color_manual(
values=c("#595959","#5a9ca9")
)
print(p)
HVG <- data_no0[rownames(data_no0) %in% names(HVGenes),]
return(HVG)
}
}
library(statmod)
diy_hvg <- rownames(getMostVarGenes(ct))
看起來代碼比較長,主要是繪圖的時候拖拉了。
image-20191127150504326length(diy_hvg)
## [1] 2567
head(diy_hvg)
## [1] "A2M" "A2MP1" "AAMP" "ABCA11P" "ABCA3" "ABCB10"
require(UpSetR)
input <- fromList(list(seurat=seurat_hvg, monocle=monocle_hvg,
scran=scran_hvg, M3Drop=M3Drop_hvg, diy=diy_hvg))
upset(input)
很多圖都是可以美化的,不過並不是我們的重點
image-20191127150523192可以看到,不同的R包和方法,差異不是一般的大啊!!!
更多方法比如基尼係數:findHVG: Finding highly variable genes (HVG) based on Gini 見:https://rdrr.io/github/jingshuw/descend/man/findHVG.html
趕快試一下吧,探索的過程中,你就會對單細胞數據處理的認知加強了。
什麼才是你應該學的單細胞如果是10X儀器的單細胞轉錄組數據走cellranger流程,我們在單細胞天地多次分享過流程筆記:
如果是smart-seq2技術,首先走單細胞下遊分析標準流程啊,就是那些R包的認知,包括 scater,monocle,Seurat,scran,M3Drop 需要熟練掌握它們的對象,:一些單細胞轉錄組R包的對象 ,分析流程也大同小異:
step1: 創建對象
step2: 質量控制
step3: 表達量的標準化和歸一化
step4: 去除幹擾因素(多個樣本整合)
step5: 判斷重要的基因
step6: 多種降維算法
step7: 可視化降維結果
step8: 多種聚類算法
step9: 聚類後找每個細胞亞群的標誌基因
step10: 繼續分類
最後說幾句話