使用STRING資料庫進行蛋白互作分析是生信常規下遊分析項目之一。
本文將通過R包STRINGdb來進行string蛋白互作分析,同時會利用igraph和ggraph對互作網絡進行可視化。
STRINGdb包用於蛋白互作分析STRINGdb包有別於其他的R包,它的幫助信息不是使用help函數查看,而是傳給STRINGdb$help(),如使用STRINGdb$help("map")查看map函數的幫助。如需要查看它的vignette文檔,可以使用vignette("STRINGdb")命令。
map和plot_network用於蛋白互作分析
進行分析前,先要創建一個STRINGdb對象:
library(tidyverse)
library(clusterProfiler)
library(org.Mm.eg.db)
library(STRINGdb)
library(igraph)
library(ggraph)
# 創建STRINGdb對象
string_db <- STRINGdb$new( version="10", species=10090,
score_threshold=400, input_directory="")
其中species代表物種編碼,它是NCBI Taxonomy的物種編碼(http://www.ncbi.nlm.nih.gov/taxonomy),其中人的編碼為9606,小鼠的編碼為10090,這裡使用小鼠的編碼,因為下面分析的基因來源於小鼠。
score_threshold是蛋白互作的得分,此值會用於篩選互作結果,400是默認分值,如果要求嚴格可以調高此值。
# 需要分析的基因
# 這些基因我測試過,就不再找其他的基因測試了(比如clusterProfiler裡面的geneList基因集)
gene <- c('Drd2','Lrrc10b','Adora2a','Gpr6','Syndig1l','Rgs9','Drd1','Scn4b','Gpr88','Pde10a',
'Rasd2','Ppp1r1b','Pde1b','Adcy5','Gng7','Arpp21','Hpca','Kcnab1','Ptpn5','St8sia3',
'Spock3','Penk','Rxrg','Tmem158','Pcp4','Tac1','Gnal','Rasgrp2','Ano3','Vxn','Trbc2',
'Ighm','Tbr1','Nptx1','Nr4a2','Ccn2','Plp1','Mbp','Trf','Cnp','Mobp','Qdpr','Mal',
'Mag','Plekhb1','Cldn11','Cryab','Car2','Tspan2','Csrp1','Sept4','Bcas1','Mog','Qk',
'Ermn','Ndrg1','Apod','Gatm','Olig1','Gpr37','Tmem88b','Pllp','Dbndd2','Opalin',
'Ppp1r14a','Ugt8a','Fa2h','Grb14','Gsn','Evi2a','Lamp5','Rasgrf2','Eomes','Ms4a15',
'Th','Cdhr1','Nmb','Doc2g','S100a5','Islr','Ly6g6e','Calb2','Trh','Fabp7','Slc6a11',
'Csdc2','Nrip3','Lbhd2','Reln','Rab3b','Ptn','Kctd12','Ptgds','Mgp','Stx1a','Baiap3',
'Nts','Zcchc12','Nap1l5','Resp18','Tmem130','Ndn','Gprasp2','Lypd1','Ahi1','Hap1',
'Hpcal1','Pnck','Vat1l','Wdr6','Nnat','6330403K07Rik','Sst','Pdyn','Tac1','Strip2',
'Drd1','Wfs1','Ppp1r2','Penk','Rgs9','Pde1b','Pde10a','Gpr88','Ppp1r1b','Tafa2',
'Lmo3','Slc30a3','Nptxr','Rprm','Cartpt','Rhcg','Ctxn3','Sp8','Lgr6','Dlx2','Dlx1',
'Dcx','Shisa8','Ablim3','Sp9','Tuba8','Myo16','Isoc1','Sall1','Cacng5','Pbx3','Gpsm1',
'Gng4','Kank3','Tshz1','Pcbp3','Ptpro','Cpne4','Ppp1r12b','Synpr','Gm17750','Ampd2',
'Meis2','Ankrd6','Ppm1e','Pbx1','Inpp5j','Grb2','Csdc2','Gprc5b','Nrep','Rgs12',
'Cpne6','Kcnip1','Rasa2','Dclk2','Gprin1','Btg1','Dync1i1','Sox4','Gad1','Gm27032',
'Ppp1cc','Syt6','Eml5','Rragd','Pcp4l1','Nrip3','Rasl11b','Gdpd5','Prkca','Marcksl1',
'Tubb2b','Pacsin2','Kcnf1','Hap1','Calb2','Stxbp6','Nr2f2','Six3','Ramp3','Pvalb',
'Unc13c','Vamp1','Gpx1','Kcnc1','Spp1','Gad1','Agt','Lgi2','Nefh','Prkcd','Esrrg',
'Kcnip1','Prr32','Cldn2','Calml4','Slc4a5','Otx2','Kcne2','Steap1','Tmem72','Ccdc153',
'Tmem212','F5','Col8a2','Kcnj13','Wfdc2','Folr1','Cfap126','Lbp','2900040C04Rik',
'Trpv4','Sostdc1','Col8a1','Prlr','Tuba1c','Dynlrb2','Wdr86','Abca4','Msx1','Kl',
'Gm19935','Foxj1','Ecrg4','Rdh5','Rsph1','Mia','Cdkn1c','Aqp1','Cox6b2','Clic6',
'Prelp','Slc2a12','Slc13a4','Col9a3','Ace','1110017D15Rik','Igf2','Bst2','Car12',
'Lgals3bp','Cab39l','Slc29a4','Sulf1','Pcolce','Ttr','Enpp2','Fxyd1','Slc4a2','Pltp',
'Igfbp2','Rbp1','Rarres2','Cd24a','Ezr','Trpm3','1700094D03Rik','Car14','Ctsd',
'Slc31a1','Ifi27','Cd9','Ifitm3','Dbi','Ccnd2','Vamp8','Ucp2','Cd63','Hemk1','Spint2',
'Slc12a2','Plxnb2','Ephx1','Stk39','Tcn2','Vim','Emb','Hba-a2','Hbb-bs','Hba-a1',
'Gfap','Spink8','Rasd1','Shisa6','Npy2r','Cabp7','Olfml2b','Homer3','Neurod6','Nr4a3',
'Tanc1','Cnih2','Gabra5','Neurod2','Wipf3','Nr3c2','Zbtb18','Pcdh20','Arhgef25',
'Prdm8','Epha4','Arpc5','Nell2','Hpca','Slit1','Crym','Hs3st4','Prkca','Chgb','Cpne6',
'Dock9','Prkcg','Cpne7','Grin2a','Dkk3','Nt5dc3','Ak5','Rgs14','St6galnac5','Nptxr',
'Nptx1','Bok')
# 將Gene Symbol轉換為Entrez ID
gene <- gene %>% bitr(fromType = "SYMBOL",
toType = "ENTREZID",
OrgDb = "org.Mm.eg.db",
drop = T)
使用map函數用於將基因匹配到STRING資料庫的ID,map函數的幫助信息可以查看STRINGdb$help("map")。然後plot_network繪圖即可。
data_mapped <- gene %>% string_db$map(my_data_frame_id_col_names = "ENTREZID",
removeUnmappedRows = TRUE)
string_db$plot_network( data_mapped$STRING_id )
可以發現和官網的出圖是一樣的。這裡不是重點,只是為了說明STRINGdb的基礎用法,就不再展開了。
使用get_interactions獲取互作信息用於後續可視化分析
使用get_interactions獲取蛋白互作信息,以用於後續可視化。
data_links <- data_mapped$STRING_id[1:100] %>% string_db$get_interactions()
data_links包含了蛋白互作的信息,比較重要的是前兩列和最後一列:from、to、combined_score,前兩列指定了蛋白互作關係的基因對,最後一列是此蛋白互作關係的得分。
data_links數據將用於後續分析。
使用igraph和ggraph可視化蛋白互作網絡圖先使用igraph創建網絡數據,並進行必要的處理,然後轉到ggraph繪圖。
# 轉換stringID為Symbol,只取前兩列和最後一列
links <- data_links %>%
mutate(from = data_mapped[match(from, data_mapped$STRING_id), "SYMBOL"]) %>%
mutate(to = data_mapped[match(to, data_mapped$STRING_id), "SYMBOL"]) %>%
dplyr::select(from, to , last_col()) %>%
dplyr::rename(weight = combined_score)
# 節點數據
nodes <- links %>% { data.frame(gene = c(.$from, .$to)) } %>% distinct()
# 創建網絡圖
# 根據links和nodes創建
net <- igraph::graph_from_data_frame(d=links,vertices=nodes,directed = F)
# 添加一些參數信息用於後續繪圖
# V和E是igraph包的函數,分別用於修改網絡圖的節點(nodes)和連線(links)
igraph::V(net)$deg <- igraph::degree(net) # 每個節點連接的節點數
igraph::V(net)$size <- igraph::degree(net)/5 #
igraph::E(net)$width <- igraph::E(net)$weight/10
# 使用ggraph繪圖
# ggraph是基於ggplot2的包,語法和常規ggplot2類似
ggraph(net,layout = "kk")+
geom_edge_fan(aes(edge_width=width), color = "lightblue", show.legend = F)+
geom_node_point(aes(size=size), color="orange", alpha=0.7)+
geom_node_text(aes(filter=deg>5, label=name), size = 5, repel = T)+
scale_edge_width(range = c(0.2,1))+
scale_size_continuous(range = c(1,10) )+
guides(size=F)+
theme_graph()
這裡的參數設置是節點的大小和其連接的線的數量有關,線數量越多則點越大;線的寬度和其蛋白互作的得分有關,得分越高則越寬。只顯示節點數大於5的基因名稱。
布局方式有多種,除了剛才的kk方法,還有stress布局:
ggraph(net,layout = "stress")+
geom_edge_fan(aes(edge_width=width), color = "lightblue", show.legend = F)+
geom_node_point(aes(size=size), color="orange", alpha=0.7)+
geom_node_text(aes(filter=deg>5, label=name), size = 5, repel = T)+
scale_edge_width(range = c(0.2,1))+
scale_size_continuous(range = c(1,10) )+
guides(size=F)+
theme_graph()
環形布局:
ggraph(net,layout = "linear", circular = TRUE)+
geom_edge_fan(aes(edge_width=width), color = "lightblue", show.legend = F)+
geom_node_point(aes(size=size), color="orange", alpha=0.7)+
geom_node_text(aes(filter=deg>5, label=name), size = 5, repel = F)+
scale_edge_width(range = c(0.2,1))+
scale_size_continuous(range = c(1,10) )+
guides(size=F)+
theme_graph()
上面的幾張圖可以看到有幾個單一的互作關係,其和主網絡並不相連,像這種互作關係可以去掉,此時出來的圖就會更加美觀。
# 去除游離的互作關係
# 如果links數據框的一個link的from只出現過一次,同時to也只出現一次,則將其去除
links_2 <- links %>% mutate(from_c = count(., from)$n[match(from, count(., from)$from)]) %>%
mutate(to_c = count(., to)$n[match(to, count(., to)$to)]) %>%
filter(!(from_c == 1 & to_c == 1)) %>%
dplyr::select(1,2,3)
# 新的節點數據
nodes_2 <- links_2 %>% { data.frame(gene = c(.$from, .$to)) } %>% distinct()
# 創建網絡圖
net_2 <- igraph::graph_from_data_frame(d=links_2,vertices=nodes_2,directed = F)
# 添加必要的參數
igraph::V(net_2)$deg <- igraph::degree(net_2)
igraph::V(net_2)$size <- igraph::degree(net_2)/5
igraph::E(net_2)$width <- igraph::E(net_2)$weight/10
如果去除了游離的互作關係,那麼可以使用一種中心布局的方式,它是根據一個節點的連接數而排列其位置,連接數越大,節點越傾向於在中間位置排列,會更容易看得出重要節點。
ggraph(net_2,layout = "centrality", cent = deg)+
geom_edge_fan(aes(edge_width=width), color = "lightblue", show.legend = F)+
geom_node_point(aes(size=size), color="orange", alpha=0.7)+
geom_node_text(aes(filter=deg>5, label=name), size = 5, repel = T)+
scale_edge_width(range = c(0.2,1))+
scale_size_continuous(range = c(1,10) )+
guides(size=F)+
theme_graph()
另外環形布局的線使用弧形線(geom_edge_arc)會更美觀:
ggraph(net,layout = "linear", circular = TRUE)+
geom_edge_arc(aes(edge_width=width), color = "lightblue", show.legend = F)+
geom_node_point(aes(size=size), color="orange", alpha=0.7)+
geom_node_text(aes(filter=deg>5, label=name), size = 5, repel = F)+
scale_edge_width(range = c(0.2,1))+
scale_size_continuous(range = c(1,10) )+
guides(size=F)+
theme_graph()
除了使用igraph創建網絡圖外,也可以使用tidygraph的as_tbl_graph函數處理數據,然後使用ggraph繪圖:
links_2 %>% tidygraph::as_tbl_graph() %>%
ggraph(layout = "kk")+
geom_edge_fan(color = "grey")+
geom_node_point(size=5, color="blue", alpha=0.8)+
geom_node_text(aes(label=name), repel = T)+
theme_void()
參考資料
網絡數據可視化(3修訂版)ggraph:ggplot2的網絡可視化 https://mp.weixin.qq.com/s/jV8KEMudy9-L_EYwvAgYSA
文末友情推薦要想真正入門生物信息學建議務必購買全套書籍,一點一滴攻克計算機基礎知識,書單在:什麼,生信入門全套書籍僅需160 。如果大家沒有時間自行慢慢摸索著學習,可以考慮我們生信技能樹官方舉辦的學習班:
•數據挖掘學習班第5期(線上直播3周,馬拉松式陪伴,帶你入門),原價4800的數據挖掘全套課程, 疫情期間半價即可搶購。•生信爆款入門-第7期(線上直播4周,馬拉松式陪伴,帶你入門),原價9600的生信入門全套課程,疫情期間3.3折即可搶購。
如果你課題涉及到轉錄組,歡迎添加一對一客服:詳見:你還在花三五萬做一個單細胞轉錄組嗎?