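The official publications of the TCGA project are listed at: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/publications
The English titles are easy to extract and tidy up; here is the full list: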
Comprehensive genomic characterization defines human glioblastoma genes and core pathways
Integrated genomic analyses of ovarian carcinoma
Comprehensive molecular characterization of human colon and rectal cancer
Comprehensive molecular portraits of human breast tumours
Comprehensive genomic characterization of squamous cell lung cancers
Integrated genomic characterization of endometrial carcinoma
Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia
Comprehensive molecular characterization of clear cell renal cell carcinoma
The Cancer Genome Atlas Pan-Cancer analysis project
The somatic genomic landscape of glioblastoma
Comprehensive molecular characterization of urothelial bladder carcinoma
Comprehensive molecular profiling of lung adenocarcinoma
Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin
The Somatic Genomic Landscape of Chromophobe Renal Cell Carcinoma
Comprehensive molecular characterization of gastric adenocarcinoma
Integrated genomic characterization of papillary thyroid carcinoma
Comprehensive genomic characterization of head and neck squamous cell carcinomas
Genomic Classification of Cutaneous Melanoma
Comprehensive, Integrative Genomic Analysis of Diffuse Lower-Grade Gliomas
Comprehensive Molecular Portraits of Invasive Lobular Breast Cancer
The Molecular Taxonomy of Primary Prostate Cancer
Comprehensive Molecular Characterization of Papillary Renal-Cell Carcinoma
Comprehensive Pan-Genomic Characterization of Adrenocortical Carcinoma
Distinct patterns of somatic genome alterations in lung adenocarcinomas and squamous cell carcinomas
Integrated genomic characterization of oesophageal carcinoma
Comprehensive Molecular Characterization of Pheochromocytoma and Paraganglioma
Integrated Molecular Characterization of Uterine Carcinosarcoma
Integrative Genomic Analysis of Cholangiocarcinoma Identifies Distinct IDH-Mutant Molecular Profiles
Integrated genomic and molecular characterization of cervical cancer
Comprehensive and Integrative Genomic Characterization of Hepatocellular Carcinoma
Integrative Analysis Identifies Four Molecular and Clinical Subsets in Uveal Melanoma
Integrated Genomic Characterization of Pancreatic Ductal Adenocarcinoma
Comprehensive Molecular Characterization of Muscle-Invasive Bladder Cancer
Comprehensive and Integrated Genomic Characterization of Adult Soft Tissue Sarcomas
The Integrated Genomic Landscape of Thymic Epithelial Tumors
Pan-cancer Alterations of the MYC Oncogene and Its Proximal Network across the Cancer Genome Atlas
Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines
Molecular Characterization and Clinical Relevance of Metabolic Expression Subtypes in Human Cancers
Systematic Analysis of Splice-Site-Creating Mutations in Cancer
Somatic Mutational Landscape of Splicing Factor Genes and Their Functional Consequences across 33 Cancer Types
The Cancer Genome Atlas Comprehensive Molecular Characterization of Renal Cell Carcinoma
Pan-Cancer Analysis of lncRNA Regulation Supports Their Targeting of Cancer Genes in Each Tumor Context
Spatial Organization and Molecular Correlation of Tumor-Infiltrating Lymphocytes Using Deep Learning on Pathology Images
Machine Learning Detects Pan-cancer Ras Pathway Activation in The Cancer Genome Atlas
Genomic and Molecular Landscape of DNA Damage Repair Deficiency across The Cancer Genome Atlas
Driver Fusions and Their Implications in the Development and Treatment of Human Cancers
Genomic, Pathway Network, and Immunologic Features Distinguishing Squamous Carcinomas
Integrated Genomic Analysis of the Ubiquitin Pathway across Cancer Types
SnapShot: TCGA-Analyzed Tumors
The Cancer Genome Atlas: Creating Lasting Value beyond Its Data
Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation
Oncogenic Signaling Pathways in The Cancer Genome Atlas
Perspective on Oncogenic Processes at the End of the Beginning of Cancer Genomics
Comprehensive Characterization of Cancer Driver Genes and Mutations
An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics
Pathogenic Germline Variants in 10,389 Adult Cancers
A Pan-Cancer Analysis of Enhancer Expression in Nearly 9000 Patient Samples
Genomic and Functional Approaches to Understanding Cancer Aneuploidy
A Comprehensive Pan-Cancer Molecular Study of Gynecologic and Breast Cancers
Comparative Molecular Analysis of Gastrointestinal Adenocarcinomas
lncRNA Epigenetic Landscape Analysis Identifies EPIC1 as an Oncogenic lncRNA that Interacts with MYC and Promotes Cell-Cycle Progression in Cancer
The Immune Landscape of Cancer
Integrated Molecular Characterization of Testicular Germ Cell Tumors
Comprehensive Analysis of Alternative Splicing Across Tumors from 8,705 Patients
A Pan-Cancer Analysis Reveals High-Frequency Genetic Alterations in Mediators of Signaling by the TGF-β Superfamily
Integrative Molecular Characterization of Malignant Pleural Mesothelioma
The chromatin accessibility landscape of primary human cancers
Comprehensive Molecular Characterization of the Hippo Signaling Pathway in Cancer
Before and After: Comparison of Legacy and Harmonized TCGA Genomic Data Commons' Data
Comprehensive Analysis of Genetic Ancestry and Its Molecular Correlates in Cancer
A quick Bing search for the keywords "word cloud in r" turns up a solution. The first link is http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know and its code is split into 5 steps.
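Step 1: Create a text file
Step 2: Install and load the required packages
Step 4: Build a term-document matrix
Step 5: Generate the Word cloud
Generally speaking, anyone with basic R will follow this easily. If you don't know R yet, I suggest reading:
and working through the roadmap of R knowledge points, shown below: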
The core of the code is the wordcloud function, but the input data that wordcloud expects has to be prepared with some care.
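To make that input concrete: what wordcloud ultimately takes is a vector of words plus a matching vector of frequencies. Below is a minimal base R sketch that builds such a table by hand from three of the titles above, without tm. The helper `count_words` and its tiny stop-word list are made up for illustration; the tm pipeline used in this post cleans the text more thoroughly.

```r
# Three of the TCGA titles listed above, one per element
titles <- c(
  "Comprehensive molecular portraits of human breast tumours",
  "Comprehensive molecular characterization of gastric adenocarcinoma",
  "Integrated genomic characterization of endometrial carcinoma"
)

# Hypothetical helper: lower-case, split on non-letters, drop a few filler words
count_words <- function(x, drop = c("of", "and", "the", "in", "a")) {
  words <- unlist(strsplit(tolower(x), "[^a-z]+"))
  words <- words[nchar(words) > 0 & !(words %in% drop)]
  freq <- sort(table(words), decreasing = TRUE)
  data.frame(word = names(freq), freq = as.integer(freq),
             stringsAsFactors = FALSE)
}

d_small <- count_words(titles)
head(d_small)
```

A table shaped like `d_small` can then be handed to wordcloud as `words = d_small$word, freq = d_small$freq`.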
# Installing R packages should need no further explanation
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
# Here we simply read the data straight from the clipboard
# Before running the next line, make sure you have copied the article titles compiled above!
text <- readLines(pipe("pbpaste"))
# pbpaste is the macOS way; Windows seems to differ slightly, so adjust for your own system
# Load the data as a corpus
docs <- Corpus(VectorSource(text))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop word
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
The layout of the word cloud can come out differently from run to run. The result looks like this:
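[Figure: word cloud of the TCGA article titles]
All it really does is visualize the word frequencies:
> head(d, 10)
               word freq
1  characterization   25
2         molecular   25
3           genomic   24
4            cancer   23
5     comprehensive   22
6          analysis   13
7        integrated   12
8         carcinoma   11
9              cell    8
10           genome    8
Words that occur more often are drawn larger in the word cloud, and that is all there is to it.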
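One practical note on the `readLines(pipe("pbpaste"))` call in the script: `pbpaste` only exists on macOS. A hedged sketch of a cross-platform variant follows. The helper names `clipboard_source` and `read_clipboard_lines` are made up here, the Windows branch relies on R's `file("clipboard")` connection, and the Linux branch assumes the `xclip` command-line tool is installed.

```r
# Hypothetical helper: describe where clipboard text comes from on each platform
clipboard_source <- function(sysname = Sys.info()[["sysname"]]) {
  switch(sysname,
         Darwin  = list(type = "pipe", target = "pbpaste"),
         Windows = list(type = "file", target = "clipboard"),
         # Fallback for Linux; assumes xclip is installed
         list(type = "pipe", target = "xclip -selection clipboard -o"))
}

# Hypothetical helper: read the clipboard as a character vector of lines
read_clipboard_lines <- function(sysname = Sys.info()[["sysname"]]) {
  src <- clipboard_source(sysname)
  con <- if (src$type == "pipe") pipe(src$target) else file(src$target)
  on.exit(close(con))
  readLines(con, warn = FALSE)
}

# Usage, after copying the titles:
# text <- read_clipboard_lines()
```

Keeping the platform decision in its own small function means it can be checked without touching the real clipboard.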
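Three years ago I put together a video tutorial, a knowledge map of the TCGA cancer database, and a year and a half ago I released it for free on the 生信技能樹 Bilibili channel. By now it has just about reached 20,000 views.
The view count is shown below:
The video table of contents is: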
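P1-TCGA-101-Course introduction-What background knowledge you need
P2-TCGA-102-Course guide-How to use my GitHub code
P3-TCGA-103-The TCGA database is genuinely useful-Not just for padding out papers
P4-TCGA-201-Background and a roundup of web tools
P5-TCGA-202-Introduction to other databases
P6-TCGA-203-Using the Xena web tool
P7-TCGA-204-Using the firehose web tool
P8-TCGA-205-Patterns in the published articles
P9-TCGA-301-Introduction to data download methods
P10-TCGA-302-Hands-on data download with GDC
P11-TCGA-303-Wrangling GDC data
P12-TCGA-304-Downloading data with GDC, continued
P13-TCGA-305-Downloading and extracting data with the R-TCGA package
P14-TCGA-306-Using GDC and firehose to download TCGA stomach cancer methylation data
P15-TCGA-307-Using GDC and Xena to download RNA-Seq expression matrices and compare them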
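Xiaojie (小潔), an excellent R instructor on our 生信技能樹 team, also worked through my full set of videos and, building on her own understanding, contributed a set of notes for everyone:
Counting them up, Xiaojie wrote 17 TCGA-related notes. They are now organized in full here, and one of the year's standout posts is born. To restate, in her words: this series is my record of learning TCGA, following the 生信技能樹 course on Bilibili, and is published with permission. Course link: https://www.bilibili.com/video/av49363776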
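I. Data download
1. The official GDC tool
You need to download the GDC software for your operating system from the official site and put it in your working directory.
I wrote three posts about this tool:
(1) GDC data download
(2) GDC data wrangling
(3) GDC data wrangling, continued
This method requires a solid grounding in the Linux command line and in R. Even just understanding the code takes some time.
2. The R package TCGAbiolinks
Downloading data with the R package TCGAbiolinks
This is a workflow written entirely in R. It downloads the latest data and is still built on GDC, but it is more integrated and much simpler to operate. Apart from the time it takes to study the parameters, I found no real drawbacks.
3. The R package RTCGA
Getting data with the RTCGA package
This is a database-style package with all the data bundled in. That makes the package very large and the data not the latest, but it is the simplest option.
To sum up, all three methods download the expression matrix and the meta information separately. Because some patients have both a tumor sample and a normal sample, the two are not in one-to-one correspondence, and lining them up takes a bit of R skill.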
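The mismatch between samples and patients can be sketched in base R. The barcodes below are invented for illustration and the `clinical` table is hypothetical. The sketch assumes the standard TCGA barcode layout, in which characters 1 to 12 identify the patient and characters 14 to 15 give the sample type code (01 to 09 tumor, 10 to 19 normal, 20 to 29 control).

```r
# Made-up barcodes in the usual TCGA layout (illustration only)
barcodes <- c(
  "TCGA-AA-0001-01A-11R-A000-07",  # tumor sample, patient TCGA-AA-0001
  "TCGA-AA-0001-11A-11R-A000-07",  # normal sample, same patient
  "TCGA-BB-0002-01A-11R-A000-07"   # tumor sample, patient TCGA-BB-0002
)

# Characters 1-12 identify the patient, characters 14-15 the sample type
patient_id  <- substr(barcodes, 1, 12)
sample_code <- as.integer(substr(barcodes, 14, 15))
group <- ifelse(sample_code < 10, "tumor",
                ifelse(sample_code < 20, "normal", "control"))

# Hypothetical per-patient meta information (one row per patient)
clinical <- data.frame(patient_id = c("TCGA-AA-0001", "TCGA-BB-0002"),
                       stage = c("II", "III"),
                       stringsAsFactors = FALSE)

# One row per sample: patient-level fields are repeated where needed
samples <- data.frame(barcode = barcodes, patient_id, group,
                      stringsAsFactors = FALSE)
sample_info <- merge(samples, clinical, by = "patient_id")
print(sample_info)
```

Merging on the patient ID repeats the patient-level fields for every sample, which gives one row of meta information per column of the expression matrix.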
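II. Differential analysis
TCGA (transcriptome) differential analysis: the three major R packages and a comparison of their results
Differential analysis is run separately with each of the three major transcriptome R packages: DESeq2, limma and edgeR.
III. Survival analysis
Two methods for batch survival analysis on TCGA
Visualizing the survival analysis of a single gene is simple. There are very good R packages for it, and the resulting plots are both good-looking and informative.
IV. Building survival models
The course uses four algorithms in total to build models:
Whichever algorithm is used, the core comes down to just a few lines of code.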
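To give a feel for how short this kind of code can be, here is a minimal sketch of a single-gene survival analysis on simulated data. It is my own illustration and not code from the course: the choice of the survival package, the median split and every variable name are assumptions, and it does not claim to be one of the four algorithms.

```r
library(survival)  # assumed choice; the post does not name a package

set.seed(1)
n <- 200
expr   <- rnorm(n)                               # simulated expression of one gene
time   <- rexp(n, rate = 0.1 * exp(0.5 * expr))  # higher expression, shorter survival
status <- rbinom(n, 1, 0.7)                      # 1 = event observed, 0 = censored
dat <- data.frame(time, status, expr,
                  group = ifelse(expr > median(expr), "high", "low"))

# Kaplan-Meier curves and log-rank test for high vs low expression
fit_km  <- survfit(Surv(time, status) ~ group, data = dat)
lr_test <- survdiff(Surv(time, status) ~ group, data = dat)

# Cox proportional hazards model on the continuous expression value
fit_cox <- coxph(Surv(time, status) ~ expr, data = dat)

print(lr_test)
summary(fit_cox)$coefficients
# plot(fit_km, col = c("red", "blue"), xlab = "Time", ylab = "Survival probability")
```

With real TCGA data, `dat` would come from the merged expression and clinical tables instead of the simulation.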