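A lazy person can never enjoy the pleasure of rest.
"Written at Huizhong" (《回中作》)
The vast cold sky, its distant hues heavy with sorrow; the garrison horn wails up the high tower.
A girl of Wu pipes her longing on twin flutes; a traveler from Yan sings a mournful song, taking leave of the five marquises.
A thousand li of passes and mountains, frontier grass at dusk; a single spark of beacon fire under the northern clouds of autumn.
At night the frost lies heavy and the west wind rises; the Long River, soundless, is frozen and does not flow.
--- Wen Tingyun (溫庭筠)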
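On to the main topic.
TCGA (The Cancer Genome Atlas) is a project jointly overseen by the US National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). It applies high-throughput genome analysis technologies to improve our understanding of cancer, and with it the ability to prevent, diagnose and treat the disease.
As the largest cancer genomics database currently available, TCGA is comprehensive not only in the number of cancer types it covers but also in its multi-omics data: gene expression, miRNA expression, copy number variation, DNA methylation and SNPs. Compared with GEO, TCGA's main advantages are its rich, standardized clinical data and its large sample size for each cancer type.
All TCGA data is now hosted in the GDC (Genomic Data Commons), which also hosts the TARGET data. Within the GDC, TCGA data can be obtained in two ways: through the GDC Data Portal or through the GDC Legacy Archive. The Data Portal holds the latest data, processed to a uniform standard, although some of it is not yet open; the Legacy Archive holds all of the unprocessed data and is more complete.
Below are four ways to download TCGA data.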
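1. TCGAbiolinks
TCGAbiolinks is an R/Bioconductor package for integrative analysis of TCGA data. It uses the GDC Application Programming Interface (API) to access the National Cancer Institute (NCI) Genomic Data Commons (GDC), so that TCGA data can be searched, downloaded and prepared for analysis in R.

    BiocManager::install("TCGAbiolinks")
    ERROR: dependency 'purrrogress' is not available for package 'TCGAbiolinks'
    * removing 'D:/R/R-3.6.3/library/TCGAbiolinks'

I tried many fixes and none of them solved this; there may simply be no purrrogress build that works with R 3.6.3.
Switch to R 4.0.1.
After switching, restart R (Ctrl+Shift+F10 in RStudio).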
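Before reinstalling, it can help to confirm that the restarted session really runs the new R version. This is a minimal sketch using only base R; the 4.0.0 threshold is an assumption drawn from the failure on 3.6.3 described above, not a documented requirement of TCGAbiolinks.

```r
# Assumed threshold: the install failed on R 3.6.3 and worked after moving to 4.x.
min_version <- "4.0.0"

running <- getRversion()
version_ok <- running >= min_version

# requireNamespace() returns FALSE instead of raising an error when a package is absent.
have_biocmanager <- requireNamespace("BiocManager", quietly = TRUE)

message("R version: ", as.character(running),
        if (version_ok) " (new enough)" else " (older than the assumed minimum)")
message("BiocManager installed: ", have_biocmanager)
```

If `version_ok` is FALSE, the session is probably still attached to the old installation.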
    ## continue installing the R package
    BiocManager::install("TCGAbiolinks")
    library(TCGAbiolinks)

See which projects can be downloaded

    > getGDCprojects()$id
     [1] "GENIE-MSK"     "TCGA-UCEC"           "TCGA-LGG"              "TCGA-SARC"
     [5] "TCGA-PAAD"     "TCGA-ESCA"           "TCGA-PRAD"             "GENIE-VICC"
     [9] "TCGA-LAML"     "TCGA-KIRC"           "TCGA-PCPG"             "TCGA-HNSC"
    [13] "GENIE-JHU"     "TCGA-OV"             "TCGA-GBM"              "TCGA-UCS"
    [17] "TCGA-MESO"     "TCGA-TGCT"           "TCGA-KICH"             "TCGA-READ"
    [21] "TCGA-UVM"      "TCGA-THCA"           "OHSU-CNL"              "GENIE-DFCI"
    [25] "GENIE-NKI"     "GENIE-GRCC"          "FM-AD"                 "GENIE-UHN"
    [29] "GENIE-MDA"     "TCGA-LIHC"           "TCGA-THYM"             "TCGA-CHOL"
    [33] "TARGET-ALL-P1" "ORGANOID-PANCREATIC" "TCGA-DLBC"             "TCGA-KIRP"
    [37] "TCGA-BLCA"     "CPTAC-2"             "TARGET-ALL-P3"         "TARGET-CCSK"
    [41] "TARGET-NBL"    "TARGET-AML"          "TARGET-ALL-P2"         "NCICCR-DLBCL"
    [45] "CTSP-DLBCL1"   "TARGET-RT"           "TARGET-OS"             "TCGA-BRCA"
    [49] "TCGA-COAD"     "TCGA-CESC"           "TCGA-LUSC"             "TCGA-STAD"
    [53] "TCGA-SKCM"     "CMI-MBC"             "CMI-ASC"               "TCGA-LUAD"
    [57] "TARGET-WT"     "TCGA-ACC"            "BEATAML1.0-CRENOLANIB" "BEATAML1.0-COHORT"
    [61] "VAREPOP-APOLLO" "MMRF-COMMPASS"      "WCDT-MCRPC"            "CGCI-BLGSP"
    [65] "CGCI-HTMCP-CC" "CMI-MPC"             "HCMI-CMDC"             "CPTAC-3"

Set the data type: download the counts data for stomach cancer (TCGA-STAD)

    query <- GDCquery(project = "TCGA-STAD",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts")

Download the data with the gdc client tool

    > GDCdownload(query, method = "client")
    Downloading data for project TCGA-STAD
    trying URL 'https:
    Content type 'application/zip' length 16207515 bytes (15.5 MB)
    downloaded 15.5 MB
    GDCdownload will download: 105.663384 MB
    Executing GDC client with the following command:
    C:\Users\ly\DOCUME~1\GDC-CL~1.EXE download -m gdc_manifest.txt
    100% [#######################################################################]

Download the data through the API

    > GDCdownload(query, method = "api")
    Downloading data for project TCGA-STAD
    GDCdownload will download 407 files. A total of 105.663384 MB
    Downloading as: Wed_Feb_24_11_02_20_2021.tar.gz
    Downloading: 110 MB

Yesterday the download ran at only about 10 KB/s; for some reason today it was very fast, around 1 MB/s. This is probably down to the GDC server.
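To reduce the risk of a failed download, the data can be split into chunks

    # download in chunks of 20 files to reduce the risk of a failed download
    GDCdownload(query, method = "api", files.per.chunk = 20)

method: use the API (POST method) or the gdc client tool. The options are "api" and "client". The API is faster, but data may be corrupted during the download, in which case the call has to be run again.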
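Since a failed download may simply need to be run again, the re-run can be automated. The sketch below is a hypothetical `retry()` helper written in base R, not part of TCGAbiolinks; it is demonstrated on a stand-in function that fails twice before succeeding, so it runs without network access.

```r
# Hypothetical helper: call `f` until it succeeds or `max_tries` is used up.
retry <- function(f, max_tries = 3, wait_seconds = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      list(ok = TRUE, value = f()),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (result$ok) {
      return(result$value)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result$error))
    if (attempt < max_tries) Sys.sleep(wait_seconds)
  }
  stop("All ", max_tries, " attempts failed")
}

# Stand-in for a flaky download: fails twice, then succeeds.
calls <- 0
flaky_download <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("connection reset")
  "done"
}

outcome <- retry(flaky_download, max_tries = 5)
```

With TCGAbiolinks loaded, the same wrapper could be applied as `retry(function() GDCdownload(query, method = "api", files.per.chunk = 20))`. Files that already exist are reported as downloaded on a re-run, so a retry should only fetch what is missing.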
Check that the data is complete

    > GDCdownload(query, method = "api")
    Downloading data for project TCGA-STAD
    Of the 407 files for download 407 already exist.
    All samples have been already downloaded

Read and tidy the data
Method 1: use the GDCprepare function

    TCGA_RNASeq <- GDCprepare(query, save = TRUE, save.filename = "TCGA_query.Rdata")

Method 2: read the files in a loop
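    ### dir() lists the file names under a path; use it to list every count file
    count_dir = "C:/Users/ly/Documents/GDCdata/TCGA-STAD/harmonized/Transcriptome_Profiling/Gene_Expression_Quantification/"
    count_files = dir(count_dir, pattern = "\\.htseq\\.counts\\.gz$", recursive = TRUE)
    count_files[1]
    [1] "01ebbb59-370a-439f-a2b2-3046055c43d4/14a248c3-0189-424d-8be1-8d8b39c705f0.htseq.counts.gz"
    # keep only the file name (the part after the UUID folder)
    count_files2 = stringr::str_split(count_files, "/", simplify = TRUE)[, 2]
    count_files2[1]
    [1] "14a248c3-0189-424d-8be1-8d8b39c705f0.htseq.counts.gz"
    setwd(count_dir)

Rename the columns to the gene name and the file name

    # note: this rewrites each downloaded file in place with a header row, so run it only once
    n = length(count_files)
    for (i in 1:n){
      new_data = read.table(file = count_files[i], sep = "\t", header = FALSE)
      colnames(new_data) = c("gene.name", count_files2[i])
      write.table(new_data, count_files[i], sep = "\t", quote = FALSE, col.names = TRUE, row.names = FALSE)
    }

Merge the files with the renamed columns

    # check.names = FALSE keeps the file names as column names; otherwise R turns "-"
    # into "." and prefixes names that start with a digit with "X", which breaks the lookup below
    merge_data = read.table(file = count_files[1], sep = "\t", header = TRUE, check.names = FALSE)
    colnames(merge_data)
    library(dplyr)
    for (i in 2:n){
      new_data = read.table(file = count_files[i], sep = "\t", header = TRUE, check.names = FALSE)
      merge_data = full_join(merge_data, new_data, by = "gene.name")
    }
    View(merge_data)

Set the row names

    library(tidyverse)
    merge_data = column_to_rownames(merge_data, "gene.name")
    x = query[[1]][[1]]   # results table of the query, one row per file
    View(x)

Change the column names

    # look up each file name in the query results and use its case barcode instead
    rownames(x) = x$file_name
    x1 = x[colnames(merge_data), ]
    colnames(merge_data) = x1$cases

Method 3: read the files with lapply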
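    count_dir = "C:/Users/ly/Documents/GDCdata/TCGA-STAD/harmonized/Transcriptome_Profiling/Gene_Expression_Quantification"
    count_files = dir(count_dir, pattern = "\\.htseq\\.counts\\.gz$", recursive = TRUE)
    count_files[1]
    [1] "01ebbb59-370a-439f-a2b2-3046055c43d4/14a248c3-0189-424d-8be1-8d8b39c705f0.htseq.counts.gz"
    # read one file: column 1 is the gene ID, column 2 the count
    ex = function(x){
      result <- read.table(file.path(count_dir, x), sep = "\t")
      return(result)
    }
    exp = lapply(count_files, ex)
    # keep the gene column once and bind only the count columns;
    # this assumes every file lists the genes in the same order
    genes = exp[[1]][, 1]
    exp <- do.call(cbind, lapply(exp, function(d) d[, 2]))
    rownames(exp) = genes
    colnames(exp) = basename(count_files)
    View(exp)

Download the cart JSON file to replace the sample IDs. Click Metadata in the cart to download it.
Read the cart JSON file

    meta <- jsonlite::fromJSON("metadata.cart.2021-02-24.json")
    ids <- meta$associated_entities
    > class(ids)
    [1] "list"
    ids[[1]][,1]
    [1] "TCGA-BR-8366-01A-11R-2343-13"
    # take the first column (the barcode) of each entry
    ID = sapply(ids, function(x){x[,1]})
    meta1 = data.frame(file_name = meta$file_name, ID = ID)

2. gdc-client
The GDC provides a standard client-based mechanism for high-performance data download and submission. The GDC Data Transfer Tool client offers a command-line interface that supports both. The GDC's in-browser download is only suitable for small datasets; for larger volumes of TCGA data, the official client tool gdc-client is needed.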
Website: https://gdc.cancer.gov/access-data/gdc-data-transfer-tool
The tool is available for Windows and Ubuntu; here we use the Ubuntu version to download the data.
Use conda to see which versions are available

    (base) root@DESKTOP-UHMHH87:~# conda search gdc-client
    Loading channels: done
    # Name        Version  Build           Channel
    gdc-client    1.3.0    py27_0          anaconda/cloud/bioconda
    gdc-client    1.3.0    py27_1          anaconda/cloud/bioconda
    gdc-client    1.3.0    py27h1341992_3  anaconda/cloud/bioconda
    gdc-client    1.4.0    pyh1341992_0    anaconda/cloud/bioconda
    gdc-client    1.5.0    py_0            anaconda/cloud/bioconda
    gdc-client    1.6.0    py_0            anaconda/cloud/bioconda

Create a Python 2.7 environment with conda and activate it

    (base) root@DESKTOP-UHMHH87:~# conda create -n gdc python=2.7
    (base) root@DESKTOP-UHMHH87:~# conda activate gdc

Install gdc-client with conda

    (gdc) root@DESKTOP-UHMHH87:~# conda install gdc-client

Look at the help text
    (gdc) root@DESKTOP-UHMHH87:/mnt/h/tcga# gdc-client download -h
    Readline features including tab completion have been disabled since no
    supported version of readline was found. To resolve this, install
    pyreadline on Windows or gnureadline on Mac.

    usage: gdc-client download [-h] [--debug] [--log-file LOG_FILE] [-t TOKEN_FILE]
                               [-d DIR] [-s server] [--no-segment-md5sums]
                               [--no-file-md5sum] [-n N_PROCESSES]
                               [--http-chunk-size HTTP_CHUNK_SIZE]
                               [--save-interval SAVE_INTERVAL] [--no-verify]
                               [--no-related-files] [--no-annotations]
                               [--no-auto-retry] [--retry-amount RETRY_AMOUNT]
                               [--wait-time WAIT_TIME] [-u] [-m MANIFEST]
                               [file_id [file_id ...]]

    positional arguments:
      file_id               The GDC UUID of the file(s) to download

    optional arguments:
      -h, --help            show this help message and exit
      --debug               Enable debug logging. If a failure occurs, the program
                            will stop.
      --log-file LOG_FILE   Save logs to file. Amount logged affected by --debug
      -t TOKEN_FILE, --token-file TOKEN_FILE
                            GDC API auth token file
      -d DIR, --dir DIR     Directory to download files to. Defaults to current dir
      -s server, --server server
                            The TCP server address server[:port]
      --no-segment-md5sums  Do not calculate inbound segment md5sums and/or do not
                            verify md5sums on restart
      --no-file-md5sum      Do not verify file md5sum after download
      -n N_PROCESSES, --n-processes N_PROCESSES
                            Number of client connections.
      --http-chunk-size HTTP_CHUNK_SIZE
                            Size in bytes of standard HTTP block size.
      --save-interval SAVE_INTERVAL
                            The number of chunks after which to flush state file.
                            A lower save interval will result in more frequent
                            printout but lower performance.
      --no-verify           Perform insecure SSL connection and transfer
      --no-related-files    Do not download related files.
      --no-annotations      Do not download annotations.
      --no-auto-retry       Ask before retrying to download a file
      --retry-amount RETRY_AMOUNT
                            Number of times to retry a download
      --wait-time WAIT_TIME
                            Amount of seconds to wait before retrying
      -u, --udt             Use the UDT protocol.
      -m MANIFEST, --manifest MANIFEST
                            GDC download manifest file

Download the manifest file from the GDC website
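https://portal.gdc.cancer.gov/
Click Repository
Under Experimental Strategy, select RNA-Seq
For the data type, select STAR - Counts
For the file type, select txt
Click Cases
Under Project, select TCGA-STAD
Click the cart (Add All Files to Cart)
Finally, click Manifest to download it
Save the file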
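Before starting a long download, the saved manifest can be inspected from R. The sketch below assumes the manifest is a tab-separated table with the columns `id`, `filename`, `md5`, `size` and `state`; it builds a toy manifest with made-up values in a temporary file so that it runs on its own.

```r
# Toy manifest in the assumed GDC layout; the IDs, names and checksums are made up.
manifest_path <- tempfile(fileext = ".txt")
writeLines(c(
  "id\tfilename\tmd5\tsize\tstate",
  "aaaaaaaa-0000-0000-0000-000000000001\tsample1.counts.gz\t0123456789abcdef0123456789abcdef\t250000\treleased",
  "aaaaaaaa-0000-0000-0000-000000000002\tsample2.counts.gz\tfedcba9876543210fedcba9876543210\t260000\treleased"
), manifest_path)

manifest <- read.delim(manifest_path, stringsAsFactors = FALSE)

n_files  <- nrow(manifest)
total_mb <- sum(manifest$size) / 1e6   # sizes are assumed to be in bytes

# A manifest is expected to list each file ID only once.
has_duplicates <- anyDuplicated(manifest$id) > 0

message(n_files, " files, ", round(total_mb, 2), " MB in total")
```

Pointing `manifest_path` at the real manifest file gives a quick estimate of how many files and how much data the download will involve.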
Download the data

    (gdc) root@DESKTOP-UHMHH87:/mnt/h/tcga# gdc-client download -m gdc_manifest.2021-02-23.txt
    Readline features including tab completion have been disabled since no
    supported version of readline was found. To resolve this, install
    pyreadline on Windows or gnureadline on Mac.

    100% [################################################################################################################] Time: 0:00:52 0.02 B/s
    100% [################################################################################################################] Time: 0:00:03 0.27 B/s
    100% [################################################################################################################] Time: 0:00:04 0.25 B/s
    100% [################################################################################################################] Time: 0:00:16 0.06 B/s
    ERROR: HTTPSConnectionPool(host='api.gdc.cancer.gov', port=443): Max retries exceeded with url: /data?tarfile (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f97cf5bcc10>: Failed to establish a new connection: [Errno 101] Network is unreachable',))
    ERROR: An unexpected error has occurred during normal operation of the client. Please report the following exception to GDC support <support@nci-gdc.datacommons.io>.
    ERROR: 'NoneType' object has no attribute 'status_code'
    Traceback (most recent call last):
      File "/root/miniconda3/envs/gdc/bin/gdc-client", line 99, in <module>
        if args.func(args):
      File "/root/miniconda3/envs/gdc/lib/python2.7/site-packages/gdc_client/download/parser.py", line 111, in download
        small_errors, count = client.download_small_groups(smalls)
      File "/root/miniconda3/envs/gdc/lib/python2.7/site-packages/gdc_client/download/client.py", line 234, in download_small_groups
        tarfile_name, error = self._download_tarfile(s)
      File "/root/miniconda3/envs/gdc/lib/python2.7/site-packages/gdc_client/download/client.py", line 171, in _download_tarfile
        if r.status_code == requests.codes.bad:
    AttributeError: 'NoneType' object has no attribute 'status_code'
    ERROR: Exiting

The download was interrupted partway through, although some of the files had already been downloaded. According to what I found on Google, this is a problem with the software version.
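After an interrupted run, one way to avoid starting over is to work out which manifest entries have not arrived yet and write them to a new manifest. The sketch below assumes that gdc-client stores each file as `<download folder>/<file id>/<filename>` and that the manifest has the columns `id`, `filename`, `md5`, `size` and `state`; it checks only that a file exists, not its checksum, and all IDs and names are made up.

```r
# Toy setup: a manifest of three files and a download folder where only one arrived.
work_dir <- tempfile("tcga_")
dir.create(work_dir)

manifest <- data.frame(
  id       = c("id-0001", "id-0002", "id-0003"),   # made-up file IDs
  filename = c("a.counts.gz", "b.counts.gz", "c.counts.gz"),
  md5      = c("m1", "m2", "m3"),
  size     = c(10, 20, 30),
  state    = "released",
  stringsAsFactors = FALSE
)

# Simulate one finished download: <work_dir>/<id>/<filename>
dir.create(file.path(work_dir, "id-0002"))
file.create(file.path(work_dir, "id-0002", "b.counts.gz"))

# A file counts as present when its expected path exists.
expected_paths <- file.path(work_dir, manifest$id, manifest$filename)
remaining <- manifest[!file.exists(expected_paths), ]

# Write the leftover entries in the same tab-separated layout.
resume_path <- file.path(work_dir, "gdc_manifest.remaining.txt")
write.table(remaining, resume_path, sep = "\t", quote = FALSE, row.names = FALSE)
```

The new manifest could then be passed to `gdc-client download -m` as before. The `--retry-amount` and `--wait-time` options in the help text are another way to ride out short network drops.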
    (gdc) root@DESKTOP-UHMHH87:/mnt/h/tcga# ls
    0702cb89-db23-47d2-b701-540f03aaa317  4dc628db-a690-4cd2-a204-2b7b369e4f91  e41e964e-f2e4-4654-b800-4d05993024e4
    0c0cd241-29ba-4eb7-aaaf-cb02c718de09  52fa761f-3e34-4794-ad20-dcf4f34098ee  f51fffe1-cd4d-4f69-99f8-90ddd842e952
    116322e6-7d19-41b1-abe4-a84fa489646e  56109f9f-4a89-48df-8668-84e1810710e7  fe8ef021-6483-41a6-945e-f54e3ed9f905
    155bcdbc-c430-45f3-b964-17e32d04f421  620abd9f-1f6e-4883-8084-584b0d77bb7d  gdc
    25a5c6f7-ff8c-4725-886c-6773cfda017d  7b089cd4-27e8-4e90-9c02-581b3d24f35f  gdc_download_20210223_100608.718643.tar
    371c71d1-5f6f-4f05-baeb-82b0e7d2553f  d5acd695-679a-426e-8a6e-0d4adeba91be  gdc_download_20210223_102516.910847.tar
    39a63864-45c5-4616-bc1f-5dfab1332adb  d80b7816-b2ba-417d-b20a-915c13505ec8  gdc_manifest.2021-02-23.txt
    400b638d-22ff-4dc9-b04f-37cb165d3998  e1c91beb-a0c0-46a5-85da-785196ec14c6

Try a Python 3.7 environment on a server instead

    (base) :~# conda create -n gdc python=3.7
    (base) :~# conda activate gdc

Install the tool with conda
    (gdc) $conda install gdc-client
    Collecting package metadata: done
    Solving environment: failed

    PackagesNotFoundError: The following packages are not available from current channels:

      - gdc-client

    Current channels:

      - https://repo.anaconda.com/pkgs/main/linux-64
      - https://repo.anaconda.com/pkgs/main/noarch
      - https://repo.anaconda.com/pkgs/free/linux-64
      - https://repo.anaconda.com/pkgs/free/noarch
      - https://repo.anaconda.com/pkgs/r/linux-64
      - https://repo.anaconda.com/pkgs/r/noarch

    To search for alternate channels that may provide the conda package you're
    looking for, navigate to

        https://anaconda.org

    and use the search bar at the top of the page.

The channels configured on the server turned out not to include gdc-client, so add the USTC (University of Science and Technology of China) mirrors
    conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/main/
    conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/free/
    conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/conda-forge/
    conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/msys2/
    conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/bioconda/
    conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/menpo/

    conda config --set show_channel_urls yes

Install again; this time there is no error

    (gdc) $conda install gdc-client

Check whether the installation succeeded
    (gdc) [sysu_whhu_2@wd9afb367-02b7-4190-88c9-32dd83c405cb-bf5776b49-ggwzb ~]$gdc-client download -h
    usage: gdc-client download [-h] [--debug] [--log-file LOG_FILE] [--color_off]
                               [-t TOKEN_FILE] [-d DIR] [-s server]
                               [--no-segment-md5sums] [--no-file-md5sum]
                               [-n N_PROCESSES] [--http-chunk-size HTTP_CHUNK_SIZE]
                               [--save-interval SAVE_INTERVAL] [-k]
                               [--no-related-files] [--no-annotations]
                               [--no-auto-retry] [--retry-amount RETRY_AMOUNT]
                               [--wait-time WAIT_TIME] [--latest] [--config FILE]
                               [-m MANIFEST]
                               [file_id [file_id ...]]

    positional arguments:
      file_id               The GDC UUID of the file(s) to download

    optional arguments:
      -h, --help            show this help message and exit
      --debug               Enable debug logging. If a failure occurs, the program
                            will stop.
      --log-file LOG_FILE   Save logs to file. Amount logged affected by --debug
      --color_off           Disable colored output
      -t TOKEN_FILE, --token-file TOKEN_FILE
                            GDC API auth token file
      -d DIR, --dir DIR     Directory to download files to. Defaults to current
                            directory

Download the data: everything works and the speed is acceptable

    (gdc) $gdc-client download -m gdc_manifest.2021-02-23.txt

For the Windows version, see this tutorial: https://www.jianshu.com/p/bea374ce82b3
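Once gdc-client has finished, each file sits in its own folder named after the file ID, so the count files still have to be collected into one matrix. The sketch below shows one way to do that in base R on toy data. It assumes plain two-column files (gene, count) with no header, as in the HTSeq-style files read in section 1; the folder names, file names and values are made up, and real files may be gzipped or carry extra columns.

```r
# Toy download folder: one sub-folder per file ID, each holding a two-column count file.
download_dir <- tempfile("gdc_")
dir.create(download_dir)

write_toy <- function(id, file, counts) {
  dir.create(file.path(download_dir, id))
  toy <- data.frame(gene = c("geneA", "geneB", "geneC"), count = counts)
  write.table(toy, file.path(download_dir, id, file),
              sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
}
write_toy("id-0001", "s1.counts", c(5, 0, 12))
write_toy("id-0002", "s2.counts", c(7, 3, 9))

# Find every count file below the download folder.
count_files <- list.files(download_dir, pattern = "\\.counts$",
                          recursive = TRUE, full.names = TRUE)

# Read each file into a named vector: names are genes, values are counts.
read_counts <- function(path) {
  d <- read.table(path, sep = "\t", header = FALSE,
                  col.names = c("gene", "count"), stringsAsFactors = FALSE)
  setNames(d$count, d$gene)
}
per_file <- lapply(count_files, read_counts)

# Align on the genes of the first file so rows match even if the order differs.
genes <- names(per_file[[1]])
count_matrix <- sapply(per_file, function(v) v[genes])
rownames(count_matrix) <- genes
colnames(count_matrix) <- basename(dirname(count_files))   # file ID = folder name
```

The columns are labelled with file IDs here; turning them into TCGA barcodes still needs the metadata JSON from the cart, as in section 1.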
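3. UCSC Xena
About UCSC Xena: UCSC is short for the University of California, Santa Cruz, and UCSC Xena is a cancer genomics data analysis platform developed there. The platform brings together resources from several public cancer databases, such as data from large cancer research projects like TCGA and ICGC. Besides offering each dataset for download, it also provides online analysis. Downloading is easy and requires no registration or application. Best of all, the downloaded data is already tidied, which makes it convenient to use.
Data download page: https://xenabrowser.net/datapages/
UCSC Xena's data is very comprehensive, covering almost all the data from the major databases.
All TCGA data types are open, and tidied clinical information is available as well.
Gene ID mapping files can also be downloaded.
UCSC Xena is the recommended way to analyze TCGA data.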
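One thing to check when using Xena matrices is the unit. The sketch below assumes a count matrix stored as log2(count + 1), with gene IDs in the first column and one column per sample; that unit and layout are assumptions to verify on the dataset's page, and the file contents here are made up. It reads the matrix and converts it back to raw counts, which count-based differential expression tools generally expect.

```r
# Toy matrix in the assumed Xena layout: first column gene ID, other columns samples.
xena_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "Ensembl_ID\tsampleA\tsampleB",
  "gene1\t0\t3",
  "gene2\t10\t1"
), xena_path)

# check.names = FALSE keeps sample names containing "-" unchanged.
xena <- read.delim(xena_path, row.names = 1, check.names = FALSE)

# Undo the assumed log2(count + 1) transform and round away floating-point noise.
raw_counts <- round(2^as.matrix(xena) - 1)
```

If the dataset page lists a different unit, such as log2(FPKM + 1), the back-transform gives FPKM values rather than counts.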
4. The GDC website
Go to the GDC website, filter the files in the same way as for the manifest file download, then click the cart.
URL: https://portal.gdc.cancer.gov/repository?facetTab=cases&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22stomach%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.analysis.workflow_type%22%2C%22value%22%3A%5B%22HTSeq%20-%20Counts%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_format%22%2C%22value%22%3A%5B%22txt%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.experimental_strategy%22%2C%22value%22%3A%5B%22RNA-Seq%22%5D%7D%7D%5D%7D
Click Download to start the download.
The speed turned out to be surprisingly good, many times faster than downloading with gdc-client (although a VPN is needed).