TCGA data download | Downloading TCGA data with TCGAbiolinks, gdc-client, UCSC Xena, and the official portal

2021-02-26 Bioinformation

The lazy cannot enjoy the pleasure of rest.

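"Written at Huizhong" (回中作)

The vast cold sky, its distant colors steeped in sorrow; the garrison horn wails up to the high tower.

A girl of Wu, in plaintive longing, plays the double pipes; a traveler from Yan sings mournfully, taking leave of the five lords.

A thousand li of passes and mountains, frontier grass at dusk; a single spark of beacon fire, northern clouds in autumn.

At night the frost lies heavy and the west wind rises; the Long River, soundless, freezes and does not flow.

--- Wen Tingyun (溫庭筠)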

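Now to the main topic.

TCGA (The Cancer Genome Atlas) is a project jointly supervised by the National Cancer Institute and the National Human Genome Research Institute in the United States. It applies high-throughput genomic analysis techniques to improve our understanding of cancer, and with it our ability to prevent, diagnose, and treat the disease.

As the largest cancer genomics database to date, TCGA is comprehensive not only in the number of cancer types it covers but also in its multi-omics data: gene expression, miRNA expression, copy number variation, DNA methylation, and SNP data. Compared with GEO, TCGA's biggest advantages are its rich, standardized clinical data and its large sample size for each cancer type.

TCGA data are now hosted in the GDC (Genomic Data Commons), which also hosts data from the TARGET program. Within the GDC, TCGA data can be obtained through either the GDC Data Portal or the GDC Legacy Archive. The Data Portal holds the latest data, processed under uniform standards, although some data are not yet available there; the Legacy Archive holds all the original, unprocessed data and is more complete.

The rest of this post introduces four ways to download TCGA data.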

1. TCGAbiolinks

TCGAbiolinks is an R/Bioconductor package for integrative analysis of TCGA data. It accesses the National Cancer Institute (NCI) Genomic Data Commons (GDC) through the GDC Application Programming Interface (API) to search, download, and prepare TCGA data for analysis in R.

BiocManager::install("TCGAbiolinks")
ERROR: dependency 'purrrogress' is not available for package 'TCGAbiolinks'
* removing 'D:/R/R-3.6.3/library/TCGAbiolinks'

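I tried many fixes and none of them worked; it may simply be that no purrrogress build is available for R 3.6.3.

Switch to R 4.0.1

After switching, restart R (Ctrl+Shift+F10 in RStudio)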

## Continue installing the R package
BiocManager::install("TCGAbiolinks")
library(TCGAbiolinks)

See which projects are available for download

> getGDCprojects()$id
 [1] "GENIE-MSK" "TCGA-UCEC" "TCGA-LGG" "TCGA-SARC"
 [5] "TCGA-PAAD" "TCGA-ESCA" "TCGA-PRAD" "GENIE-VICC"
 [9] "TCGA-LAML" "TCGA-KIRC" "TCGA-PCPG" "TCGA-HNSC"
[13] "GENIE-JHU" "TCGA-OV" "TCGA-GBM" "TCGA-UCS"
[17] "TCGA-MESO" "TCGA-TGCT" "TCGA-KICH" "TCGA-READ"
[21] "TCGA-UVM" "TCGA-THCA" "OHSU-CNL" "GENIE-DFCI"
[25] "GENIE-NKI" "GENIE-GRCC" "FM-AD" "GENIE-UHN"
[29] "GENIE-MDA" "TCGA-LIHC" "TCGA-THYM" "TCGA-CHOL"
[33] "TARGET-ALL-P1" "ORGANOID-PANCREATIC" "TCGA-DLBC" "TCGA-KIRP"
[37] "TCGA-BLCA" "CPTAC-2" "TARGET-ALL-P3" "TARGET-CCSK"
[41] "TARGET-NBL" "TARGET-AML" "TARGET-ALL-P2" "NCICCR-DLBCL"
[45] "CTSP-DLBCL1" "TARGET-RT" "TARGET-OS" "TCGA-BRCA"
[49] "TCGA-COAD" "TCGA-CESC" "TCGA-LUSC" "TCGA-STAD"
[53] "TCGA-SKCM" "CMI-MBC" "CMI-ASC" "TCGA-LUAD"
[57] "TARGET-WT" "TCGA-ACC" "BEATAML1.0-CRENOLANIB" "BEATAML1.0-COHORT"
[61] "VAREPOP-APOLLO" "MMRF-COMMPASS" "WCDT-MCRPC" "CGCI-BLGSP"
[65] "CGCI-HTMCP-CC" "CMI-MPC" "HCMI-CMDC" "CPTAC-3"
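The list mixes TCGA with other programs (GENIE, TARGET, CPTAC and so on). A minimal sketch of keeping only the TCGA projects, using a few IDs copied by hand from the output above rather than a live call:

```r
# A few project IDs copied from the output above (not a live API call)
project_ids <- c("GENIE-MSK", "TCGA-UCEC", "TCGA-LGG", "TARGET-NBL",
                 "TCGA-STAD", "CPTAC-3")

# Keep only the IDs that start with "TCGA-"
tcga_ids <- grep("^TCGA-", project_ids, value = TRUE)
print(tcga_ids)
```

With TCGAbiolinks loaded, the same grep() call can be applied to getGDCprojects()$id directly.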

Set the data type: download the counts data for stomach cancer (TCGA-STAD)

query <- GDCquery(project = "TCGA-STAD",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts")

Download the data with the GDC client tool

> GDCdownload(query, method = "client")
Downloading data for project TCGA-STAD
trying URL 'https:
Content type 'application/zip' length 16207515 bytes (15.5 MB)
downloaded 15.5 MB
GDCdownload will download: 105.663384 MB
Executing GDC client with the following command:
C:\Users\ly\DOCUME~1\GDC-CL~1.EXE download -m gdc_manifest.txt
100% [#######################################################################]

Download the data with the API

> GDCdownload(query, method = "api")
Downloading data for project TCGA-STAD
GDCdownload will download 407 files. A total of 105.663384 MB
Downloading as: Wed_Feb_24_11_02_20_2021.tar.gz
Downloading: 110 MB

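Yesterday the download ran at only about 10 KB/s; for some reason today it was much faster, around 1 MB/s. Probably something on the TCGA server side.

To reduce the risk of a failed download, the request can be split into chunks

GDCdownload(query, method = "api", files.per.chunk = 20)  # download in chunks to reduce the risk of failure

method: use the API (POST method) or the GDC client tool; the options are "api" and "client". The API is faster, but data may get corrupted during the download, in which case the download may need to be run again.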
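To get a feel for what files.per.chunk = 20 means for this query, here is a rough base-R sketch of splitting 407 files into groups of 20. It only illustrates the idea of chunking; it is not how TCGAbiolinks implements it internally, and the file IDs are placeholders.

```r
# Placeholder IDs standing in for the 407 files in the query
file_ids <- sprintf("file_%03d", 1:407)
files_per_chunk <- 20

# Assign each file a chunk number: 1 for the first 20 files, 2 for the next 20, ...
chunk_index <- ceiling(seq_along(file_ids) / files_per_chunk)
chunks <- split(file_ids, chunk_index)

n_chunks <- length(chunks)                # how many separate requests
last_size <- length(chunks[[n_chunks]])   # size of the final, partial chunk
cat(n_chunks, "chunks; the last one holds", last_size, "files\n")
```

If one chunk fails, only that chunk's files have to be fetched again rather than the whole archive.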

Check that the data is complete

> GDCdownload(query, method = "api")
Downloading data for project TCGA-STAD
Of the 407 files for download 407 already exist.
All samples have been already downloaded

Read and tidy the data

Method 1: use the GDCprepare function

TCGA_RNASeq <- GDCprepare(query, save = TRUE, save.filename = "TCGA_query.Rdata")

Method 2: read the files in a loop

### dir() lists the file names under a path; use it to list every count file
count_files = dir("C:/Users/ly/Documents/GDCdata/TCGA-STAD/harmonized/Transcriptome_Profiling/Gene_Expression_Quantification/", pattern = "\\.htseq\\.counts\\.gz$", recursive = T)
count_files[1]
[1] "01ebbb59-370a-439f-a2b2-3046055c43d4/14a248c3-0189-424d-8be1-8d8b39c705f0.htseq.counts.gz"
### keep only the file name (the part after the UUID directory)
count_files2 = stringr::str_split(count_files, "/", simplify = T)[,2]
count_files2[1]
[1] "14a248c3-0189-424d-8be1-8d8b39c705f0.htseq.counts.gz"
setwd("C:/Users/ly/Documents/GDCdata/TCGA-STAD/harmonized/Transcriptome_Profiling/Gene_Expression_Quantification/")

Rename the columns to the gene name and the file name

n = length(count_files)
for (i in 1:n) {
  # the raw files have no header: column 1 is the gene ID, column 2 the count
  new_data = read.table(file = count_files[i], sep = "\t", header = F)
  colnames(new_data) = c("gene.name", count_files2[i])
  # note: this overwrites the downloaded file in place, now with a header row
  write.table(new_data, count_files[i], sep = "\t", quote = F, col.names = T, row.names = F)
}

Merge the files after renaming the columns

merge_data = read.table(file = count_files[1], sep = "\t", header = T)
colnames(merge_data)
library(dplyr)
for (i in 2:n) {
  new_data = read.table(file = count_files[i], sep = "\t", header = T)
  merge_data = full_join(merge_data, new_data, by = "gene.name")
}
View(merge_data)
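The same full join can be written without an explicit loop. Below is a small base-R sketch on made-up tables, using Reduce() with merge(all = TRUE) in place of dplyr's full_join. It also drops the summary rows that HTSeq appends to each count file (names starting with "__", such as __no_feature), which otherwise end up in the merged matrix as if they were genes. The gene IDs, sample names, and values are all invented for illustration.

```r
# Toy per-sample tables standing in for the HTSeq count files (made-up values)
s1 <- data.frame(gene.name = c("ENSG01", "ENSG02", "__no_feature"),
                 sample_a = c(10, 0, 500))
s2 <- data.frame(gene.name = c("ENSG01", "ENSG02", "__no_feature"),
                 sample_b = c(7, 3, 420))
s3 <- data.frame(gene.name = c("ENSG01", "ENSG03", "__no_feature"),
                 sample_c = c(2, 9, 610))

# Full outer join of all tables on gene.name, one pair at a time
merged <- Reduce(function(x, y) merge(x, y, by = "gene.name", all = TRUE),
                 list(s1, s2, s3))

# Drop the HTSeq summary rows, whose names start with "__"
merged <- merged[!startsWith(as.character(merged$gene.name), "__"), ]
print(merged)
```

Genes missing from a sample come out as NA, just as with full_join.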

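Set the row names

library(tidyverse)
merge_data = column_to_rownames(merge_data, "gene.name")
x = query[[1]][[1]]  # per-file results table of the query
View(x)

Change the column names

# read.table(header = T) passed the file names through make.names()
# ("-" becomes "." and names starting with a digit get an "X" prefix),
# so apply the same conversion before matching
z = make.names(x$file_name)
rownames(x) = z
x1 = x[colnames(merge_data), ]
colnames(merge_data) = x1$cases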
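One thing to watch in this step: read.table(header = T) runs column names through make.names(), which not only turns "-" into "." but also prefixes an "X" to names that start with a digit, and many UUID-based file names do. A small sketch with shortened, made-up file names and barcodes shows the effect and a lookup that survives it:

```r
# Shortened, made-up file names and barcodes
file_names <- c("14a248c3-0189.htseq.counts.gz", "b7e2f1aa-33c0.htseq.counts.gz")
barcodes <- c("TCGA-XX-0001-01A", "TCGA-XX-0002-01A")

# What the column names look like after read.table(header = T)
mangled <- make.names(file_names)
print(mangled)

# Named vector: mangled file name -> barcode
lookup <- setNames(barcodes, mangled)

# Toy count table whose columns are in a different order than file_names
counts <- data.frame(a = c(1, 2), b = c(3, 4))
colnames(counts) <- mangled[c(2, 1)]
colnames(counts) <- unname(lookup[colnames(counts)])
print(colnames(counts))
```

Replacing only the hyphens would miss the first file here, because its converted name starts with "X".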

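Method 3: read with lapply

count_files = dir("C:/Users/ly/Documents/GDCdata/TCGA-STAD/harmonized/Transcriptome_Profiling/Gene_Expression_Quantification/", pattern = "\\.htseq\\.counts\\.gz$", recursive = T)
count_files[1]
[1] "01ebbb59-370a-439f-a2b2-3046055c43d4/14a248c3-0189-424d-8be1-8d8b39c705f0.htseq.counts.gz"
# read one file; the path is relative, so the working directory must be
# the parent of Gene_Expression_Quantification/
ex = function(x) {
  result <- read.table(file.path("Gene_Expression_Quantification", x), sep = "\t")
  return(result)
}
exp = lapply(count_files, ex)

# bind the per-file tables side by side (each contributes a gene column and a count column)
exp <- do.call(cbind, exp)
View(exp)

Download the cart JSON metadata file to replace the sample IDs. Click Metadata to download it.

Read the cart JSON file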

meta <- jsonlite::fromJSON("metadata.cart.2021-02-24.json")

ids <- meta$associated_entities
> class(ids)
[1] "list"
ids[[1]][,1]
[1] "TCGA-BR-8366-01A-11R-2343-13"
ID = sapply(ids, function(x){x[,1]})
meta1 = data.frame(file_name = meta$file_name, ID = ID)
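The post stops at building meta1. One way to apply it, sketched here on toy data, is to look up each column's file name with match() and swap in the barcode. The file names, barcodes, and matrix values below are invented, and expr is a hypothetical stand-in for an expression matrix whose columns are named by file.

```r
# Toy stand-in for meta1 (file names and barcodes are made up)
meta1 <- data.frame(
  file_name = c("aaa.htseq.counts.gz", "bbb.htseq.counts.gz", "ccc.htseq.counts.gz"),
  ID = c("TCGA-XX-0001-01A", "TCGA-XX-0002-11A", "TCGA-XX-0003-01A"),
  stringsAsFactors = FALSE
)

# Toy expression matrix with columns named by file, in a different order
expr <- matrix(1:6, nrow = 2,
               dimnames = list(c("ENSG01", "ENSG02"),
                               c("ccc.htseq.counts.gz", "aaa.htseq.counts.gz",
                                 "bbb.htseq.counts.gz")))

# For each column, find the row of meta1 with the same file name
idx <- match(colnames(expr), meta1$file_name)
stopifnot(!anyNA(idx))  # stop early if a file has no metadata entry
colnames(expr) <- meta1$ID[idx]
print(colnames(expr))
```

Using match() rather than assuming both tables are in the same order keeps the barcodes attached to the right columns.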

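2. gdc-client

The GDC provides a standard client-based mechanism to support high-performance data downloads and submissions. The GDC Data Transfer Tool client offers a command-line interface for both. The GDC's online download feature is only suitable for small datasets; to download a large amount of TCGA data you need the official client tool, gdc-client.

Official site: https://gdc.cancer.gov/access-data/gdc-data-transfer-tool

The tool is available for Windows and Ubuntu; here the Ubuntu version is used to download the data.

Use conda to see which versions are available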

(base) root@DESKTOP-UHMHH87:~# conda search gdc-client
Loading channels: done
# Name        Version  Build           Channel
gdc-client    1.3.0    py27_0          anaconda/cloud/bioconda
gdc-client    1.3.0    py27_1          anaconda/cloud/bioconda
gdc-client    1.3.0    py27h1341992_3  anaconda/cloud/bioconda
gdc-client    1.4.0    pyh1341992_0    anaconda/cloud/bioconda
gdc-client    1.5.0    py_0            anaconda/cloud/bioconda
gdc-client    1.6.0    py_0            anaconda/cloud/bioconda

Create a Python 2.7 environment with conda and activate it

(base) root@DESKTOP-UHMHH87:~# conda create -n gdc python=2.7
(base) root@DESKTOP-UHMHH87:~# conda activate gdc

Install gdc-client with conda

(gdc) root@DESKTOP-UHMHH87:~# conda install gdc-client

Check the help text

(gdc) root@DESKTOP-UHMHH87:/mnt/h/tcga# gdc-client download -h
Readline features including tab completion have been disabled since no
supported version of readline was found. To resolve this, install
pyreadline on Windows or gnureadline on Mac.

usage: gdc-client download [-h] [--debug] [--log-file LOG_FILE] [-t TOKEN_FILE]
                           [-d DIR] [-s server] [--no-segment-md5sums]
                           [--no-file-md5sum] [-n N_PROCESSES]
                           [--http-chunk-size HTTP_CHUNK_SIZE]
                           [--save-interval SAVE_INTERVAL] [--no-verify]
                           [--no-related-files] [--no-annotations]
                           [--no-auto-retry] [--retry-amount RETRY_AMOUNT]
                           [--wait-time WAIT_TIME] [-u] [-m MANIFEST]
                           [file_id [file_id ...]]

positional arguments:
  file_id               The GDC UUID of the file(s) to download

optional arguments:
  -h, --help            show this help message and exit
  --debug               Enable debug logging. If a failure occurs, the program will stop.
  --log-file LOG_FILE   Save logs to file. Amount logged affected by --debug
  -t TOKEN_FILE, --token-file TOKEN_FILE
                        GDC API auth token file
  -d DIR, --dir DIR     Directory to download files to. Defaults to current dir
  -s server, --server server
                        The TCP server address server[:port]
  --no-segment-md5sums  Do not calculate inbound segment md5sums and/or do not verify md5sums on restart
  --no-file-md5sum      Do not verify file md5sum after download
  -n N_PROCESSES, --n-processes N_PROCESSES
                        Number of client connections.
  --http-chunk-size HTTP_CHUNK_SIZE
                        Size in bytes of standard HTTP block size.
  --save-interval SAVE_INTERVAL
                        The number of chunks after which to flush state file. A lower save interval will result in more frequent printout but lower performance.
  --no-verify           Perform insecure SSL connection and transfer
  --no-related-files    Do not download related files.
  --no-annotations      Do not download annotations.
  --no-auto-retry       Ask before retrying to download a file
  --retry-amount RETRY_AMOUNT
                        Number of times to retry a download
  --wait-time WAIT_TIME
                        Amount of seconds to wait before retrying
  -u, --udt             Use the UDT protocol.
  -m MANIFEST, --manifest MANIFEST
                        GDC download manifest file

Download the manifest file from the TCGA (GDC) portal

https://portal.gdc.cancer.gov/

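Click Repository

Under Experimental Strategy, select RNA-Seq

For the workflow type, select STAR - Counts

For the data format, select txt

Click the Cases tab

Under Project, select TCGA-STAD

Click the cart button (Add All Files to Cart)

Finally, click Manifest to download it

Save the file

Download the data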
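Before handing the manifest to gdc-client it can be worth a quick look. The manifest is a tab-separated text file with the columns id, filename, md5, size, and state. The sketch below writes a two-row toy manifest to a temporary file (every value in it is made up) and then reads it back to count the files and total their size.

```r
# Write a two-row toy manifest to a temporary file (all values are made up)
manifest_path <- tempfile(fileext = ".txt")
writeLines(c(
  "id\tfilename\tmd5\tsize\tstate",
  "11111111-aaaa\tfile_one.counts.gz\td41d8cd9\t250000\treleased",
  "22222222-bbbb\tfile_two.counts.gz\tc9e107d9\t260000\treleased"
), manifest_path)

# Read it back as a data frame, one row per file
manifest <- read.delim(manifest_path, stringsAsFactors = FALSE)
n_files <- nrow(manifest)
total_mb <- sum(manifest$size) / 1e6
cat(n_files, "files,", total_mb, "MB in total\n")
```

Pointing read.delim() at the real gdc_manifest file gives the same summary for the actual download.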

(gdc) root@DESKTOP-UHMHH87:/mnt/h/tcga# gdc-client download -m gdc_manifest.2021-02-23.txt
Readline features including tab completion have been disabled since no
supported version of readline was found. To resolve this, install
pyreadline on Windows or gnureadline on Mac.

100% [################################################################################################################] Time: 0:00:52 0.02 B/s
100% [################################################################################################################] Time: 0:00:03 0.27 B/s
100% [################################################################################################################] Time: 0:00:04 0.25 B/s
100% [################################################################################################################] Time: 0:00:16 0.06 B/s
ERROR: HTTPSConnectionPool(host='api.gdc.cancer.gov', port=443): Max retries exceeded with url: /data?tarfile (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f97cf5bcc10>: Failed to establish a new connection: [Errno 101] Network is unreachable',))
ERROR: An unexpected error has occurred during normal operation of the client. Please report the following exception to GDC support <support@nci-gdc.datacommons.io>.
ERROR: 'NoneType' object has no attribute 'status_code'
Traceback (most recent call last):
  File "/root/miniconda3/envs/gdc/bin/gdc-client", line 99, in <module>
    if args.func(args):
  File "/root/miniconda3/envs/gdc/lib/python2.7/site-packages/gdc_client/download/parser.py", line 111, in download
    small_errors, count = client.download_small_groups(smalls)
  File "/root/miniconda3/envs/gdc/lib/python2.7/site-packages/gdc_client/download/client.py", line 234, in download_small_groups
    tarfile_name, error = self._download_tarfile(s)
  File "/root/miniconda3/envs/gdc/lib/python2.7/site-packages/gdc_client/download/client.py", line 171, in _download_tarfile
    if r.status_code == requests.codes.bad:
AttributeError: 'NoneType' object has no attribute 'status_code'
ERROR: Exiting

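The download was interrupted partway, but some of the files had already been downloaded. According to what I found on Google, it is a software version problem.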
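Since gdc-client places each file in a directory named after its UUID, an interrupted run can be followed up with a smaller manifest listing only what is still missing. The sketch below does this on a toy setup; the UUIDs, file names, and the gdc_manifest.retry.txt name are all hypothetical. It is only a rough check, because a directory may exist for a file that was only partly downloaded.

```r
# Toy setup: a download directory in which two of three files already arrived
download_dir <- tempfile("tcga_")
dir.create(download_dir)
manifest <- data.frame(
  id = c("uuid-a", "uuid-b", "uuid-c"),
  filename = c("a.counts.gz", "b.counts.gz", "c.counts.gz"),
  md5 = c("m1", "m2", "m3"),
  size = c(1, 2, 3),
  state = "released",
  stringsAsFactors = FALSE
)
dir.create(file.path(download_dir, "uuid-a"))
dir.create(file.path(download_dir, "uuid-c"))

# Files whose UUID directory does not exist yet still need downloading
done <- manifest$id %in% list.files(download_dir)
remaining <- manifest[!done, ]

# Write the leftover entries as a new tab-separated manifest
retry_path <- file.path(download_dir, "gdc_manifest.retry.txt")
write.table(remaining, retry_path, sep = "\t", quote = FALSE, row.names = FALSE)
print(remaining$id)
```

The retry manifest can then be passed to gdc-client download -m in place of the original one.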

(gdc) root@DESKTOP-UHMHH87:/mnt/h/tcga# ls
0702cb89-db23-47d2-b701-540f03aaa317  4dc628db-a690-4cd2-a204-2b7b369e4f91  e41e964e-f2e4-4654-b800-4d05993024e4
0c0cd241-29ba-4eb7-aaaf-cb02c718de09  52fa761f-3e34-4794-ad20-dcf4f34098ee  f51fffe1-cd4d-4f69-99f8-90ddd842e952
116322e6-7d19-41b1-abe4-a84fa489646e  56109f9f-4a89-48df-8668-84e1810710e7  fe8ef021-6483-41a6-945e-f54e3ed9f905
155bcdbc-c430-45f3-b964-17e32d04f421  620abd9f-1f6e-4883-8084-584b0d77bb7d  gdc
25a5c6f7-ff8c-4725-886c-6773cfda017d  7b089cd4-27e8-4e90-9c02-581b3d24f35f  gdc_download_20210223_100608.718643.tar
371c71d1-5f6f-4f05-baeb-82b0e7d2553f  d5acd695-679a-426e-8a6e-0d4adeba91be  gdc_download_20210223_102516.910847.tar
39a63864-45c5-4616-bc1f-5dfab1332adb  d80b7816-b2ba-417d-b20a-915c13505ec8  gdc_manifest.2021-02-23.txt
400b638d-22ff-4dc9-b04f-37cb165d3998  e1c91beb-a0c0-46a5-85da-785196ec14c6

Try a Python 3.7 environment on the server

(base) :~# conda create -n gdc python=3.7
(base) :~# conda activate gdc

Install the software with conda

(gdc) $conda install gdc-client
Collecting package metadata: done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - gdc-client

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/free/linux-64
  - https://repo.anaconda.com/pkgs/free/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

The server's channel configuration turned out to be wrong, so add the USTC (University of Science and Technology of China) mirrors

conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/conda-forge/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/msys2/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/bioconda/
conda config --add channels https://mirrors.ustc.edu.cn/anaconda/cloud/menpo/
conda config --set show_channel_urls yes

Install again; this time there is no error

(gdc) $conda install gdc-client

Check whether the installation succeeded

(gdc) [sysu_whhu_2@wd9afb367-02b7-4190-88c9-32dd83c405cb-bf5776b49-ggwzb ~]$gdc-client download -h
usage: gdc-client download [-h] [--debug] [--log-file LOG_FILE] [--color_off]
                           [-t TOKEN_FILE] [-d DIR] [-s server]
                           [--no-segment-md5sums] [--no-file-md5sum]
                           [-n N_PROCESSES] [--http-chunk-size HTTP_CHUNK_SIZE]
                           [--save-interval SAVE_INTERVAL] [-k]
                           [--no-related-files] [--no-annotations]
                           [--no-auto-retry] [--retry-amount RETRY_AMOUNT]
                           [--wait-time WAIT_TIME] [--latest] [--config FILE]
                           [-m MANIFEST] [file_id [file_id ...]]

positional arguments:
  file_id               The GDC UUID of the file(s) to download

optional arguments:
  -h, --help            show this help message and exit
  --debug               Enable debug logging. If a failure occurs, the program will stop.
  --log-file LOG_FILE   Save logs to file. Amount logged affected by --debug
  --color_off           Disable colored output
  -t TOKEN_FILE, --token-file TOKEN_FILE
                        GDC API auth token file
  -d DIR, --dir DIR     Directory to download files to. Defaults to current directory

Download the data: everything works and the speed is acceptable

(gdc) $gdc-client download -m gdc_manifest.2021-02-23.txt
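The help text above lists a few options that are handy on a flaky connection: -d for the output directory, -n for the number of connections, and --retry-amount / --wait-time for retries. As a sketch, the command line can be assembled in R before running it; the directory name and the numeric values here are illustrative choices, not recommendations from the GDC.

```r
# Build (but do not run) a gdc-client command line.
# Flag names come from the help text above; the values are illustrative.
manifest_file <- "gdc_manifest.2021-02-23.txt"
args <- c("download",
          "-m", manifest_file,
          "-d", "tcga_stad_counts",   # hypothetical output directory
          "-n", "4",                  # number of client connections
          "--retry-amount", "5",      # retries per file
          "--wait-time", "10")        # seconds to wait before retrying
cmd <- paste("gdc-client", paste(args, collapse = " "))
cat(cmd, "\n")

# To actually run it from R: system2("gdc-client", args)
```

Keeping the arguments in a vector makes it easy to rerun the same download with a different manifest.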

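For the Windows version, see this tutorial: https://www.jianshu.com/p/bea374ce82b3

3. UCSC Xena

About UCSC Xena: UCSC is short for the University of California, Santa Cruz, and UCSC Xena is a cancer genomics data analysis platform developed there. The platform brings together resources from several public cancer databases, such as data from large cancer research projects like TCGA and ICGC. Besides letting you download each dataset, it also offers online analysis. Downloading is very convenient and requires no registration or application. Best of all, the downloaded data are already tidied, which makes them easy to use.

Data download page: https://xenabrowser.net/datapages/

UCSC Xena's data coverage is very broad, spanning almost all the data from the major databases.

All TCGA data types are openly available, along with tidied clinical information

Gene ID conversion tables can be downloaded as well

I recommend using UCSC Xena for analyzing TCGA data.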
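One thing to check when using Xena expression matrices is the unit shown on the dataset page. Count matrices there are often stored as log2(count + 1) rather than raw counts, so tools that expect integer counts need the values converted back first. The sketch below assumes that unit and uses a made-up matrix; confirm the unit on the dataset page before applying it.

```r
# Toy matrix in log2(count + 1) units (made-up values, genes x samples)
log_expr <- matrix(c(0, 1, 3, 10), nrow = 2,
                   dimnames = list(c("ENSG01", "ENSG02"),
                                   c("TCGA-XX-0001-01A", "TCGA-XX-0002-01A")))

# Invert the transform: count = 2^x - 1, rounded to the nearest integer
counts <- round(2^log_expr - 1)
print(counts)
```

If the dataset page lists a different unit, this conversion does not apply.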

4. Official website

Go to the official site, apply the same filters as for the manifest download, then click the cart

URL: https://portal.gdc.cancer.gov/repository?facetTab=cases&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22stomach%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.analysis.workflow_type%22%2C%22value%22%3A%5B%22HTSeq%20-%20Counts%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_format%22%2C%22value%22%3A%5B%22txt%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.experimental_strategy%22%2C%22value%22%3A%5B%22RNA-Seq%22%5D%7D%7D%5D%7D

Click Download

The speed turned out to be surprisingly good! Many times faster than downloading with the gdc software (you do need a VPN, though)
