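Experiments are often run in mouse models, yet we frequently want to look at the corresponding human homologs. In such cases there is no need to find them by sequence alignment, which is tedious. Using public data is usually more efficient.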
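1. Finding homologous genes across species with the NCBI HomoloGene database

The NCBI HomoloGene database collects homologous gene data for a subset of the species whose genomes have been fully sequenced. It currently contains 21 species and 44,233 homology groups.

HomoloGene data are openly available: the file homologene.data on the FTP site stores the mapping between homologous genes. HID is the HomoloGene group id.

HID  Taxonomy ID  Gene ID  Gene Symbol  Protein gi  Protein accession
3    9606         34       ACADM        160961497   NP_001104286.1
3    9598         469356   ACADM        109008502   XP_001101274.1
3    10090        11364    Acadm        6680618     NP_031408.1

Each species has a corresponding Taxonomy ID: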
10090 Mus musculus
10116 Rattus norvegicus
28985 Kluyveromyces lactis
318829 Magnaporthe oryzae
33169 Eremothecium gossypii
3702 Arabidopsis thaliana
4530 Oryza sativa
4896 Schizosaccharomyces pombe
4932 Saccharomyces cerevisiae
5141 Neurospora crassa
6239 Caenorhabditis elegans
7165 Anopheles gambiae
7227 Drosophila melanogaster
7955 Danio rerio
8364 Xenopus (Silurana) tropicalis
9031 Gallus gallus
9544 Macaca mulatta
9598 Pan troglodytes
9606 Homo sapiens
9615 Canis lupus familiaris
9913 Bos taurus

A single gene can be searched for directly in HomoloGene, for example Acadm.
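
A single-gene lookup like this can also be done offline against homologene.data. The following is a minimal sketch in base R, assuming the file is tab-separated with the six columns listed earlier and no header line. The helper find_homolog() and the column labels are hypothetical names introduced here, and only the three ACADM rows shown earlier are used as input.

```r
# Three rows of homologene.data as listed earlier (assumed tab-separated, no header).
rows <- c(
  "3\t9606\t34\tACADM\t160961497\tNP_001104286.1",
  "3\t9598\t469356\tACADM\t109008502\tXP_001101274.1",
  "3\t10090\t11364\tAcadm\t6680618\tNP_031408.1"
)

# Column labels are chosen here for readability; they are not stored in the file.
hg <- read.delim(
  text = rows, header = FALSE, stringsAsFactors = FALSE,
  col.names = c("HID", "Taxonomy", "Gene.ID", "Gene.Symbol",
                "Protein.gi", "Protein.accession")
)

# Hypothetical helper: find the homologs of one gene symbol in another species
# by matching on the HomoloGene group id (HID).
find_homolog <- function(hg, symbol, in_tax, out_tax) {
  hid <- hg$HID[hg$Taxonomy == in_tax & hg$Gene.Symbol == symbol]
  hg[hg$Taxonomy == out_tax & hg$HID %in% hid,
     c("HID", "Gene.Symbol", "Gene.ID")]
}

# Mouse (10090) Acadm to its human (9606) homolog.
human_acadm <- find_homolog(hg, "Acadm", in_tax = 10090, out_tax = 9606)
print(human_acadm)
```

To run this on the full file, replace the text = rows argument with the path to a downloaded copy of homologene.data.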
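To annotate the genes of one species in bulk with their homologs in another species, you can use the R package homologene, which uses the HomoloGene build 68 data.

homologene(genes, inTax, outTax)
genes: the list of genes whose homologs you want to find
inTax: the Taxonomy ID of the species the input genes belong to
outTax: the Taxonomy ID of the species in which to look for homologs

Example: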
library(homologene)
genelist <- c("Eno2", "Mog")
homologene(genelist, inTax = 10090, outTax = 9606)
  10090 9606 10090_ID 9606_ID
1  Eno2 ENO2    13807    2026
2   Mog  MOG    17441    4340

Check which data version homologene uses:
homologeneVersion
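[1] 68

2. Finding homologous genes across species with the InParanoid 8 database

The data that InParanoid 8 offers for download are keyed by protein ID. The InParanoid 4.1 program that was used to build InParanoid 8 is also available (InParanoid 4.1 standalone download).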
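Here we use the homology information provided by InParanoid 8 for a quick lookup.

Depending on the species you study, download the data from Downloads (8.0_current). Note that the human-mouse ortholog file InParanoid.H.sapiens-M.musculus.tgz is stored under H.sapiens/, and there is no InParanoid.M.musculus-H.sapiens.tgz under M.musculus/. The same holds for other species pairs, so look for the ortholog file in the folder of whichever species name comes first alphabetically.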
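
The folder rule above can be written down as a small helper. This is only a sketch: inparanoid_archive() is a hypothetical name, and it assumes that every species pair follows the same naming pattern as the human-mouse example and that plain alphabetical ordering of the abbreviated names matches the ordering used on the download site.

```r
# Hypothetical helper: build the relative path of a pairwise ortholog archive.
# Assumption: the archive sits in the folder of the species whose name sorts
# first, and that species also comes first in the file name.
inparanoid_archive <- function(species_a, species_b) {
  # method = "radix" sorts in the C locale, so the result does not depend on
  # the locale of the current R session.
  pair <- sort(c(species_a, species_b), method = "radix")
  file.path(pair[1], paste0("InParanoid.", pair[1], "-", pair[2], ".tgz"))
}

# The argument order does not matter.
archive <- inparanoid_archive("M.musculus", "H.sapiens")
print(archive)
```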
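After downloading, extract InParanoid.H.sapiens-M.musculus.tgz.

Here we use the tab-separated file sqltable.H.sapiens-M.musculus from the archive.

Its layout is similar to that of homologene.data from NCBI HomoloGene above, so the R code for processing it is modeled on the code of the homologene package.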
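
To make the parallel with homologene.data concrete, here is a minimal base R sketch on a toy table. The six-column layout (group id, score, species, inparalog score, protein ID, bootstrap) is an assumption about the sqltable file, and every value in the two rows is a placeholder except the two protein IDs, which are the ones queried later in this post. That the two IDs share a group is likewise assumed for illustration.

```r
# Made-up rows in the assumed sqltable layout; only the protein IDs are real.
rows <- c(
  "1\t100\tH.sapiens\t1.000\tQ8WZ42\t100%",
  "1\t100\tM.musculus\t1.000\tA2ASS6\t100%"
)
tbl <- read.delim(
  text = rows, header = FALSE, stringsAsFactors = FALSE,
  col.names = c("Group", "score", "species", "num", "gene", "Bootstrap")
)

# Group plays the role that HID plays in homologene.data:
# proteins in the same group are treated as orthologs of each other.
query  <- tbl[tbl$species == "H.sapiens" & tbl$gene %in% "Q8WZ42",
              c("Group", "gene")]
target <- tbl[tbl$species == "M.musculus", c("Group", "gene")]

# Join on the group id; the suffixes keep the two protein ID columns apart.
pairs <- merge(query, target, by = "Group",
               suffixes = c(".H.sapiens", ".M.musculus"))
print(pairs)
```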
Code from homologene.R:

homologene = function(genes, inTax, outTax){
  genes <- unique(genes) #remove duplicates
  out = homologene::homologeneData %>%
    dplyr::filter(Taxonomy %in% inTax & (Gene.Symbol %in% genes | Gene.ID %in% genes)) %>%
    dplyr::select(HID,Gene.Symbol,Gene.ID)
  names(out)[2] = inTax
  names(out)[3] = paste0(inTax,'_ID')
  out2 = homologene::homologeneData %>% dplyr::filter(Taxonomy %in% outTax & HID %in% out$HID) %>%
    dplyr::select(HID,Gene.Symbol,Gene.ID)
  names(out2)[2] = outTax
  names(out2)[3] = paste0(outTax,'_ID')
  output = merge(out,out2) %>% dplyr::select(2,4,3,5)
  # preserve order with temporary column
  output$sortBy <- factor(output[,1], levels = genes)
  output <- dplyr::arrange(output, sortBy)
  output$sortBy <- NULL
  return(output)
}

The function modeled on it, InParanoid_homo():
library(dplyr) # provides %>% and filter()

InParanoid_homo = function(genes, database){
  colnames(database) <- c("Group","score","species","num","gene","Bootstrap")
  genes <- unique(genes) # remove duplicates
  # the two species present in the table, in order of first appearance
  species_names <- unique(as.character(database$species))
  Species_name1 <- species_names[1]
  Species_name2 <- species_names[2]
  Species_1 <- database %>% dplyr::filter(species %in% Species_name1)
  Species_2 <- database %>% dplyr::filter(species %in% Species_name2)
  # ask for the direction; any answer R cannot read as TRUE counts as "no"
  ask <- function(from, to){
    isTRUE(as.logical(readline(paste("Transfer",from,"to",to,"?","True/False: "))))
  }
  if(ask(Species_name1, Species_name2)){
    genes_query <- Species_1 %>% dplyr::filter(gene %in% genes)
    output = merge(genes_query, Species_2, by = "Group")
  }else if(ask(Species_name2, Species_name1)){
    genes_query <- Species_2 %>% dplyr::filter(gene %in% genes)
    output = merge(genes_query, Species_1, by = "Group")
  }else{
    cat("Nothing for you.\n")
    output = NULL # no direction chosen, so there is nothing to return
  }
  return(output)
}

Example call, made after the function has been defined:

Hs.Mm <- read.table("sqltable.H.sapiens-M.musculus", sep = "\t", fill = TRUE)
genes <- c("Q8WZ42","A2ASS6")
trans <- InParanoid_homo(genes, Hs.Mm)

References:
InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic
homologene reference manual