今天介紹一個同門師兄開發的 Python 模塊:pyfastx,用於快速隨機訪問基因組序列文件。作品發表在生信頂刊上,必須強行安利一波。
師兄任職於成都大學,專注於生物信息學研究,是真正的Too maker。
Pyfastx: a robust python module for fast random access to sequences from plain and gzipped FASTA/Q file
文章地址:https://doi.org/10.1093/bib/bbaa368
代碼地址:https://github.com/lmdu/pyfastx
模塊的使用文檔非常詳細,感興趣的朋友可以直接到 pyfastx 官網查看使用說明。
本文僅作拋磚引玉,首先我們來看一下 pyfastx 的特點。
功能很多,覆蓋了平時序列文件操作的常見需求。Pyfastx 內部含有多個功能模塊,比如:
FASTX 接口,為迭代 Fasta/q 文件提供統一的接口FASTA 接口,迭代或隨機訪問 Fasta 文件FASTQ 接口 ,迭代或隨機訪問 Fastq 文件Sequence 類,提供 Fasta 記錄的常用操作安裝目前,pyfastx 支持 Python 3.5 以上的版本,通過pip即可安裝。
pip install pyfastx
FASTX 模塊FASTA 文件迭代迭代 Fasta 文件時,返回一個元組(name, seq, comment),其中 comment 是標題欄第一個空格後面的內容。
>>> fa = pyfastx.Fastx('tests/data/test.fa.gz')
>>> for name,seq,comment in fa:
>>> print(name)
>>> print(seq)
>>> print(comment)
>>> #always output uppercase sequence
>>> for item in pyfastx.Fastx('tests/data/test.fa', uppercase=True):
>>> print(item)
>>> #Manually specify sequence format
>>> for item in pyfastx.Fastx('tests/data/test.fa', format="fasta"):
>>> print(item)
FASTQ 文件迭代迭代 Fastq 文件時,返回一個元組(name, seq, qual, comment),其中 comment 是標題欄第一個空格後面的內容。
>>> fq = pyfastx.Fastx('tests/data/test.fq.gz')
>>> for name,seq,qual,comment in fq:
>>> print(name)
>>> print(seq)
>>> print(qual)
>>> print(comment)
FASTA 模塊讀取 Fasta 文件,並且支持隨機訪問其中的任意序列。
這裡要說明一下順序迭代和隨機讀取的區別。順序迭代顧名思義就是從一個文件的開始逐條記錄往後讀,直至最後一條記錄。
隨機讀取就是能夠直接訪問指定的序列,不需要從頭讀到尾。怎麼實現呢?一般是先要為序列建立索引,pyfastx 也不例外。
FASTA 文件讀取>>> import pyfastx
>>> fa = pyfastx.Fasta('test/data/test.fa.gz')
>>> fa
<Fasta> test/data/test.fa.gz contains 211 seqs
FASTA 文件迭代Fasta 文件中每條序列最重要的就是名稱和序列信息了,這兩個信息可以方便地通過迭代返回。
>>> import pyfastx
>>> for name, seq in pyfastx.Fasta('test/data/test.fa.gz', build_index=False):
>>> print(name, seq)此外也可以直接返回 FASTA 對象。
>>> import pyfastx
>>> for seq in pyfastx.Fasta('test/data/test.fa.gz'):
>>> print(seq.name)
>>> print(seq.seq)
>>> print(seq.description)
FASTA 類FASTA 對象有許多屬性和方法可供使用,如計算 GC 含量、計算 N50/L50、提取任意序列等。
以提取指定序列為例,FASTA 不僅可以提取指定序列,還可以指定序列的某一區間。
>>> # get subsequence with start and end position
>>> interval = (1, 10)
>>> fa.fetch('JZ822577.1', interval)
'CTCTAGAGAT'
>>> # get subsequences with a list of start and end position
>>> intervals = [(1, 10), (50, 60)]
>>> fa.fetch('JZ822577.1', intervals)
'CTCTAGAGATTTTAGTTTGAC'
>>> # get subsequences with reverse strand
>>> fa.fetch('JZ822577.1', (1, 10), strand='-')
'ATCTCTAGAG'當然,FASTA 對象還有更加花式的序列提取方式。
>>> # get sequence like a dictionary by identifier
>>> s1 = fa['JZ822577.1']
>>> s1
<Sequence> JZ822577.1 with length of 333
>>> # get sequence like a list by index
>>> s2 = fa[2]
>>> s2
<Sequence> JZ822579.1 with length of 176
>>> # get last sequence
>>> s3 = fa[-1]
>>> s3
<Sequence> JZ840318.1 with length of 134
>>> # check a sequence name weather in FASTA file
>>> 'JZ822577.1' in fa
True
Sequence 類FASTA 文件的記錄,被封裝成 Sequence 類,為操作 FASTA 文件的記錄提供接口。例如,你可以對它進行切片。
>>> # get a sub seq from sequence
>>> s = fa[-1]
>>> ss = s[10:30]
>>> ss
<Sequence> JZ840318.1 from 11 to 30
>>> ss.name
'JZ840318.1:11-30'
>>> ss.seq
'CTTCTTCCTGTGGAAAGTAA'
>>> ss = s[-10:]
>>> ss
<Sequence> JZ840318.1 from 125 to 134
>>> ss.name
'JZ840318.1:125-134'
>>> ss.seq
'CCATGTTGGT'或者對 Sequence 進行反向、互補或反向互補。
>>> # get sliced sequence
>>> fa[0][10:20].seq
'GTCAATTTCC'
>>> # get reverse of sliced sequence
>>> fa[0][10:20].reverse
'CCTTTAACTG'
>>> # get complement of sliced sequence
>>> fa[0][10:20].complement
'CAGTTAAAGG'
>>> # get reversed complement sequence, corresponding to sequence in antisense strand
>>> fa[0][10:20].antisense
'GGAAATTGAC'
FASTQ 模塊FASTQ 文件讀取讀取 Fastq 文件,並支持隨機訪問,前提是先要構建索引。
>>> import pyfastx
>>> fq = pyfastx.Fastq('tests/data/test.fq.gz')
>>> fq
<Fastq> tests/data/test.fq.gz contains 100 reads
FASTQ 文件迭代Fastq 每一條記錄有 4 行,其中 comment 通常總為+號,因此有價值的是name, seq, qual三項信息。
>>> import pyfastx
>>> for name,seq,qual in pyfastx.Fastq('tests/data/test.fq.gz', build_index=False):
>>> print(name)
>>> print(seq)
>>> print(qual)也可以直接返回 FASTQ 對象。
>>> import pyfastx
>>> for read in pyfastx.Fastq('test/data/test.fq.gz'):
>>> print(read.name)
>>> print(read.seq)
>>> print(read.qual)
>>> print(read.quali)
FASTQ 類FASTQ 類封裝了 Fastq 文件,為 Fastq 文件常用操作提供方便的接口。
>>> # get read counts in FASTQ
>>> len(fq)
800
>>> # get total bases
>>> fq.size
120000
>>> # get GC content of FASTQ file
>>> fq.gc_content
66.17471313476562
>>> # get composition of bases in FASTQ
>>> fq.composition
{'A': 20501, 'C': 39705, 'G': 39704, 'T': 20089, 'N': 1}
>>> # New in pyfastx 0.6.10
>>> # get average length of reads
>>> fq.avglen
150.0
>>> # get maximum lenth of reads
>>> fq.maxlen
150
>>> # get minimum length of reas
>>> fq.minlen
150
>>> # get maximum quality score
>>> fq.maxqual
70
>>> # get minimum quality score
>>> fq.minqual
35
>>> # get phred which affects the quality score conversion
>>> fq.phred
33
>>> # Guess fastq quality encoding system
>>> # New in pyfastx 0.4.1
>>> fq.encoding_type
['Sanger Phred+33', 'Illumina 1.8+ Phred+33']
Read 類FASTQ 文件的每一條記錄是一個 read,Read 類為操作 Fastq 記錄提供了接口。
獲取 Read 對象
>>> #get read like a dict by read name
>>> r1 = fq['A00129:183:H77K2DMXX:1:1101:4752:1047']
>>> r1
<Read> A00129:183:H77K2DMXX:1:1101:4752:1047 with length of 150
>>> # get read like a list by index
>>> r2 = fq[10]
>>> r2
<Read> A00129:183:H77K2DMXX:1:1101:18041:1078 with length of 150
>>> # get the last read
>>> r3 = fq[-1]
>>> r3
<Read> A00129:183:H77K2DMXX:1:1101:31575:4726 with length of 150
>>> # check a read weather in FASTQ file
>>> 'A00129:183:H77K2DMXX:1:1101:4752:1047' in fq
True獲取 Read 對象的信息。
>>> r = fq[-10]
>>> r
<Read> A00129:183:H77K2DMXX:1:1101:1750:4711 with length of 150
>>> # get read order number in FASTQ file
>>> r.id
791
>>> # get read name
>>> r.name
'A00129:183:H77K2DMXX:1:1101:1750:4711'
>>> # get read full header line, New in pyfastx 0.6.3
>>> r.description
'@A00129:183:H77K2DMXX:1:1101:1750:4711 1:N:0:CAATGGAA+CGAGGCTG'
>>> # get read length
>>> len(r)
150
>>> # get read sequence
>>> r.seq
'CGAGGAAATCGACGTCACCGATCTGGAAGCCCTGCGCGCCCATCTCAACCAGAAATGGGGTGGCCAGCGCGGCAAGCTGACCCTGCTGCCGTTCCTGGTCCGCGCCATGGTCGTGGCGCTGCGCGACTTCCCGCAGTTGAACGCGCGCTA'
>>> # get raw string of read, New in pyfastx 0.6.3
>>> print(r.raw)
@A00129:183:H77K2DMXX:1:1101:1750:4711 1:N:0:CAATGGAA+CGAGGCTG
CGAGGAAATCGACGTCACCGATCTGGAAGCCCTGCGCGCCCATCTCAACCAGAAATGGGGTGGCCAGCGCGGCAAGCTGACCCTGCTGCCGTTCCTGGTCCGCGCCATGGTCGTGGCGCTGCGCGACTTCCCGCAGTTGAACGCGCGCTA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFF:
>>> # get read quality ascii string
>>> r.qual
'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFF:'
>>> # get read quality integer value, ascii - 33 or 64
>>> r.quali
[37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25]
>>> # get read length
>>> len(r)
150
命令行接口$ pyfastx -h
usage: pyfastx COMMAND [OPTIONS]
A command line tool for FASTA/Q file manipulation
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
Commands:
index build index for fasta/q file
stat show detailed statistics information of fasta/q file
split split fasta/q file into multiple files
fq2fa convert fastq file to fasta file
subseq get subsequences from fasta file by region
sample randomly sample sequences from fasta or fastq file
extract extract full sequences or reads from fasta/q filePyfastx 還提供了額外接口,方便在命令行下直接操作 Fasta/q 文件。好了,鑑於時間和篇幅有限,這部分大家自行探索吧。
最後,Pyfastx 雖然已經發表文章,但還在持續維護更新。希望大家多多使用,有什麼建議可以跟作者反饋。
好的工具和用戶是共同成長的,祝大家科研順利。
作者簡介:杜聯明,博士,成都大學特聘副研究員。畢業於四川大學生物信息學專業,主要從事基因組學與轉錄組學研究,生物信息學資料庫和數據分析軟體開發。相關成果發表在 Bioinformatics,Molecular Ecology Resources,Aging,Journal of Heredity,Genome Research,Molecular Biology and Evolution 等國際知名期刊上。