今天强烈安利一个基因组序列处理神器:Pyfastx。软件历经20个月、70个版本,文章发表在生信顶刊《Briefings in Bioinformatics》上,Li Heng大神亲测推荐。

作者是我的同门师兄,现在任职于成都大学,专注于生物信息学研究,是真正的Tool maker。

Pyfastx: a robust python module for fast random access to sequences from plain and gzipped FASTA/Q file

文章地址:https://doi.org/10.1093/bib/bbaa368

代码地址:https://github.com/lmdu/pyfastx

模块的使用文档非常详细,感兴趣的朋友可以直接到pyfastx官网查看使用说明。

本文仅作抛砖引玉,首先我们来看一下pyfastx的特点。

  • 一个接口同时满足FASTA/Q文件读写需求
  • 轻量级、内存节约
  • 随机访问压缩的FASTA/Q文件
  • 逐条迭代读取FASTA文件
  • 计算FASTA文件的N50和L50
  • 计算序列的GC含量和核酸组成
  • 计算反向互补序列
  • 良好的兼容性,支持分析非标准的FASTA文件
  • 支持FASTQ文件的碱基质量值转换
  • 提供命令行接口用于拆分FASTA/Q文件

功能很多,覆盖了平时序列文件操作的常见需求。Pyfastx内部含有多个功能模块,比如:

  • FASTX 接口,为迭代Fasta/q文件提供统一的接口
  • FASTA 接口,迭代或随机访问Fasta文件
  • FASTQ 接口 ,迭代或随机访问Fastq文件
  • Fasta 类,封装好的Fasta文件类
  • Fastq类,封装好的Fastq文件类
  • Sequence类,提供Fasta记录的常用操作
  • Read类,提供Fastq记录的常用操作

安装

目前,pyfastx支持Python 3.5以上的版本,通过pip即可安装。

1
pip install pyfastx

FASTX 模块

FASTA 文件迭代

迭代 Fasta 文件时,返回一个元组(name, seq, comment),其中 comment 是标题栏第一个空格后面的内容。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
>>> fa = pyfastx.Fastx('tests/data/test.fa.gz')
>>> for name,seq,comment in fa:
>>>     print(name)
>>>     print(seq)
>>>     print(comment)

>>> #always output uppercase sequence
>>> for item in pyfastx.Fastx('tests/data/test.fa', uppercase=True):
>>>     print(item)

>>> #Manually specify sequence format
>>> for item in pyfastx.Fastx('tests/data/test.fa', format="fasta"):
>>>     print(item)

FASTQ 文件迭代

迭代Fastq文件时,返回一个元组(name, seq, qual, comment),其中 comment 是标题栏第一个空格后面的内容。

1
2
3
4
5
6
>>> fq = pyfastx.Fastx('tests/data/test.fq.gz')
>>> for name,seq,qual,comment in fq:
>>>     print(name)
>>>     print(seq)
>>>     print(qual)
>>>     print(comment)

FASTA 模块

读取Fasta文件,并且支持随机访问其中的任意序列。

这里要说明一下顺序迭代和随机读取的区别。顺序迭代顾名思义就是从一个文件的开始逐条记录往后读,直至最后一条记录。

随机读取就是能够直接访问指定的序列,不需要从头读到尾。怎么实现呢?一般是先要为序列建立索引,pyfastx也不例外。

FASTA 文件读取

1
2
3
4
>>> import pyfastx
>>> fa = pyfastx.Fasta('test/data/test.fa.gz')
>>> fa
<Fasta> test/data/test.fa.gz contains 211 seqs

FASTA 文件迭代

Fasta文件中每条序列最重要的就是名称和序列信息了,这两个信息可以方便地通过迭代返回。

1
2
3
>>> import pyfastx
>>> for name, seq in pyfastx.Fasta('test/data/test.fa.gz', build_index=False):
>>>     print(name, seq)

此外也可以直接返回 FASTA 对象。

1
2
3
4
5
>>> import pyfastx
>>> for seq in pyfastx.Fasta('test/data/test.fa.gz'):
>>>     print(seq.name)
>>>     print(seq.seq)
>>>     print(seq.description)

FASTA 类

FASTA 对象有许多属性和方法可供使用,如计算 GC 含量、计算 N50/L50、提取任意序列等。

以提取指定序列为例,FASTA 不仅可以提取指定序列,还可以指定序列的某一区间。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
>>> # get subsequence with start and end position
>>> interval = (1, 10)
>>> fa.fetch('JZ822577.1', interval)
'CTCTAGAGAT'

>>> # get subsequences with a list of start and end position
>>> intervals = [(1, 10), (50, 60)]
>>> fa.fetch('JZ822577.1', intervals)
'CTCTAGAGATTTTAGTTTGAC'

>>> # get subsequences with reverse strand
>>> fa.fetch('JZ822577.1', (1, 10), strand='-')
'ATCTCTAGAG'

当然,FASTA 对象还有更加花式的序列提取方式。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
>>> # get sequence like a dictionary by identifier
>>> s1 = fa['JZ822577.1']
>>> s1
<Sequence> JZ822577.1 with length of 333

>>> # get sequence like a list by index
>>> s2 = fa[2]
>>> s2
<Sequence> JZ822579.1 with length of 176

>>> # get last sequence
>>> s3 = fa[-1]
>>> s3
<Sequence> JZ840318.1 with length of 134

>>> # check a sequence name weather in FASTA file
>>> 'JZ822577.1' in fa
True

Sequence 类

FASTA 文件的记录,被封装成 Sequence 类,为操作 FASTA 文件的记录提供接口。例如,你可以对它进行切片。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
>>> # get a sub seq from sequence
>>> s = fa[-1]
>>> ss = s[10:30]
>>> ss
<Sequence> JZ840318.1 from 11 to 30

>>> ss.name
'JZ840318.1:11-30'

>>> ss.seq
'CTTCTTCCTGTGGAAAGTAA'

>>> ss = s[-10:]
>>> ss
<Sequence> JZ840318.1 from 125 to 134

>>> ss.name
'JZ840318.1:125-134'

>>> ss.seq
'CCATGTTGGT'

或者对 Sequence 进行反向、互补或反向互补。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
>>> # get sliced sequence
>>> fa[0][10:20].seq
'GTCAATTTCC'

>>> # get reverse of sliced sequence
>>> fa[0][10:20].reverse
'CCTTTAACTG'

>>> # get complement of sliced sequence
>>> fa[0][10:20].complement
'CAGTTAAAGG'

>>> # get reversed complement sequence, corresponding to sequence in antisense strand
>>> fa[0][10:20].antisense
'GGAAATTGAC'

FASTQ 模块

FASTQ 文件读取

读取Fastq文件,并支持随机访问,前提是先要构建索引。

1
2
3
4
>>> import pyfastx
>>> fq = pyfastx.Fastq('tests/data/test.fq.gz')
>>> fq
<Fastq> tests/data/test.fq.gz contains 100 reads

FASTQ 文件迭代

Fastq每一条记录有 4 行,其中 comment 通常总为+号,因此有价值的是name, seq, qual三项信息。

1
2
3
4
5
>>> import pyfastx
>>> for name,seq,qual in pyfastx.Fastq('tests/data/test.fq.gz', build_index=False):
>>>     print(name)
>>>     print(seq)
>>>     print(qual)

也可以直接返回 FASTQ 对象。

1
2
3
4
5
6
>>> import pyfastx
>>> for read in pyfastx.Fastq('test/data/test.fq.gz'):
>>>     print(read.name)
>>>     print(read.seq)
>>>     print(read.qual)
>>>     print(read.quali)

FASTQ 类

FASTQ 类封装了 Fastq 文件,为 Fastq 文件常用操作提供方便的接口。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
>>> # get read counts in FASTQ
>>> len(fq)
800

>>> # get total bases
>>> fq.size
120000

>>> # get GC content of FASTQ file
>>> fq.gc_content
66.17471313476562

>>> # get composition of bases in FASTQ
>>> fq.composition
{'A': 20501, 'C': 39705, 'G': 39704, 'T': 20089, 'N': 1}

>>> # New in pyfastx 0.6.10
>>> # get average length of reads
>>> fq.avglen
150.0

>>> # get maximum lenth of reads
>>> fq.maxlen
150

>>> # get minimum length of reas
>>> fq.minlen
150

>>> # get maximum quality score
>>> fq.maxqual
70

>>> # get minimum quality score
>>> fq.minqual
35

>>> # get phred which affects the quality score conversion
>>> fq.phred
33

>>> # Guess fastq quality encoding system
>>> # New in pyfastx 0.4.1
>>> fq.encoding_type
['Sanger Phred+33', 'Illumina 1.8+ Phred+33']

Read 类

FASTQ 文件的每一条记录是一个 read,Read 类为操作 Fastq 记录提供了接口。

获取 Read 对象

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
>>> #get read like a dict by read name
>>> r1 = fq['A00129:183:H77K2DMXX:1:1101:4752:1047']
>>> r1
<Read> A00129:183:H77K2DMXX:1:1101:4752:1047 with length of 150

>>> # get read like a list by index
>>> r2 = fq[10]
>>> r2
<Read> A00129:183:H77K2DMXX:1:1101:18041:1078 with length of 150

>>> # get the last read
>>> r3 = fq[-1]
>>> r3
<Read> A00129:183:H77K2DMXX:1:1101:31575:4726 with length of 150

>>> # check a read weather in FASTQ file
>>> 'A00129:183:H77K2DMXX:1:1101:4752:1047' in fq
True

获取 Read 对象的信息。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
>>> r = fq[-10]
>>> r
<Read> A00129:183:H77K2DMXX:1:1101:1750:4711 with length of 150

>>> # get read order number in FASTQ file
>>> r.id
791

>>> # get read name
>>> r.name
'A00129:183:H77K2DMXX:1:1101:1750:4711'

>>> # get read full header line, New in pyfastx 0.6.3
>>> r.description
'@A00129:183:H77K2DMXX:1:1101:1750:4711 1:N:0:CAATGGAA+CGAGGCTG'

>>> # get read length
>>> len(r)
150

>>> # get read sequence
>>> r.seq
'CGAGGAAATCGACGTCACCGATCTGGAAGCCCTGCGCGCCCATCTCAACCAGAAATGGGGTGGCCAGCGCGGCAAGCTGACCCTGCTGCCGTTCCTGGTCCGCGCCATGGTCGTGGCGCTGCGCGACTTCCCGCAGTTGAACGCGCGCTA'

>>> # get raw string of read, New in pyfastx 0.6.3
>>> print(r.raw)
@A00129:183:H77K2DMXX:1:1101:1750:4711 1:N:0:CAATGGAA+CGAGGCTG
CGAGGAAATCGACGTCACCGATCTGGAAGCCCTGCGCGCCCATCTCAACCAGAAATGGGGTGGCCAGCGCGGCAAGCTGACCCTGCTGCCGTTCCTGGTCCGCGCCATGGTCGTGGCGCTGCGCGACTTCCCGCAGTTGAACGCGCGCTA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFF:

>>> # get read quality ascii string
>>> r.qual
'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFF,F:FFFFFFFFF:'

>>> # get read quality integer value, ascii - 33 or 64
>>> r.quali
[37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25]

>>> # get read length
>>> len(r)
150

命令行接口

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
$ pyfastx -h

usage: pyfastx COMMAND [OPTIONS]

A command line tool for FASTA/Q file manipulation

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit

Commands:

    index        build index for fasta/q file
    stat         show detailed statistics information of fasta/q file
    split        split fasta/q file into multiple files
    fq2fa        convert fastq file to fasta file
    subseq       get subsequences from fasta file by region
    sample       randomly sample sequences from fasta or fastq file
    extract      extract full sequences or reads from fasta/q file

Pyfastx 还提供了额外接口,方便在命令行下直接操作 Fasta/q 文件。好了,鉴于时间和篇幅有限,这部分大家自行探索吧。

最后,Pyfastx 虽然已经发表文章,但还在持续维护更新。希望大家多多使用,有什么建议可以跟作者反馈。

好的工具和用户是共同成长的,祝大家科研顺利。

作者简介:杜联明,博士,成都大学特聘副研究员。毕业于四川大学生物信息学专业,主要从事基因组学与转录组学研究,生物信息学数据库和数据分析软件开发。相关成果发表在Bioinformatics,Molecular Ecology Resources,Aging,Journal of Heredity,Genome Research,Molecular Biology and Evolution等国际知名期刊上。

欢迎关注简说基因,觉得不错,请点个“赞”吧!