[Bioperl-l] Next-gen modules

Peter biopython at maubp.freeserve.co.uk
Tue Jun 23 07:29:56 EDT 2009

On Tue, Jun 23, 2009 at 12:00 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
> We just added FASTQ parsing to EMBOSS and faced the same issues.

I was going to chat to you about this at BOSC, and suggest this be
added to EMBOSS - but you are well ahead of me ;)

> Parsing was easy - find the '@' line, read sequence until the '+' line
> is reached, then read (seqlen) quality characters ... and check the next
> line starts with '@'

That is basically what I did for Biopython.
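
For illustration, here is a minimal Python sketch of the approach described above. It is not the actual Biopython or EMBOSS code, and the function name `parse_fastq` is made up for this example. It counts quality characters rather than looking for the next '@', because '@' is itself a valid quality character and can appear at the start of a quality line.

```python
def parse_fastq(lines):
    """Yield (title, sequence, quality_string) tuples from FASTQ lines.

    Hypothetical helper for illustration only; sequence and quality
    may each be wrapped over several lines.
    """
    it = iter(lines)
    line = next(it, None)
    # Skip any blank lines before the first record.
    while line is not None and not line.strip():
        line = next(it, None)
    while line is not None:
        if not line.startswith("@"):
            raise ValueError("Expected '@' at start of record, got %r" % line)
        title = line[1:].rstrip()

        # Read sequence lines until the '+' line is reached.
        seq_parts = []
        line = next(it, None)
        while line is not None and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = next(it, None)
        if line is None:
            raise ValueError("Missing '+' line in record %r" % title)
        seq = "".join(seq_parts)

        # Read quality characters until we have as many as the sequence.
        qual_parts = []
        qual_len = 0
        line = next(it, None)
        while line is not None and qual_len < len(seq):
            chunk = line.strip()
            qual_parts.append(chunk)
            qual_len += len(chunk)
            line = next(it, None)
        if qual_len != len(seq):
            raise ValueError("Sequence and quality lengths differ in %r" % title)
        yield title, seq, "".join(qual_parts)

        # Skip blank lines between records; the loop then checks for '@'.
        while line is not None and not line.strip():
            line = next(it, None)
```

Usage would be along the lines of `for title, seq, qual in parse_fastq(open(filename)): ...`, where `filename` is whatever FASTQ file you have to hand.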

> Quality scores are kept as phred values. Phred of 0 means unknown,
> which in Solexa is -5 (0.75 error rate = could be anything).

A Phred quality of 0 means probability of error is 1, so yes, unknown. I don't
quite follow your leap that this corresponds to a Solexa quality of -5. Could
you clarify?
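
To make the numbers concrete, here is a small Python sketch of the usual definitions: Phred Q = -10*log10(p), and Solexa Q = -10*log10(p/(1-p)). The function names are my own for this example, not from any library.

```python
import math


def phred_to_error(q):
    """Error probability for a Phred score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10.0)


def solexa_to_error(q):
    """Error probability for a Solexa score, which encodes the odds p/(1-p)."""
    odds = 10 ** (-q / 10.0)
    return odds / (1.0 + odds)


def solexa_to_phred(q):
    """Convert a Solexa score to the Phred score with the same error probability."""
    return 10 * math.log10(10 ** (q / 10.0) + 1)
```

With these definitions, `phred_to_error(0)` is 1.0, `solexa_to_error(-5)` is about 0.76, and `solexa_to_phred(-5)` is about 1.19 rather than 0.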

> We assume lower quality scores are from alignments rather than single reads.

Did you mean to say "higher quality scores" (i.e. lower probability of error),
e.g. a PHRED score of 80, which you can get from MAQ doing read mapping
or something consensus based?

> We gave up on trying to guess the quality score standard and require
> users to say whether they are sanger, solexa (1.0) or Illumina (1.3)
> format files. If we only want the sequence then we don't care so we allow
> "fastq" as a sequence format and ignore the quality scores in that case.
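
As a rough sketch of why the variant has to be stated, the three encodings differ in ASCII offset and in score scale: Sanger uses offset 33 with Phred scores, Solexa (1.0) uses offset 64 with Solexa scores, and Illumina (1.3) uses offset 64 with Phred scores. The dictionary keys and function name below are placeholders for this example, not agreed format names.

```python
# (ASCII offset, score scale) per variant; the keys are illustrative labels only.
VARIANTS = {
    "sanger": (33, "phred"),
    "solexa": (64, "solexa"),
    "illumina": (64, "phred"),
}


def decode_quality(qual_string, variant):
    """Return (list of integer scores, scale name) for a quality string."""
    offset, scale = VARIANTS[variant]
    scores = [ord(char) - offset for char in qual_string]
    return scores, scale
```

The same characters decode differently depending on the variant: "II" is Phred 40, 40 under the Sanger offset but 9, 9 under an offset of 64, and both are plausible values, so the file alone may not tell you which was meant.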

What format names have you used? Ideally we'd have the same names
in EMBOSS, BioPerl and Biopython (i.e. "fastq", "fastq-solexa", and

> We also allow the integer quality score format ... is anyone still using
> that (it looks horrible to me :-)

Do you mean the QUAL file format holding PHRED scores? Roche provide
tools to turn their SFF files into FASTA and QUAL files, so they are still used.
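
For anyone who has not met them, QUAL files look like FASTA but hold space-separated integer scores instead of letters. A minimal Python sketch of reading one follows; the function name `parse_qual` is hypothetical and this is not the Biopython parser.

```python
def parse_qual(lines):
    """Yield (title, list of integer scores) from QUAL-style lines.

    Records start with a '>' title line; scores are whitespace-separated
    integers and may be wrapped over several lines.
    """
    title = None
    scores = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if title is not None:
                yield title, scores
            title = line[1:]
            scores = []
        else:
            if title is None:
                raise ValueError("Scores found before any '>' title line")
            scores.extend(int(value) for value in line.split())
    if title is not None:
        yield title, scores
```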

> Code is in the EMBOSS CVS, and will appear in release 6.1.0 on July 15th.
> Any further tips would be very useful.

Great. See you at BOSC 2009!

