[Bioperl-l] Next-gen modules
biopython at maubp.freeserve.co.uk
Mon Jun 22 10:24:55 EDT 2009
On Wed, Jun 17, 2009 at 6:06 PM, Chris Fields wrote:
> Peter wrote:
>> Other issues to keep in mind:
>> (3) There should be no warning parsing files where the optional repeated
>> title is missing on the "+" lines (as discussed earlier on the BioPerl
> Agreed, though we'll have to check the current fastq parser to see if that's
> currently the case. I thought that was fixed but maybe not?
>> (4) When writing FASTQ files should BioPerl omit the optional repeated
>> title on the "+" line? Biopython omits this, as I understand this to be
>> common practice, and it can make a big difference to file sizes, especially
>> on short read data from Solexa/Illumina.
> Agreed, particularly if it's commonly encountered.
>> (5) Also test reading and writing files with an optional description (as
>> well as an identifier) on the "@" (and "+") lines. See the NCBI SRA
>> for examples, e.g.
>> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
> Should be easy enough to implement with a simple regex.
>> (6) Test reading and writing files where the encoded quality string starts
>> with a "@" or a "+" character, e.g.
> Mark, getting all that? ;>
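To make points (3) to (6) concrete, here is a rough Python sketch of a reader that copes with all four cases. It is an illustration only, not the Biopython or BioPerl parser: the names parse_fastq and format_fastq are made up for this example, and the sample reads are invented. The key idea is to read quality lines until their total length matches the sequence length, rather than treating any line that starts with "@" as a new record.

```python
import io


def parse_fastq(handle):
    """Yield (identifier, description, sequence, quality) tuples.

    Hypothetical helper for illustration; not a library API.
    """
    while True:
        title = handle.readline()
        if not title:
            return  # end of file
        title = title.rstrip("\r\n")
        if not title:
            continue  # tolerate blank lines between records
        if not title.startswith("@"):
            raise ValueError("Expected '@' title line, got %r" % title)
        header = title[1:]
        # Point (5): identifier, then an optional description after a space.
        identifier, _, description = header.partition(" ")

        # Sequence lines run until the "+" separator line.
        seq_parts = []
        while True:
            line = handle.readline()
            if not line:
                raise ValueError("Truncated record: no '+' line")
            line = line.rstrip("\r\n")
            if line.startswith("+"):
                repeated = line[1:]
                break
            seq_parts.append(line)
        sequence = "".join(seq_parts)

        # Point (3): a missing repeated title is fine and raises no warning.
        if repeated and repeated != header:
            raise ValueError("Title on '+' line does not match '@' line")

        # Point (6): read quality by length, so a leading "@" or "+"
        # in the quality string is never mistaken for a new record.
        qual_parts = []
        qual_len = 0
        while qual_len < len(sequence):
            line = handle.readline()
            if not line:
                raise ValueError("Truncated record: quality too short")
            line = line.rstrip("\r\n")
            qual_parts.append(line)
            qual_len += len(line)
        quality = "".join(qual_parts)
        if len(quality) != len(sequence):
            raise ValueError("Sequence and quality lengths differ")
        yield identifier, description, sequence, quality


def format_fastq(identifier, description, sequence, quality):
    """Point (4): write a record with a bare "+" line (no repeated title)."""
    header = identifier + (" " + description if description else "")
    return "@%s\n%s\n+\n%s\n" % (header, sequence, quality)


# Invented sample data: one record with a description and repeated title,
# one with a bare "+" line; both quality strings start with "@" or "+".
sample = (
    "@read1 some description\nACGT\n+read1 some description\n@+II\n"
    "@read2\nACGT\n+\n+@!!\n"
)
for record in parse_fastq(io.StringIO(sample)):
    print(record)
```

Round-tripping the parsed records through format_fastq gives the smaller output form, with the repeated title dropped.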
Another couple of points that I should have remembered earlier,
related to converting between PHRED scores and Solexa scores.
On the bright side, with Illumina abandoning the Solexa scores
in pipeline 1.3+, these issues will go away with time:
(7) If BioPerl will be converting Solexa scores to/from PHRED
scores as integers automatically (as discussed earlier), make
sure you round to the nearest whole number (don't just truncate
with a call to int!). MAQ does this by adding 0.5 before calling
int (while in Biopython I just use Python's round function).
(8) When asked to write out an old Solexa style FASTQ file,
what will you do if given a standard Sanger FASTQ file (or a
new Illumina 1.3+ FASTQ file) containing a base with PHRED
quality zero? This maps to a Solexa quality of minus infinity...
Right now the development version of Biopython will throw an
error in this situation, but mapping to the lowest observed
Solexa score might be reasonable.
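
As a rough illustration of (7) and (8), here is a Python sketch using the usual conversion formulas, Q_phred = 10*log10(10^(Q_solexa/10) + 1) and Q_solexa = 10*log10(10^(Q_phred/10) - 1). The function names and the zero_maps_to argument are made up for this example and are not the Biopython API. Rounding is done MAQ-style by adding 0.5, but with floor rather than int so that negative Solexa values also round correctly.

```python
import math


def solexa_to_phred(q_solexa):
    """Exact (floating point) PHRED quality for a Solexa quality."""
    return 10 * math.log10(10 ** (q_solexa / 10.0) + 1)


def phred_to_solexa(q_phred):
    """Exact (floating point) Solexa quality for a positive PHRED quality."""
    if q_phred <= 0:
        # 10**0 - 1 == 0, and log10(0) is minus infinity.
        raise ValueError("PHRED %r has no finite Solexa equivalent" % q_phred)
    return 10 * math.log10(10 ** (q_phred / 10.0) - 1)


def round_nearest(x):
    """Round to the nearest whole number (add 0.5, then floor)."""
    return int(math.floor(x + 0.5))


def solexa_to_phred_int(q_solexa):
    """Point (7): round the converted score, do not truncate it."""
    return round_nearest(solexa_to_phred(q_solexa))


def phred_to_solexa_int(q_phred, zero_maps_to=None):
    """Point (8): PHRED 0 either raises or maps to a caller-chosen floor.

    zero_maps_to is a hypothetical option for this sketch; passing the
    lowest Solexa score seen in practice is one possible policy.
    """
    if q_phred == 0:
        if zero_maps_to is None:
            raise ValueError("PHRED 0 maps to Solexa minus infinity")
        return zero_maps_to
    return round_nearest(phred_to_solexa(q_phred))


# Solexa 1 is about PHRED 3.54: truncating gives 3, rounding gives 4.
print(int(solexa_to_phred(1)), solexa_to_phred_int(1))
# Assumption for the demo: treat -5 as the lowest Solexa score.
print(phred_to_solexa_int(0, zero_maps_to=-5))
```

Leaving zero_maps_to unset reproduces the "throw an error" behaviour, while passing a floor gives the alternative of mapping to the lowest score.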