[Bioperl-l] fastq splitter

Fields, Christopher J cjfields at illinois.edu
Wed Feb 29 09:38:50 EST 2012

On Feb 29, 2012, at 7:07 AM, Michael Muratet wrote:

> On Feb 28, 2012, at 4:01 PM, Sean O'Keeffe wrote:
>> Hi Chris,
>> Unfortunately the read pairs are not consecutive. It seems they are cat'd
>> together.
>> I could use split -l on the line number that they're glued together I guess.
>> If this is an overnight job for a bunch of files, I can wait so don't mind
>> using the module if it worked.
>> Someone pointed out I need to switch $seqin->desc to $inseq->desc.
>> However, now it spits out fasta output instead of fastq and returns a bunch
>> of warnings: Seq/Qual descriptions don't match; using sequence description
> Hi Sean
> Apparently the BioPerl parser expects the 'second' header line, i.e.,
> @first_header
> sequence
> +second_header
> quality_scores
> to repeat the same (redundant) identifier. When the second header is blank, which is how the Illumina pipeline writes it out, it warns you.
> I think you have to explicitly write out the quality scores in fastq format.
> Cheers
> Mike

Actually no, that's not true for the latest versions. The FASTQ parser was completely refactored in coordination with Peter Cock (Biopython), the other Bio* toolkits, and EMBOSS to parse a wide range of FASTQ data (including the Solexa/Illumina variants) and to catch bad formatting. See this pub:


This is one of the primary test examples that passes:



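To make the '+' line question concrete, here is a minimal Python sketch of a reader that accepts both forms discussed in this thread: a bare '+' (as the Illumina pipeline writes it) and a '+' followed by a repeat of the first header. This is not BioPerl's implementation and not the test file mentioned above; the function names read_fastq and format_fastq are made up for illustration, and the sketch assumes strict four-line records with no wrapped sequence or quality lines.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from four-line FASTQ records.

    Assumption: sequence and quality each occupy exactly one line.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError("expected '@' header, got: %r" % header)
        if not plus.startswith("+"):
            raise ValueError("expected '+' line, got: %r" % plus)
        # The second header may be empty or may repeat the first one.
        if plus[1:] and plus[1:] != header[1:]:
            raise ValueError("second header does not match: %r" % plus)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in %r" % header)
        yield header[1:], seq, qual


def format_fastq(identifier, seq, qual):
    """Return one record as text, writing a bare '+' as the second header."""
    return "@%s\n%s\n+\n%s\n" % (identifier, seq, qual)


if __name__ == "__main__":
    # Two made-up records: one with a bare '+', one repeating the identifier.
    demo = "@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+r2/1\n!!!!\n"
    for rec in read_fastq(io.StringIO(demo)):
        print(format_fastq(*rec), end="")
```

Because the reader consumes lines in fixed groups of four, a quality string that happens to begin with '@' is not mistaken for a new header. Files with wrapped lines would need a more careful parser.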