[Bioperl-l] fastq splitter

Florent Angly florent.angly at gmail.com
Wed Feb 29 17:33:04 EST 2012


Also, the desc() method returns only the part of the FASTQ header that 
follows the first whitespace, so the description never starts with a space.
Hence, instead of / 1:/, your regular expression should not have the 
space and should be written /1:/. In fact, it would be even better 
(faster) if it were written as an anchored regular expression that 
matches only the beginning of the description, /^1:/
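
Putting Mike's point (call desc() on the sequence object, not on the 
stream) together with the anchored regular expression, the loop could look 
roughly like the sketch below. I have not run it against your data; 
is_mate1() and split_fastq() are names I made up for illustration, and 
the output file names simply follow your original script.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a description such as "1:N:0:ATCACG"
# marks the first mate. Anchored, so only the start of the string is tested.
sub is_mate1 {
    my ($desc) = @_;
    return (defined($desc) && $desc =~ /^1:/) ? 1 : 0;
}

# Hypothetical wrapper around the original loop.
sub split_fastq {
    my ($infile) = @_;
    require Bio::SeqIO;    # loaded here so is_mate1() works without BioPerl

    my $seqin   = Bio::SeqIO->new(-file => "<$infile",     -format => 'fastq');
    my $seqout1 = Bio::SeqIO->new(-file => ">${infile}_1", -format => 'fastq');
    my $seqout2 = Bio::SeqIO->new(-file => ">${infile}_2", -format => 'fastq');

    while (my $inseq = $seqin->next_seq) {
        # desc() is called on the sequence object ($inseq), not the stream ($seqin)
        my $out = is_mate1($inseq->desc) ? $seqout1 : $seqout2;
        $out->write_seq($inseq);
    }
    return;
}

# Only runs the split when a file name is given on the command line.
split_fastq($ARGV[0]) if @ARGV;
```

As in your version, anything that is not recognised as mate 1 goes to the 
second file.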

Note that you are apparently using the latest Illumina header format 
(CASAVA 1.8 and later), which does not follow the previous convention for 
paired-end read headers, where the read ID ended in /1 or /2. Hence your 
script will not work properly with paired-end files in the older format.
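
If you need to cope with both header styles, one possible approach is to 
look at the description first and fall back to the read ID. This is only a 
sketch under my own assumptions: mate_number() is a made-up name, and the 
older-style ID in the usage comment is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 or 2, or undef if the mate cannot be determined.
sub mate_number {
    my ($id, $desc) = @_;

    # Newer headers: the description starts with "1:" or "2:"
    return $1 if defined($desc) && $desc =~ /^([12]):/;

    # Older headers: the read ID ends in "/1" or "/2"
    return $1 if defined($id) && $id =~ m{/([12])$};

    return undef;
}

# Usage inside the read loop would be something like:
#   my $mate = mate_number($inseq->display_id, $inseq->desc);
```

Reads for which the function returns undef are probably best reported 
rather than silently written to one of the two files.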

Florent



On 29/02/12 07:26, Michael Muratet wrote:
>
> On Feb 28, 2012, at 3:11 PM, Sean O'Keeffe wrote:
>
>> Hi,
>> I'm trying to write a quick script to separate one large PE fastq file
>> into 2 separate files, one for each mate pair
>>
>> The file is of the format (mate1)
>> @HWI-ST156:445:C0EDLACXX:4:1101:1496:1039 1:N:0:ATCACG
>> CTGCTGGTAGTGCCCAAAGACCTCGAATACAATGGGCTTGGTTTTGATGT
>> +
>> BCCFFFFEHHHHHJJJJJHIIJIJJIIGIJJJJJJJIJJJI?FHJJIIJA
>>
>> && (mate2)
>>
>> @HWI-ST156:445:C0EDLACXX:4:2308:20877:199811 2:Y:0:ATCACG
>> TCATAAAAATAACAAAACCACCACCCCATACAAACTCTACTCATCTCCAC
>> +
>> ##################################################
>>
>>
>> My idea is to separate using a regex such that / 1:/ would be the first
>> mate pair and / 2:/ would go in the second mate file.
>> I implemented the code below but each output file is empty. Can someone
>> spot my error?
>>
>> Thanks,
>> Sean.
>>
>> my $infile   = shift;
>> my $outfile1 = $infile."_1";
>> my $outfile2 = $infile."_2";
>>
>> my $seqin = Bio::SeqIO->new(
>>                             -file   => "<$infile",
>>                             -format => "fastq",
>>                             );
>> my $seqout1 = Bio::SeqIO->new(
>>                              -file   => ">$outfile1",
>>                              -format => "fastq",
>>                              );
>>
>> my $seqout2 = Bio::SeqIO->new(
>>                              -file   => ">$outfile2",
>>                              -format => "fastq",
>>                              );
>> while (my $inseq = $seqin->next_seq) {
>>    if ($seqin->desc =~ / 1:/){
> Hi Sean
>
> You're using the desc operator on the stream, not the seq object.
>
> Cheers
>
> Mike
>
>>      $seqout1->write_seq($inseq);
>>    } else {
>>      $seqout2->write_seq($inseq);
>>    }
>> }
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Michael Muratet, Ph.D.
> Senior Scientist
> HudsonAlpha Institute for Biotechnology
> mmuratet at hudsonalpha.org
> (256) 327-0473 (p)
> (256) 327-0966 (f)
>
> Room 4005
> 601 Genome Way
> Huntsville, Alabama 35806


