[Bioperl-l] Bio::SeqIO, genbank -> fasta, protein only?

Kevin Brown Kevin.M.Brown at asu.edu
Mon Oct 16 12:14:51 EDT 2006


> > Yes, people use the -alphabet parameter. If you set it to 
> something then
> > Bioperl will not try to determine whether the sequence is 
> protein, rna, or
> > dna and this is particularly useful when the sequence 
> contains characters
> > that Bioperl would object to (sequences with distasteful 
> characters can be
> > created by various applications, for example, or you might 
> introduce some
> > weird character for some reason). Setting the -alphabet 
> would also speed up
> > Bioperl a bit, for the same reason.
> 
> Huh. That's what I assumed when I stumbled into the -alphabet 
> parameter. So I thought this would read the protein sequences 
> out of my genbank file and write a fasta file for me:
> 
> my $seq_in  = Bio::SeqIO->new(
>    -file     => "<$file",  
>    -format   => "genbank",
>    -alphabet => "protein"  # No effect?
> );
> my $seq_out = Bio::SeqIO->new(
>    -file     => ">$outfile",
>    -format   => "fasta",
>    -alphabet => "protein"  # No effect?
> );
> while (my $inseq = $seq_in->next_seq) {
>    $inseq->molecule("protein");    # No effect?
>    $seq_out->write_seq($inseq);
> }
> 
> It didn't. Would it be a Good Thing if it did what I was 
> expecting? (Like I said I rolled my own, but I'm always 
> looking for ways to enhance BioPerl that other people might 
> find useful... Someday I will contribute something useful, by 
> golly. -grin-)
> 
> (Background: I'm doing protein BLASTs from genbank files. To 
> make formatdb happy I have to have fasta files full of the 
> protein sequences.)

This might work for your needs (CDS to protein FASTA).

my $seq_in  = Bio::SeqIO->new(
   -file     => "<$file",  
   -format   => "genbank",
);

open my $seq_out, ">$outfile";

while (my $inseq = $seq_in->next_seq) {
   print $seq_out ">". $inseq->display_id(). "\n";
   print $seq_out $inseq->translate() ."\n";
}



More information about the Bioperl-l mailing list