[Bioperl-l] Bio::DB::GenBank and complexity

Chris Fields cjfields at uiuc.edu
Tue May 2 12:19:34 EDT 2006


I ran into some wonkiness when using the extra parameters ('seq_start',
'seq_stop', 'strand', and 'complexity') with Bio::DB::GenBank, which I have
gone through, fixed, and committed.  I have also added a few tests to DB.t
covering all of them (all changes were in Bio::DB::WebDBSeqI and
Bio::DB::NCBIHelper).  The 'complexity' parameter is the strangest, though I
did manage to get it working as well (with tests).  This is how NCBI defines
complexity:

complexity regulates the display:
0 - get the whole blob
1 - get the bioseq for gi of interest (default in Entrez)
2 - get the minimal bioseq-set containing the gi of interest
3 - get the minimal nuc-prot containing the gi of interest
4 - get the minimal pub-set containing the gi of interest
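
As a rough sketch (not part of the commit), the codes above could be checked
before being handed to the constructor.  The helper name genbank_args and its
fallback to 1 when no complexity is given are assumptions made for
illustration; -seq_start, -seq_stop, -strand and -complexity are the
Bio::DB::GenBank parameters named above.

```perl
use strict;
use warnings;

# NCBI complexity codes, as listed above.
my %COMPLEXITY = (
    0 => 'whole blob',
    1 => 'bioseq for the gi of interest (Entrez default)',
    2 => 'minimal bioseq-set containing the gi of interest',
    3 => 'minimal nuc-prot containing the gi of interest',
    4 => 'minimal pub-set containing the gi of interest',
);

# Hypothetical helper: validate the options and return a named-argument
# list suitable for Bio::DB::GenBank->new.
sub genbank_args {
    my (%opt) = @_;
    my $c = $opt{complexity} // 1;    # assumed fallback: the Entrez default
    die "complexity must be 0-4, got '$c'\n" unless exists $COMPLEXITY{$c};

    my @args = (-format => 'fasta', -complexity => $c);
    for my $key (qw(seq_start seq_stop strand)) {
        # only pass along the optional parameters that were actually set
        push @args, "-$key" => $opt{$key} if defined $opt{$key};
    }
    return @args;
}

my @args = genbank_args(complexity => 1, seq_start => 100,
                        seq_stop   => 200, strand  => 1);
print "@args\n";
```

With BioPerl installed, the resulting list would be passed straight through,
i.e. Bio::DB::GenBank->new(@args).
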

Here's my quandary: when setting complexity to '0', you get a blob back (the
main sequence as well as any subsequences, such as CDS); this is in essence
a sequence stream with multiple alphabet types.  So, I now have it set up to
do this:

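use Bio::DB::GenBank;
use Bio::SeqIO;

my $acc = shift @ARGV or die "usage: $0 ACCESSION\n";

# complexity => 0 asks NCBI for the whole blob
my $factory = Bio::DB::GenBank->new(-format     => 'fasta',
                                    -complexity => 0
                                   );

# with complexity 0 this now returns a stream, not a single Bio::Seq
my $seqin  = $factory->get_Seq_by_acc($acc);
my $seqout = Bio::SeqIO->new(-format => 'fasta', -fh => \*STDOUT);

while (my $seq = $seqin->next_seq) {
    $seqout->write_seq($seq);
}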

I return a stream because I thought returning an array would be horrendously
expensive on memory, especially with larger sequences.  Currently this only
happens for sequences retrieved with complexity set to '0', so it's a fairly
unusual case.  Regardless, I'm worried that, since users expect
get_Seq_by_acc to return a Bio::Seq object rather than a Bio::SeqIO object,
it will cause a lot of confusion with the API.  Any suggestions/gripes?

Chris

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign 



