[Bioperl-guts-l] [Bug 2092] Can't store proteins with species using bioperl-db

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Tue Jan 30 00:41:21 EST 2007


http://bugzilla.open-bio.org/show_bug.cgi?id=2092


cjfields at uiuc.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |cjfields at uiuc.edu




------- Comment #7 from cjfields at uiuc.edu  2007-01-30 00:41 -------
(In reply to comment #5)
> Created an attachment (id=461)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=461&action=view)
> Patch for t/08genbank.t
> 
> Shows that functionality and old-behaviour is present, but doesn't correspond
> to new behaviour.

This is related to the problem seen in Bug 2197, using CP000026 (I used the
first 1000 bp).  Bio::Species classification() and Brian's fix work fine with
the sequence record when using Bio::SeqIO.  With bioperl-db and taxonomy
preloaded in the database, however, the classification is reset in
populate_from_row() in Bio::DB::BioSQL::SpeciesAdaptor.  The genus name is
then removed from the species node name:

...
    # in the species object we store the species element without the
    # genus, and similarly for the sub-species and variant
    for(my $i = scalar(@$clf)-2; $i >= 0; $i--) {
        # if this node's name matches the start of the previous one,
        # remove this portion from the previous one's name
        if(index($clf->[$i+1]->[0], $clf->[$i]->[0]) == 0) {
            $clf->[$i+1]->[0] = substr($clf->[$i+1]->[0],
                                       length($clf->[$i]->[0])+1);
        }
        # don't do this stuff beyond genus and species
        last if $clf->[$i]->[1] eq "genus";
    }
...

This explains the differences seen in Sendu's tests; the patched genbank tests
pass if the above lines are commented out.
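
To make the effect of that loop concrete, here is a minimal standalone sketch
that runs the same logic on a made-up lineage.  The sub name
strip_parent_prefixes and the sample data are only for illustration and are not
part of bioperl-db; the ancestor-first ordering of the array is inferred from
the way the loop indexes it.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the loop quoted above. Each element of $clf
# is [ name, rank ], assumed to be ordered from ancestor to descendant.
sub strip_parent_prefixes {
    my ($clf) = @_;
    for (my $i = scalar(@$clf) - 2; $i >= 0; $i--) {
        # if the parent's name is a prefix of the child's name,
        # drop that prefix (plus the separating space) from the child
        if (index($clf->[$i+1]->[0], $clf->[$i]->[0]) == 0) {
            $clf->[$i+1]->[0] = substr($clf->[$i+1]->[0],
                                       length($clf->[$i]->[0]) + 1);
        }
        # stop once the genus has been handled
        last if $clf->[$i]->[1] eq 'genus';
    }
    return $clf;
}

# Illustrative lineage only, not taken from the bug report.
my $clf = [
    [ 'Hominidae',    'family'  ],
    [ 'Homo',         'genus'   ],
    [ 'Homo sapiens', 'species' ],
];

strip_parent_prefixes($clf);
print join(' | ', map { $_->[0] } @$clf), "\n";
```

After the call, the species element holds only 'sapiens', so the full binomial
can no longer be read from the species node alone.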

Even with the code above removed, Bio::Species still calls throw(), since the
species node from NCBI Taxonomy ('Salmonella paratyphi') does not match the
node name ('Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC
9150').  Because of that, I loosened the checks Brian added in Bio::Species,
using regexes instead of exact matches, and changed the throw() to a warn().
The sequence loads fine after that.
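
As a rough sketch of what "looser" could mean here (this is not the actual
change made to Bio::Species, and names_loosely_match is a hypothetical helper),
a regex-based check might accept the two names when every word of the species
name appears, case-insensitively, somewhere in the node name, and only warn
when it does not:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::Species: true when every word of
# the species name occurs as a whole word, ignoring case, in the node name.
sub names_loosely_match {
    my ($species_name, $node_name) = @_;
    for my $word (split /\s+/, $species_name) {
        return 0 unless $node_name =~ /\b\Q$word\E\b/i;
    }
    return 1;
}

# The two names quoted in this comment.
my $species = 'Salmonella paratyphi';
my $node    = 'Salmonella enterica subsp. enterica serovar '
            . 'Paratyphi A str. ATCC 9150';

# An exact comparison rejects the pair.
my $strict = ($species eq $node) ? 1 : 0;

# The loose comparison accepts it; a mismatch would only produce a warning.
my $loose = names_loosely_match($species, $node);
warn "species name '$species' does not match node name '$node'\n"
    unless $loose;

print "strict=$strict loose=$loose\n";
```

Whether a word-level match like this is strict enough is part of the question
of how much validation classification() should do at all.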

Based on the above example, should we be performing name validation in
Bio::Species::classification() and throwing on a (possible) failure, or just
warning?

Also, do we want to switch to having the full species node name in bioperl-db
(removing the code above seemed to do the trick)?  Or will this cause too many
headaches down the line?

