[Bioperl-l] RE: [Bioperl-guts-l] [Bug 1600] New: $gb->species->ncbi_taxid

Brian Osborne brian_osborne at cognia.com
Thu Mar 11 10:17:17 EST 2004


Right, some discussion is called for. What James is saying is that GenBank
entries with ORGANISM lines like "Unknown marine bacterium" should have
Species objects, although binomial() will return something like "unknown".
This Species object will be useful to those who want to use classification()
and ncbi_taxid(). Are there votes or comments?

Brian O.

-----Original Message-----
From: James Wasmuth [mailto:james.wasmuth at ed.ac.uk]
Sent: Thursday, March 11, 2004 9:40 AM
To: Brian Osborne
Cc: bioperl-guts-l at bioperl.org
Subject: Re: [Bioperl-guts-l] [Bug 1600] New: $gb->species->ncbi_taxid

Brian and all at bioperl-guts,

below is the comment I've added to the bug[1600].  I think it may need
some discussion, but the patch I've added works to the extent that it
allows creation of a Bio::Species object but the subsequent genus,
species, subspecies calls will be 'wrong'.  Personally I'm more
concerned with the taxid, which I think will be sufficient.

If you want to see the size of this problem go to NCBI taxonomy and
enter the term identified as a token set!  I think that maintaining the
taxid is enough, otherwise the artificial split of terms such as
"unidentified diatom endosymbiont of Peridinium foliaceum"
may be a problem, though some of them are intuitive.

One last question: I've never tried to fix a bug before, so I've
committed a patch as an attachment to Bugzilla for the bug.  Do others
check this and, if okay, place it in the code?
apologies for the newbie bit...



line 1123: return unless $genus and  $genus !~ /^(Unknown|None)$/oi;

a number of species are described as Unknown blah blah blah.

The NCBI taxid assigned to unknown taxa is 32644 and has a number of
synonyms, none of which are 'unknown'.

The list includes: other, unknown organism, not specified, not shown,
unspecified, Unknown, None, unclassified , unidentified organism
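
A minimal sketch of how the synonym list above could be checked, assuming
the whole organism name is compared case-insensitively. The helper name
is_unknown_taxon_name is hypothetical and is not part of BioPerl or of the
patch:

```perl
use strict;
use warnings;

# Synonyms listed above for the "unknown" taxon (NCBI taxid 32644).
my %UNKNOWN_NAME = map { ( lc($_) => 1 ) } (
    'other', 'unknown organism', 'not specified', 'not shown',
    'unspecified', 'Unknown', 'None', 'unclassified',
    'unidentified organism',
);

# Hypothetical helper: true only when the WHOLE name is one of the synonyms,
# so longer names that merely start with "unknown" are kept.
sub is_unknown_taxon_name {
    my ($name) = @_;
    return 0 unless defined $name;
    $name =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    return exists $UNKNOWN_NAME{ lc $name } ? 1 : 0;
}

for my $name ('Unknown', 'unknown marine gamma proteobacterium NOR5') {
    printf "%-45s %s\n", $name,
        is_unknown_taxon_name($name) ? 'skip' : 'keep';
}
```

Matching the full name rather than only its first word is what lets an
entry such as 'unknown marine gamma proteobacterium NOR5' through.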

I've changed the _read_GenBank_Species subroutine to allow organism
names such as 'unknown marine gamma proteobacterium NOR5'.  This will
create a Bio::Species object, but with genus=unknown and species=marine.

There is a whole host of species names that ignore the nice rules in
_read_GenBank_Species!  However this fix will allow the correct taxid to
be provided, which I think is more important than the name!
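
To make the genus=unknown, species=marine point concrete, here is a small
sketch of the whitespace split the parser applies to the ORGANISM name. The
helper split_organism is hypothetical and only mirrors the shift() calls in
the subroutine; it is not BioPerl code:

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the parser's whitespace split:
# first word -> genus, second -> species, third -> subspecies.
sub split_organism {
    my ($organism) = @_;
    my ($genus, $species, $sub_species) = split ' ', $organism;
    return ($genus, $species, $sub_species);
}

for my $name ('Bos taurus', 'unknown marine gamma proteobacterium NOR5') {
    my ($genus, $species, $sub) = split_organism($name);
    printf "%s => genus=%s species=%s subspecies=%s\n",
        $name, $genus, $species, defined $sub ? $sub : '(none)';
}
```

For the second name the split yields "unknown", "marine" and "gamma", which
is why the name-based accessors are 'wrong' for such entries while the taxid
is unaffected.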

sub _read_GenBank_Species {
    my( $self,$buffer) = @_;
    my @organell_names = ("chloroplast", "mitochondr");
    # only those carrying DNA, apart from the nucleus

    # synonyms NCBI uses for the "unknown" taxon (taxid 32644)
    my @unkn_names = ('other', 'unknown organism', 'not specified',
                      'not shown', 'Unspecified', 'Unknown', 'None',
                      'unclassified', 'unidentified organism');

    $_ = $$buffer;

    my( $sub_species, $species, $genus, $common, $organelle, @class,
        $ns_name );
    # upon first entering the loop, we must not read a new line -- the
    # line is already in the buffer (HL 05/10/2000)
    while (defined($_) || defined($_ = $self->_readline())) {
        # de-HTMLify (links that may be encountered here don't contain
        # escaped '>', so a simple-minded approach suffices)
        s/<[^>]+>//g;
        if (/^SOURCE\s+(.*)/o) {
            # FIXME this is probably mostly wrong (e.g., it yields things like
            # Homo sapiens adult placenta cDNA to mRNA
            # which is certainly not what you want)
            $common = $1;
            $common =~ s/\.$//; # remove trailing dot
        } elsif (/^\s{2}ORGANISM/o) {
            my @spflds = split(' ', $_);
            ($ns_name) = $_ =~ /\w+\s+(.*)/o;
            shift(@spflds); # ORGANISM

            if(grep { $_ =~ /^$spflds[0]/i; } @organell_names) {
                $organelle = shift(@spflds);
            }
            $genus = shift(@spflds);
            if(@spflds) {
                $species = shift(@spflds);
            } elsif ( grep { lc($_) eq lc($genus) } @unkn_names ) {
                # a bare "unknown" synonym: leave the species empty
                $species = '';
            } else {
                # there's no species name but it isn't unclassified
                $species = 'sp.';
            }
            $sub_species = shift(@spflds) if(@spflds);
        } elsif (/^\s+(.+)/o) {
            # only split on ';' or '.' so that
            # classification that is 2 words will
            # still get matched
            # use map to remove trailing/leading spaces
            push(@class, map { s/^\s+//; s/\s+$//; $_; } split /[;\.]+/, $1);
        } else {
            last;
        }

        $_ = undef; # Empty $_ to trigger read of next line
    }

    $$buffer = $_;

    # Don't make a species object if the name is empty or is itself one
    # of the "unknown" synonyms (e.g. "Unknown", "None", "not specified")
    return unless $genus;
    my $name = length($species) ? "$genus $species" : $genus;
    my $unkn = grep { lc($_) eq lc($name) } @unkn_names;
    return if $unkn;

    # Bio::Species array needs array in Species -> Kingdom direction
    if ($class[0] eq 'Viruses') {
        push( @class, $ns_name );
    }
    elsif ($class[$#class] eq $genus) {
        push( @class, $species );
    } else {
        push( @class, $genus, $species );
    }
    @class = reverse @class;

    my $make = Bio::Species->new();
    $make->classification( \@class, "FORCE" ); # no name validation please
    $make->common_name( $common      ) if $common;
    unless ($class[-1] eq 'Viruses') {
        $make->sub_species( $sub_species ) if $sub_species;
    }
    $make->organelle($organelle) if $organelle;
    return $make;
}
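
As noted in the quoted message below, the taxid is also carried by the
source feature's db_xref qualifier. A rough, BioPerl-free sketch of reading
it straight from the flat-file text follows; the helper taxid_from_flatfile
is hypothetical, and the record fragment is taken from the entry attached
to the bug:

```perl
use strict;
use warnings;

# Hypothetical helper: return the first taxon id found in a
# /db_xref="taxon:NNN" qualifier of a GenBank flat-file record, or undef.
sub taxid_from_flatfile {
    my ($record) = @_;
    return $record =~ m{/db_xref="taxon:(\d+)"} ? $1 : undef;
}

my $record = <<'END';
FEATURES             Location/Qualifiers
     source          1..1389
                     /organism="unknown marine gamma proteobacterium NOR5"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:145658"
END

my $taxid = taxid_from_flatfile($record);
print defined $taxid ? "taxid: $taxid\n" : "no taxon db_xref found\n";
```

This only covers the simple case of one source feature per record; it is a
stopgap for scripts, not a replacement for having $gb->species->ncbi_taxid
work.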

Brian Osborne wrote:

>Your guess is right, no Species is made because of the name. That's because
>genbank.pm normally looks at:
>ORGANISM Bos taurus
>And makes "Bos" the genus, and so on.
>If it sees:
>It refuses to make a Species object, and it's interpreting your ORGANISM
>line in the same way because it can't make a valid genus, that's the
>rule. Personally I'd say that I agree with its principle - how can we make
>a Species object without genus and species?
>You can get the taxid from a SeqFeature object, you already knew that.
>Brian O.
>-----Original Message-----
>From: bioperl-guts-l-bounces at portal.open-bio.org
>[mailto:bioperl-guts-l-bounces at portal.open-bio.org]On Behalf Of
>bugzilla-daemon at portal.open-bio.org
>Sent: Thursday, March 11, 2004 4:21 AM
>To: bioperl-guts-l at bioperl.org
>Subject: [Bioperl-guts-l] [Bug 1600] New: $gb->species->ncbi_taxid
>           Summary: $gb->species->ncbi_taxid
>           Product: Bioperl
>           Version: unspecified
>          Platform: PC
>        OS/Version: Linux
>            Status: NEW
>          Severity: normal
>          Priority: P2
>         Component: Bio::SeqIO
>        AssignedTo: bioperl-guts-l at bioperl.org
>        ReportedBy: james.wasmuth at ed.ac.uk
>I've included a genbank file for which I have been unable to extract the
>ncbi_taxid for using
>the error is:
>Can't call method "ncbi_taxid" on an undefined value
>in fact I don't get a Bio::Species object.  I'm sure it's because of the
>which is correct.
>I've tried looking into it, but could not find which Seq object creates the
>Bio::Species object.
>LOCUS       AY007676                1389 bp    DNA     linear   BCT
>DEFINITION  Unknown marine gamma proteobacterium NOR5 16S ribosomal RNA,
>            partial sequence.
>VERSION     AY007676.1  GI:12000362
>SOURCE      unknown marine gamma proteobacterium NOR5
>  ORGANISM  unknown marine gamma proteobacterium NOR5
>            Bacteria; Proteobacteria; Gammaproteobacteria.
>REFERENCE   1  (bases 1 to 1389)
>  AUTHORS   Eilers,H., Pernthaler,J., Peplies,J., Glockner,F.O., Gerdts,G.
>            Amann,R.
>  TITLE     Isolation of novel pelagic bacteria from the German bight and
>            seasonal contributions to surface picoplankton
>  JOURNAL   Appl. Environ. Microbiol. 67 (11), 5134-5142 (2001)
>  MEDLINE   21536174
>   PUBMED   11679337
>REFERENCE   2  (bases 1 to 1389)
>  AUTHORS   Eilers,H., Pernthaler,J., Peplies,J., Gloeckner,F.O.,
>            Schuett,C. and Amann,R.
>  TITLE     Identification and seasonal dominance of culturable marine
>  JOURNAL   Unpublished
>REFERENCE   3  (bases 1 to 1389)
>  AUTHORS   Eilers,H., Pernthaler,J., Peplies,J., Gloeckner,F.O.,
>            Schuett,C. and Amann,R.
>  TITLE     Direct Submission
>  JOURNAL   Submitted (29-AUG-2000) Molecular Ecology,
>            Celsiusstrasse 1, Bremen 28359, Germany
>FEATURES             Location/Qualifiers
>     source          1..1389
>                     /organism="unknown marine gamma proteobacterium NOR5"
>                     /mol_type="genomic DNA"
>                     /db_xref="taxon:145658"
>     rRNA            <1..>1389
>                     /product="16S ribosomal RNA"
>BASE COUNT      343 a    319 c    453 g    274 t
>        1 cgcgaaagta cttcggtatg agtagagcgg cggacgggtg agtaacgcgt aggaatctat
>       61 ccagtagtgg gggacaactc ggggaaactc gagctaatac cgcatacgtc ctaagggaga
>      121 aagcggggga tcttcggacc tcgcgctatt ggaggagcct gcgttggatt agctagttgg
>      181 tggggtaaag gcctaccaag gcgacgatcc atagctggtc tgagaggatg atcagccaca
>      241 ccgggactga gacacggccc ggactcctac gggaggcagc agtggggaat attgcgcaat
>      301 gggcgaaagc ctgacgcagc catgccgcgt gtgtgaagaa ggccttcggg ttgtaaagca
>      361 ctttcaattg ggaagaaagg ttagtagtta ataactgcta gctgtgacat tacctttaga
>      421 agaagcaccg gctaactccg tgccagcagc cgcggtaata cggaggtgcg agcgttaatc
>      481 ggaattactg ggcgtaaagc gcgcgtaggc ggtctgttaa gtcggatgtg aaagccccgg
>      541 gctcaacctg ggaattgcac ccgatactgg ccgactggag tgcgagagag ggaggtagaa
>      601 ttccacgtgt agcggtgaaa tgcgtagata tgtggaggaa taccggtggc gaaggcggcc
>      661 tcctggctcg acactgacgc tgaggtgcga aagcgtgggg agcaaacagg attagatacc
>      721 ctggtagtcc acgccgtaaa cgatgtctac tagccgttgg gagacttgat ttcttggtgg
>      781 cgaagttaac gcgataagta gaccgcctgg ggagtacggc cgcaaggtta aaactcaaat
>      841 gaattgacgg gggcccgcac aagcggtgga gcatgtggtt taattcgatg caacgcgaag
>      901 aaccttacca ggccttgaca tcctaggaat cctgtagaga tacgggagtg ccttcgggaa
>      961 tctagtgaca ggtgctgcat ggctgtcgtc agctcgtgtc gtgagatgtt gggttaagtc
>     1021 ccgtaacgag cgcaaccctt gtccttagtt gccagcgcgt aatggcggga actctaagga
>     1081 gactgccggt gacaaaccgg aggaaggtgg ggacgacgtc aagtcatcat ggcccttacg
>     1141 gcctgggcta cacacgtgct acaatggaac gcacagaggg cagcaaaccc gcgaggggga
>     1201 gcgaatccca caaaacgttt cgtagtccgg atcggagtct gcaactcgac tccgtgaagt
>     1261 cggaatcgct agtaatcgtg aatcagaatg tcacggtgaa tacgttcccg ggccttgtac
>     1321 acaccgcccg tcacaccatg ggagtgggtt gctccagaag tggttagcct aaccttcggg
>     1381 agggcgatc

"I have not failed. I've just found 10,000 ways that don't work."
               --- Thomas Edison

Nematode Bioinformatics           ||
Blaxter Nematode Genomics Group   ||
School of Biological Sciences     ||
Ashworth Laboratories             ||
King's Buildings                  ||    tel: +44 131 650 7403
University of Edinburgh           ||    web: www.nematodes.org
Edinburgh                         ||
EH9 3JT                           ||
UK                                ||
