[Bioperl-l] Categorization of EST's by species/taxonomy/lineage

Jason Stajich jason at cgt.duhs.duke.edu
Mon May 3 09:05:27 EDT 2004

Ugh, I hope you are not really going to do this. I'll post code that
should work with Bio::DB::Taxonomy, since this is what the module was
intended to do.
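
For illustration only (this is not the Bio::DB::Taxonomy code mentioned
above): a minimal sketch of what "proper recursion" over parent links
could look like, using a toy in-memory table instead of a live taxonomy
source. The %node table, its heavily abbreviated parent links, and the
helper name lineage() are assumptions made for this example; nothing
here is part of BioPerl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy taxonomy table: taxon id => [parent id, scientific name].
# Abbreviated for illustration (many intermediate ranks are skipped);
# a real table would come from a taxonomy database or dump.
my %node = (
    2759  => [ 0,     'Eukaryota'    ],
    33208 => [ 2759,  'Metazoa'      ],
    7742  => [ 33208, 'Vertebrata'   ],
    40674 => [ 7742,  'Mammalia'     ],
    9443  => [ 40674, 'Primates'     ],
    9604  => [ 9443,  'Hominidae'    ],
    9605  => [ 9604,  'Homo'         ],
    9606  => [ 9605,  'Homo sapiens' ],
);

# Hypothetical helper: return the lineage from the top of the table
# down to the given taxon id by following parent links recursively.
sub lineage {
    my ($id) = @_;
    my $entry = $node{$id} or return ();   # unknown id, or past the top
    my ($parent, $name) = @$entry;
    return ( lineage($parent), $name );
}

print join('; ', lineage(9606)), "\n";
```

The recursion stops as soon as an id is not in the table, so the
top-level entry simply points at a parent id (0) that has no record.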

On Thu, 29 Apr 2004, Paulo Almeida wrote:

> Perhaps a little far-fetched, but what if you wrote a script that goes like this:
> Read sequence from flat file.
> If it's from a new species,
>     Blast against mRNA database AND species database
>     Check hits until you get one with the full lineage
> End if
> Sort it based on the lineage
> I'm suggesting this only because of what you said about mRNA records
> tending to have the full lineage; I have no idea whether it would work.
> To blast against mRNA for the desired species you would add something like:
> # concatenate so that $species is interpolated (single quotes would not do it)
> $Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} = $species . ' [ORGN] AND biomol_mrna [PROP]';
> -Paulo Almeida
> Mark Johnson wrote:
> >     I've got a bunch of flat files containing EST sequences (GenBank
> >format) from the NCBI ftp site.  I'd like to sort through them,
> >categorize them, and build some blast databases.  It would be nice to
> >be able to sort them into a few different piles, such as vertebrate,
> >invertebrate, fungi, species1, species2, speciesN, etc.
> >     To this end, having the full 'lineage' available would be handy.
> >However, EST records from the EST database only have the organism
> >(unlike, say, mRNA records from the nucleotide database, which tend
> >to have the full lineage (Eukaryota; Metazoa; Chordata; Craniata;
> >Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini;
> >Hominidae; Homo)).
> >     With mRNA records from the nucleotide database, this is an easy job:
> >just call $seq->species->classification() and sort through the list.
> >However, with these EST files from dbEST, that doesn't work; the
> >resulting list is empty.
> >     I initially had high hopes after discovering Bio::DB::Taxonomy, but
> >there are some bugs in the 1.4 version, and even upgrading to the
> >latest in CVS, I can't seem to find a way to get the full lineage:
> >
> >use Bio::DB::Taxonomy;
> >
> >#Bio::DB::Taxonomy (Well, really Bio::DB::Taxonomy::entrez)
> >my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
> >my $taxaid = $db->get_taxonid('Homo sapiens');
> >
> >#Bio::Taxonomy::Node
> >my $taxobj = $db->get_Taxonomy_Node(-taxonid => $taxaid);
> >
> >#@classification contains only 'sapiens' and 'homo'.
> >my @classification = $taxobj->classification();
> >
> >Looking at the code for the classification method, I came across this
> >comment:  # okay this won't really work - need to do proper recursion
> >
> >So...is there a way to get to where I want to be without hacking on the
> >module(s) in some terribly caveman-like fashion?
> >
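
Once a full lineage is available for each record, the "piles" described
in the original question could be assigned with a simple lookup. The
following is only a sketch under stated assumptions: the helper name
pile_for(), the pile names, and the toy EST identifiers and lineages
are all made up for illustration, and a lineage is assumed to be a list
of rank names ordered from most general to most specific.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: choose a pile name from a lineage.
# Vertebrata is checked before Metazoa, so any remaining animal
# falls into the invertebrate pile.
sub pile_for {
    my @lineage = @_;
    my %seen = map { $_ => 1 } @lineage;
    return 'vertebrate'   if $seen{Vertebrata};
    return 'invertebrate' if $seen{Metazoa};
    return 'fungi'        if $seen{Fungi};
    return 'other';
}

# Toy input: EST identifier => lineage (both invented for this example).
my %est = (
    est_a => [qw(Eukaryota Metazoa Vertebrata Mammalia)],
    est_b => [qw(Eukaryota Metazoa Arthropoda)],
    est_c => [qw(Eukaryota Fungi Ascomycota)],
);

# Group the identifiers by pile.
my %piles;
for my $id (sort keys %est) {
    push @{ $piles{ pile_for(@{ $est{$id} }) } }, $id;
}

print "$_: @{ $piles{$_} }\n" for sort keys %piles;
```

Per-species piles could be added the same way by checking the last
element of the lineage before falling back to the broader groups.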

Jason Stajich
Duke University
jason at cgt.mc.duke.edu
