[Bioperl-l] EMBL/genbank organism parsing
jason.stajich at duke.edu
Fri Mar 10 14:49:05 EST 2006
Wonderful, thanks for stepping in.
One thing: this may be a good time to note that species data can be
better represented in the taxonomy objects, so we could ditch
Bio::Species and move to Bio::Taxonomy::Node (a sexy name, I know).
There is a little about this on the wiki in the project priority list,
http://bioperl.org/wiki/Project_priority_list - I *think* the fields in
the Taxonomy::Node object should be sufficient to separate out the field
you are talking about.
As to whether or not to break common_name behavior, I don't have any
opinion right now, but perhaps those who use this data from a file
can speak better to it.
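To make the two behaviors concrete, here is a minimal sketch of the
difference described in the quoted message below. It is plain Python with
only the standard library, not the embl.pm or genbank.pm code, and the
function names are hypothetical; it only assumes the simple
"Latin name (common name)" layout and passes anything more esoteric
through unsplit.

```python
import re

# Hypothetical illustration only: this is NOT the BioPerl parser code.
# Assumes the simple layout "Latin name (common name)".
_OS_PATTERN = re.compile(
    r"^\s*(?P<latin>[^()]+?)\s*(?:\((?P<common>[^()]*)\))?\s*$"
)


def split_organism(line):
    """Split an organism string into (latin_name, common_name_or_None)."""
    match = _OS_PATTERN.match(line)
    if match is None:
        # Names with several or nested brackets are left unsplit.
        return line.strip(), None
    common = match.group("common")
    if common is not None:
        common = common.strip() or None
    return match.group("latin"), common


def common_name_embl_style(line):
    """Only the bracketed English name, as the EMBL spec describes it."""
    return split_organism(line)[1]


def common_name_genbank_style(line):
    """The whole line, Latin name included."""
    return line.strip()
```

With "Homo sapiens (human)" as input, the first style yields "human" and
the second yields the full string, which is the inconsistency in
question.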
I encourage you to add some text on the wiki pages about whatever you
plan so that we can document what has happened - feel free to just
create a new page for this project and it can be linked in.
On Mar 9, 2006, at 9:16 AM, James Abbott wrote:
> Hi Folks,
> The current parsing of OS lines by Bio::SeqIO::embl.pm fails with many
> of the organisms currently found in the database, since the OS lines
> differ considerably from the specification in the EMBL User Manual,
> which appears to have been used as the basis for the current parser.
> In an attempt to improve matters, I have collected a set of examples
> which hopefully cover the majority of the different ways of writing an
> organism name, and managed to get embl.pm to 'correctly' parse these
> (correctly being open to debate with some of the more esoteric
> examples). I'm sure there are plenty of entries which still don't
> parse correctly, but it's a start. I'll post the patches to bugzilla
> once I get a few loose ends tidied up.
> In the interests of consistency, I have also obtained the same set of
> sequences from Genbank, and am trying to make both parsers behave the
> same way; however, they currently behave differently with respect to
> parsing the common name. According to the EMBL spec, the common name
> is the English name for the organism given in brackets after the Latin
> name, so calling the common_name method on an embl.pm-parsed
> Bio::Species object returns 'human' for a Homo sapiens (human). The
> genbank parser, however, currently takes the entire SOURCE line,
> including the Latin name, so calling the common_name method on a
> genbank.pm-parsed species object returns 'Homo sapiens (human)'. This
> would appear to be the intended behavior, since this is considered the
> correct response by the tests.
> Is it considered better to maintain consistency between the EMBL and
> Genbank parsers and risk breaking any code which relies upon the
> behavior of genbank->species->common_name(), or to have the two
> behaving differently, but consistently with their existing behavior?
> Dr. James Abbott <j.abbott at imperial.ac.uk>
> Bioinformatics Software Developer, Bioinformatics Support Service
> Imperial College, London