[Bioperl-l] EMBL/genbank organism parsing
hlapp at gmx.net
Thu Mar 9 13:54:36 EST 2006
Yeah the species parsing has bothered us for a long time.
My thoughts on this - I don't think tweaking individual parsers until
they behave as desired on a then-current set of examples is going to put
an end to this. Either species parsing will have to be moved into its
own set of 'drivers' with a fronting factory, like Bio::SpeciesIO or
Bio::TaxonIO, or alternatively like Bio::Factory::TaxonFactoryI and
Bio::Factory::EMBLTaxonFactory etc (similar in concept to
Bio::Factory::LocationFactoryI and Bio::Factory::FTLocationFactory).
Or, quite radical in approach, we require the NCBI taxonomy database
(or any other implementation of Bio::DB::Taxonomy, e.g. could be
through BioSQL or what not) and otherwise disclaim responsibility for
correctly parsing the species.
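To make the first option concrete, here is a rough sketch of format-specific drivers behind a fronting factory. It is written in Python purely for illustration and is not BioPerl code: Species, parse_embl_os, register_driver and species_from are all hypothetical names, and the EMBL driver only handles the simple 'Latin name (common name)' form.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Species:
    """Hypothetical minimal species record, for illustration only."""
    scientific_name: str
    common_name: Optional[str] = None


def parse_embl_os(text: str) -> Species:
    """Hypothetical driver for EMBL-style text such as 'Homo sapiens (human)'."""
    text = text.strip()
    # Assumption: a single trailing parenthesised group is the common name.
    match = re.match(r"^(?P<sci>.+?)\s*\((?P<common>[^()]+)\)\s*$", text)
    if match:
        return Species(match.group("sci"), match.group("common"))
    return Species(text)


# The fronting factory: a registry mapping a format name to its driver.
_DRIVERS: Dict[str, Callable[[str], Species]] = {}


def register_driver(fmt: str, driver: Callable[[str], Species]) -> None:
    """Register a parsing driver under a case-insensitive format name."""
    _DRIVERS[fmt.lower()] = driver


def species_from(fmt: str, text: str) -> Species:
    """Dispatch to the driver registered for fmt, or fail loudly."""
    try:
        driver = _DRIVERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"no species driver registered for format {fmt!r}")
    return driver(text)


register_driver("embl", parse_embl_os)

if __name__ == "__main__":
    print(species_from("embl", "Homo sapiens (human)"))
```

A GenBank driver would register the same way, so the sequence parsers would only ever talk to the factory and the format quirks would stay inside each driver.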
Even though a TaxonIO or TaxonFactory approach looks like the 'right'
way to do it in terms of software design principles, I can't help but wonder
why we really should spend much time on writing species line parsers
when NCBI has done the job for us already to put all species into a
compact (file-)database. If people really want to be 100% sure the
parser gets the species right, why not download the NCBI taxonomy
database, index it locally, and simply look-up by taxonID (which is in
the Organism line in EMBL and the feature table in GenBank). Although -
there could be a speed issue due to the recursive lookup - one would
probably want to cache each successful species resolution.
Sorry for not giving precise direction - ideally someone (you?) can
take charge and spearhead overhauling this.
On Mar 9, 2006, at 6:16 AM, James Abbott wrote:
> Hi Folks,
> The current parsing of OS lines by Bio::SeqIO::embl.pm fails with many
> of the organisms currently found in the database, since the OS lines
> differ considerably from the specification in the EMBL User Manual,
> which appears to have been used as the basis for the current parser. In
> an attempt to improve matters, I have collected a set of examples which
> hopefully cover the majority of the different ways of writing an
> organism name, and managed to get embl.pm to 'correctly' parse these
> (correctly being open to debate with some of the more esoteric
> examples). I'm sure there are plenty of entries which still don't parse
> correctly, but it's a start. I'll post the patches to bugzilla once I
> get a few loose ends tidied up.
> In the interests of consistency, I have also obtained the same set of
> sequences from Genbank, and am trying to make both parsers behave the
> same way, however they currently behave in different ways with respect
> to parsing the common name. According to the EMBL spec, the common name
> is the English name for the organism given in brackets after the latin
> name, consequently calling the common_name method on an embl.pm parsed
> Bio::Species object returns 'human' for a Homo sapiens (human). The
> genbank parser, however, currently takes the entire SOURCE line,
> including the latin name, consequently calling the common_name method
> on a genbank.pm parsed species object returns 'Homo sapiens (human)'. This
> would appear to be the intended behavior, since this is considered the
> correct response by the tests.
> Is it considered better to maintain consistency between the EMBL and
> Genbank parsers and risk breaking any code which relies upon the
> behavior of genbank->species->common_name(), or to have the two parsers
> behaving differently, but consistently with their existing behavior?
> Dr. James Abbott <j.abbott at imperial.ac.uk>
> Bioinformatics Software Developer, Bioinformatics Support Service
> Imperial College, London
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757