[Bioperl-l] Hierarchical location parsing

Brian Osborne brian_osborne at cognia.com
Tue Mar 29 08:09:05 EST 2005


I didn't see any "join(join..." statements in that Genbank entry, as part of
a source feature or anywhere else. I used this URL:


Brian O.

-----Original Message-----
From: bioperl-l-bounces at portal.open-bio.org
[mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Mark Hoebeke
Sent: Friday, March 25, 2005 3:24 PM
To: Brian Osborne
Cc: bioperl-l at portal.open-bio.org
Subject: RE: [Bioperl-l] Hierarchical location parsing


An example of a nested location is found in the 'source' feature of the
Genbank entry having accession AE014074 (Streptococcus pyogenes MGAS315
complete genome). As the file is over 1 Meg in size once compressed, it
might not be a good idea to attach it to this mail, which is CC'ed to
bioperl-l ;D

Regarding the performance hit of my fix, I feared that replacing a
compiled regexp with a split and a loop over every character of the
string could have a significant impact. As it stands, I timed a simple
parsing script swallowing Genbank files and spitting out each feature
location as a GFF string, on 131 complete microbial genomes. There is no
difference in output between the bioperl-live FTLocationFactory and its
patched version (basically meaning that this test sample did not contain
nested locations). The times are comparable, with even a slight
advantage to the patched version (915.66user 19.53system 15:42.19elapsed
99%CPU vs. 938.06user 17.33system 16:04.15elapsed 99%CPU).
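
For readers who want to see the idea of "a split and a loop over every
character" in concrete form, here is a minimal sketch. It is written in
Python purely for illustration and is not the FTLocationFactory patch
itself (which is Perl). The function names split_top_level and
parse_location are hypothetical, and only the join, order and complement
operators are assumed to occur.

```python
def split_top_level(s):
    """Split s on commas that are not enclosed in parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):          # one pass over every character
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')' at offset %d" % i)
        elif ch == "," and depth == 0:  # a comma outside any parentheses
            parts.append(s[start:i])
            start = i + 1
    if depth != 0:
        raise ValueError("unbalanced '(' in location string")
    parts.append(s[start:])
    return parts


def parse_location(s):
    """Return a plain range string, or (operator, [sub-locations])."""
    s = s.strip()
    for op in ("join", "order", "complement"):  # assumed operator set
        if s.startswith(op + "(") and s.endswith(")"):
            inner = s[len(op) + 1:-1]
            return (op, [parse_location(p) for p in split_top_level(inner)])
    return s  # a leaf such as "1..100"


if __name__ == "__main__":
    print(parse_location("join(join(1..100,200..300),400..500)"))
```

Tracking the parenthesis depth is what lets the inner join(...) survive as
a single unit; a pattern that simply splits on every comma would cut it in
the middle.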

When comparing the outputs of the parser run on a file with a nested
location, it appears that without the bugfix, the nested location yields
an incorrect GFF string as shown by the diff below.

[mark at homer Loc]$ diff MGAS315 MGAS315_patched

I'm still cautious about the bugfix because I only produced the diffs
on microbial genomes, which probably have simpler location definitions
than those of higher eukaryotes.



On Friday 25 March 2005 at 11:52 -0500, Brian Osborne wrote:
> Mark,
> Can you also attach the sequence file that you used in order to test your
> code? That way I can write a test specifically for the parsing of
> hierarchical locations.
> You wrote "I'm not sure the new patch won't slow down location parsing
> considerably..." Have you actually timed the parsing using the old and new
> code?
> Thanks again,
> Brian O.

--------------------------Mark.Hoebeke at jouy.inra.fr----------------------
Unité Statistique & Génome                                     Unité MIG
+33 (0)1 60 87 38 03                  Tél.          +33 (0)1 34 65 28 85
+33 (0)1 60 87 38 09                  Fax.          +33 (0)1 34 65 29 01
Tour Evry 2, 523 pl. des Terrasses             INRA - Domaine de Vilvert
F - 91000 Evry                             F - 78352 Jouy-en-Josas CEDEX
