cjfields at uiuc.edu
Fri Mar 2 09:35:34 EST 2007
The current parsers are slightly faster, but not enough to make a
huge difference unless you're parsing thousands of sequences.
However, this does demonstrate that a good deal of the performance
problem stems from object creation rather than parsing, which is
already a known issue. For instance, if you do everything up to (but
skip) instantiation of an object, like a SeqFeature/Annotation/Species,
parsing speeds up dramatically, depending on the number of objects
that would otherwise be created. I also saw significant increases in
speed when using FTHelper (instead of SeqFeatures) or Bio::Taxon
(instead of Bio::Species), so lighter objects definitely help.
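To make the "parse everything but skip instantiation" comparison concrete, here is a minimal sketch. It is not BioPerl code: My::HeavyFeature and parse_features() are hypothetical names made up for illustration, and the input lines are a toy stand-in for a feature table. It only shows that the same parsing loop can run with or without creating an object per feature.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a heavyweight feature class (not a BioPerl class).
package My::HeavyFeature;
our $created = 0;    # counts how many objects have been instantiated
sub new {
    my ($class, %args) = @_;
    $created++;
    return bless { %args }, $class;
}

package main;

# Parse toy "key location" lines. The parsing work is identical in both
# modes; only the final step (plain hash vs. object) differs.
sub parse_features {
    my ($lines, $build_objects) = @_;
    my @out;
    for my $line (@$lines) {
        my ($key, $loc) = $line =~ /^\s*(\S+)\s+(\S+)/ or next;
        my %data = (primary_tag => $key, location => $loc);
        push @out, $build_objects ? My::HeavyFeature->new(%data) : {%data};
    }
    return \@out;
}

my @lines = ('gene 1..100', 'CDS 10..90', 'misc_feature 5..20');

my $raw = parse_features(\@lines, 0);    # parse only, no objects
printf "parsed %d records, objects created: %d\n",
    scalar @$raw, $My::HeavyFeature::created;

my $objs = parse_features(\@lines, 1);   # parse and instantiate
printf "parsed %d records, objects created: %d\n",
    scalar @$objs, $My::HeavyFeature::created;
```

To put numbers on the difference, the two modes could be timed against each other with cmpthese() from the core Benchmark module on a realistically sized input.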
I basically just separate the two key steps (parsing and object
creation) into two distinct tasks (driver and handler); I haven't
thought much about validation, though I would probably separate that
into a third task. Regardless, the current drivers are flexible enough
to deal with the occasional oddity and not die. This setup is much
easier to maintain and extend; for instance, if you wanted to develop
lightweight objects it's now easier to accomplish (i.e. rewrite/overload
a handler vs. rewrite next_seq()), and you can separately develop a
faster driver via next_seq() as long as it emits the same data
structure.
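A rough sketch of the driver/handler split described above, assuming a simple chunk format: every name here (My::Handler, data_handler, the NAME/DATA hash keys, genbank_driver, embl_driver, %embl_map) is hypothetical and chosen for illustration, not taken from the actual modules. The point is only that two format-specific drivers can feed one handler as long as they pass along the same data structure.

```perl
use strict;
use warnings;

# Hypothetical handler: receives data chunks and decides what to build.
# Here it just records them; a real one might create (lightweight) objects.
package My::Handler;
sub new {
    my $class = shift;
    return bless { seen => [] }, $class;
}
sub data_handler {
    my ($self, $chunk) = @_;
    push @{ $self->{seen} }, "$chunk->{NAME}=$chunk->{DATA}";
    return;
}
sub seen { my $self = shift; return @{ $self->{seen} } }

package main;

# Hypothetical GenBank-style driver: an upper-case keyword starts the line.
sub genbank_driver {
    my ($handler, @lines) = @_;
    for (@lines) {
        next unless /^([A-Z]+)\s+(.+)$/;
        $handler->data_handler({ NAME => $1, DATA => $2 });
    }
    return;
}

# Hypothetical EMBL-style driver: two-letter line codes are mapped onto
# the same chunk names, so the handler never sees the format difference.
my %embl_map = (ID => 'LOCUS', DE => 'DEFINITION', AC => 'ACCESSION');
sub embl_driver {
    my ($handler, @lines) = @_;
    for (@lines) {
        next unless /^([A-Z]{2})\s+(.+)$/ && exists $embl_map{$1};
        $handler->data_handler({ NAME => $embl_map{$1}, DATA => $2 });
    }
    return;
}

my $gb   = My::Handler->new;
my $embl = My::Handler->new;
genbank_driver($gb, 'LOCUS       AB000001', 'DEFINITION  test record');
embl_driver($embl,  'ID   AB000001',        'DE   test record');

print join(', ', $gb->seen),   "\n";
print join(', ', $embl->seen), "\n";
```

Swapping in a different handler class with the same data_handler method would change what gets built without touching either driver.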
Multiple parsers can also use the same handler. I currently have
GenBank/EMBL/SwissProt all sharing the same handler and passing all
On Mar 2, 2007, at 12:08 AM, Heikki Lehvaslaiho wrote:
> This sounds great. Is the speed increase noticeable?
> On Thursday 01 March 2007 17:24:03 Chris Fields wrote:
>> I do have a rough outline of what I think could be done:
>> where you could switch out handlers to deal with incoming data
>> chunks. Any suggestions there are welcome.
>> I'll probably commit examples of the above in the next week or two
>> (GenBank, EMBL, Swiss parsers using the same handlers) which don't
>> use FTHelper. So far I have all three passing tests based on
>> embl/swiss.t but they need a few more tweaks before I commit.
>> On Mar 1, 2007, at 5:02 AM, Heikki Lehvaslaiho wrote:
>>> It was meant to collect code that was common to all three main
>>> databases using similar feature tables.
>>> Now might be the time to optimise the parsing speed by removing it.
>>> Do you have a plan how to do it?
>>> On Tuesday 27 February 2007 22:57:40 Chris Fields wrote:
>>>> Could anyone tell me what FTHelper is used for? From what I gather
>>>> it rolls up seqfeature data into a lightweight object but then
>>>> creates a SeqFeature::Generic anyway (at least for GenBank/EMBL/
>>>> Swiss), which seems to be a waste of memory and time. Is there
>>>> something I'm missing (besides my sanity of course)?
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>> ______ _/ _/
>>> _/ _/
>>> _/ _/ _/ Heikki Lehvaslaiho heikki at_sanbi _ac _za
>>> _/_/_/_/_/ Associate Professor skype: heikki_lehvaslaiho
>>> _/ _/ _/ SANBI, South African National Bioinformatics
>>> _/ _/ _/ University of Western Cape, South Africa
>>> _/ Phone: +27 21 959 2096 FAX: +27 21 959 2512
>>> ___ _/_/_/_/_/
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign