[Bioperl-l] SearchIO speed up
bix at sendu.me.uk
Mon Aug 14 13:57:37 EDT 2006
aaron.j.mackey at GSK.COM wrote:
>> And then of course the idea is that this is nested, so the parser for
>> the result data is a Bio::Search::Result::ResultI but also a pull-parser
>> in its own right (and so on for HitI and HSPI) with a need for
>> random-access to the various bits of data needed to answer all the
>> various methods of ResultI.
> the second- (and third- and so on) level parsers can work on in-memory
> "blobs" (if seeking is unavailable), as these will be minute in
> comparison; it's only the top-level SearchIO parser that need fuss about
> streaming pipes and seekability.
Oh, I'd disagree with that. A file given to SearchIO may have only one
result in it, but that single result could be 99.999% of a 1000MB
file. That result might have only one hit, taking up 99.99% of the file.
And then the user might only be interested in the first HSP, which takes
up 0.001% of the file. You don't want to go around handing in-memory
blobs of that size to your Result and Hit objects if you can avoid it.
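
To make the point concrete, here is a minimal sketch in Python (not BioPerl
code) of a nested chunk that remembers byte offsets into a seekable handle
instead of holding its text. The class name LazyChunk, the helper
first_line_with_prefix and the "RESULT/HIT/HSP" line format are all invented
for illustration; a real parser would use the report's own delimiters.

```python
import io


class LazyChunk:
    """Hypothetical parser node: stores where its text lives, not the text."""

    def __init__(self, handle, start, end):
        self.handle = handle  # any seekable binary handle
        self.start = start    # absolute byte offset, inclusive
        self.end = end        # absolute byte offset, exclusive

    def read(self):
        # Only now does this chunk's text become memory-resident.
        self.handle.seek(self.start)
        return self.handle.read(self.end - self.start)

    def child(self, start, end):
        # A child (hit inside result, HSP inside hit) is just a narrower span.
        if not (self.start <= start <= end <= self.end):
            raise ValueError("child span must lie inside the parent span")
        return LazyChunk(self.handle, start, end)


def first_line_with_prefix(chunk, prefix):
    """Scan the chunk line by line and return the first matching line as a
    child chunk, or None. Only one line is held in memory at a time."""
    chunk.handle.seek(chunk.start)
    pos = chunk.start
    while pos < chunk.end:
        line = chunk.handle.readline()
        if not line:
            break
        if line.startswith(prefix):
            return chunk.child(pos, min(pos + len(line), chunk.end))
        pos += len(line)
    return None


# Invented data: a small first HSP followed by a very large second one.
data = b"RESULT query1\nHIT subject1\nHSP first\nHSP " + b"x" * 1000 + b"\n"
result = LazyChunk(io.BytesIO(data), 0, len(data))
first_hsp = first_line_with_prefix(result, b"HSP")
print(first_hsp.read())
```

The result chunk spans the whole input, yet asking for the first HSP never
loads the large second one; only the offsets are passed down the hierarchy.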
>> I currently have a -piped_behaviour argument that accepts 'memory' or
> does it default to memory?
Yes, but the acceptable options and the defaults could vary for
different pull-parser-based SearchIO modules. Since the goal here is
increased speed of SearchIO, I'm tempted to say that even for a BLAST
parser the default should be 'memory' (read everything in first).
> fundamentally, parsing occurs when regular expressions operate on
> in-memory blobs; so while you can keep lots of file pointers around to
> define many largish blobs with minimal memory footprint, at some point
> they need to become memory-resident for the parser to take effect.
I try to keep a good balance here. I also throw away a blob as soon as
I've parsed all the information I want out of it (which could be another
irksome thing for a sequential_read of piped data; you either have to
keep all blobs indefinitely, or do all your parsing sequentially, making
us more like a push parser).
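
As a rough illustration of the "parse, then throw the blob away" idea, here
is a small Python sketch (again, not BioPerl code). The class ParsedOnce, its
loads counter and the "Score = N Identities = N" line are made up for the
example and only loosely resemble a BLAST report. It assumes a seekable
handle; with non-seekable piped input the blob could not be re-read later,
which is the trade-off described above.

```python
import io
import re


class ParsedOnce:
    """Hypothetical node that loads its blob on first access, extracts every
    field it cares about, and keeps only the extracted fields."""

    def __init__(self, handle, start, end):
        self.handle = handle
        self.start = start
        self.end = end
        self._fields = None
        self.loads = 0  # counts how often the blob was read (for the demo)

    def _parse(self):
        self.handle.seek(self.start)
        blob = self.handle.read(self.end - self.start).decode("ascii")
        self.loads += 1
        self._fields = {
            "score": int(re.search(r"Score = (\d+)", blob).group(1)),
            "identities": int(re.search(r"Identities = (\d+)", blob).group(1)),
        }
        # blob is a local variable, so it is released when this method returns

    def get(self, name):
        if self._fields is None:
            self._parse()
        return self._fields[name]


# Invented input: one line of interest surrounded by other text.
line = b"Score = 52 Identities = 30\n"
data = b"header\n" + line + b"trailer\n"
hsp = ParsedOnce(io.BytesIO(data), len(b"header\n"), len(b"header\n") + len(line))
print(hsp.get("score"), hsp.get("identities"), hsp.loads)
```

Both lookups are answered from a single read of the blob, and nothing but
the two extracted numbers stays in memory afterwards.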