[Bioperl-l] SearchIO speed up

Sendu Bala bix at sendu.me.uk
Mon Aug 14 13:57:37 EDT 2006

aaron.j.mackey at GSK.COM wrote:
>> And then of course the idea is that this is nested, so the parser for 
>> the result data is a Bio::Search::Result::ResultI but also a pull-parser 
>> in its own right (and so on for HitI and HSPI) with a need for 
>> random-access to the various bits of data needed to answer all the 
>> various methods of ResultI.
> the second- (and third- and so on) level parsers can work on in-memory 
> "blobs" (if seeking is unavailable), as these will be minute in 
> comparison; it's only the top-level SearchIO parser that need fuss about 
> streaming pipes and seekability.

Oh, I'd disagree with that. A file given to SearchIO may only have 1 
result in it, but that single result could be 99.999% of the 1000MB 
file. That result might have only one hit, taking 99.99% of the file. 
And then the user might only be interested in the first HSP, which takes 
0.001% of the file. You don't want to go around passing in-memory 
blobs of that size to your Result and Hit objects if you can avoid it.
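
To illustrate the point above, here is a minimal sketch (in Python rather 
Perl, and not the actual BioPerl code) of nested chunks that hold only 
byte offsets into the file and read data on demand. The names LazyChunk, 
sub_chunk and demo are hypothetical, as is the fake report layout.

```python
import os
import tempfile


class LazyChunk:
    """A region of a file identified by byte offsets; nothing is read until asked."""

    def __init__(self, path, start, end):
        self.path = path
        self.start = start  # absolute offset of the first byte of the chunk
        self.end = end      # absolute offset one past the last byte

    def read(self, rel_start=0, length=None):
        """Read only the requested slice of this chunk into memory."""
        size = self.end - self.start
        rel_start = min(max(rel_start, 0), size)
        if length is None or rel_start + length > size:
            length = size - rel_start
        with open(self.path, "rb") as fh:
            fh.seek(self.start + rel_start)
            return fh.read(length)

    def sub_chunk(self, rel_start, rel_end):
        """Create a nested chunk (e.g. a hit inside a result) without reading data."""
        return LazyChunk(self.path, self.start + rel_start, self.start + rel_end)


def demo():
    # Fake report: a 7-byte header, one 14-byte "HSP", then a large body.
    data = b"HEADER\n" + b"HSP1:score=42\n" + b"x" * 1_000_000
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        result = LazyChunk(path, 0, len(data))   # the whole file
        hit = result.sub_chunk(7, len(data))     # nearly all of the result
        first_hsp = hit.sub_chunk(0, 14)         # a tiny part of the hit
        return first_hsp.read()                  # only these 14 bytes are read
    finally:
        os.remove(path)
```

Here the result and hit chunks cover almost the whole file, yet asking for 
the first HSP pulls only that HSP's bytes into memory. This relies on the 
input being seekable, which is exactly what a pipe does not give you.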

>> I currently have a -piped_behaviour argument that accepts 'memory' or 
>> 'temp_file'.
> does it default to memory?

Yes, but the acceptable options and the defaults could vary for 
different pull-parser-based SearchIO modules. Since the goal here is 
increased speed of SearchIO, I'm tempted to say that even for a BLAST 
parser the default should be 'memory' (read everything in first).
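
As a rough sketch of what the two options could mean for non-seekable 
input (Python, not the BioPerl implementation): the function name 
make_seekable is hypothetical, and only the option values 'memory' and 
'temp_file' come from the discussion above.

```python
import io
import shutil
import tempfile


def make_seekable(stream, piped_behaviour="memory"):
    """Return a seekable binary handle holding the full contents of `stream`.

    'memory'    reads everything into RAM (fast, but costs memory).
    'temp_file' spools the data to an anonymous temporary file on disk.
    """
    if piped_behaviour == "memory":
        return io.BytesIO(stream.read())
    if piped_behaviour == "temp_file":
        tmp = tempfile.TemporaryFile()
        shutil.copyfileobj(stream, tmp)  # copies in fixed-size pieces
        tmp.seek(0)                      # rewind so parsing starts at the top
        return tmp
    raise ValueError("unknown piped_behaviour: %r" % (piped_behaviour,))
```

Either way the parser downstream gets a handle it can seek on; the choice 
only trades memory use against disk I/O.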

> fundamentally, parsing occurs when regular expressions operate on 
> in-memory blobs; so while you can keep lots of file pointers around to 
> define many largish blobs with minimal memory footprint, at some point 
> they need to become memory-resident for the parser to take effect. 

I try to keep a good balance here. I also throw away a blob as soon as 
I've parsed all the information I want out of it (which could be another 
irksome thing for a sequential_read of piped data: you either have to 
keep all blobs indefinitely, or do all your parsing sequentially, which 
makes us more like a push parser).
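
The "parse on demand, then discard the blob" idea can be sketched like 
this (Python, purely illustrative; the class PullRecord, its field names 
and its regular expressions are assumptions, not the real parser):

```python
import re


class PullRecord:
    """Parses fields lazily from a text blob and drops the blob when done."""

    # Hypothetical fields and patterns, for illustration only.
    FIELDS = {
        "name": re.compile(r"^>(\S+)", re.M),
        "score": re.compile(r"Score\s*=\s*(\d+)"),
    }

    def __init__(self, blob):
        self._blob = blob
        self._parsed = {}

    def get(self, field):
        if field not in self._parsed:
            match = self.FIELDS[field].search(self._blob)
            self._parsed[field] = match.group(1) if match else None
            # Once every known field has been extracted, free the raw text.
            if len(self._parsed) == len(self.FIELDS):
                self._blob = None
        return self._parsed[field]

    @property
    def holds_blob(self):
        return self._blob is not None
```

The blob can only be released because the set of fields is known up 
front; with piped input read sequentially, the same release would have to 
wait until the data can no longer be asked for again.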
