[Bioperl-l] SearchIO speed up

Sendu Bala bix at sendu.me.uk
Mon Aug 14 12:24:58 EDT 2006

aaron.j.mackey at gsk.com wrote:
>> User requests report-statistic Y, which is found on the last line of the 
>> report. We want to avoid reading, storing and parsing the entire file 
>> just to find Y, so we seek to the last line, parse Y out and return it. 
>> Yay, super fast.
> This was the bit I was missing, thanks; to be honest, I never knew we had 
> a get_result(Y) method, I thought we only had next_result() iterators.  Oh 
> wait, we don't, but you're proposing we should extend the API to offer 
> one?

It's subtle. There are no explicit methods defined at the SearchIO level, 
but currently you have to parse data (or not - we want to pull) to find 
out things that all result (or even hit, hsp) objects need. You may need 
to do some internal, optional parsing depending on the specific file 
format variation you discover you are parsing.

And then of course the idea is that this is nested, so the parser for 
the result data is a Bio::Search::Result::ResultI but also a pull-parser 
in its own right (and so on for HitI and HSPI) with a need for 
random-access to the various bits of data needed to answer all the 
various methods of ResultI.
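
To make the pull idea concrete, here is a rough sketch in Python rather 
Perl, purely as an illustration and not BioPerl code. The class name 
PullResult, the statistic() method and the "name=value" last-line format 
are all made up for the example. It assumes a seekable binary handle and 
parses nothing until a statistic is actually requested:

```python
import io


class PullResult:
    """Hypothetical lazy result object: nothing is parsed until asked for."""

    def __init__(self, handle):
        self._fh = handle   # assumed seekable and opened in binary mode
        self._cache = {}    # statistics parsed so far

    def _last_line(self, chunk=1024):
        # Walk backwards from the end until the whole final line is buffered.
        self._fh.seek(0, io.SEEK_END)
        pos = self._fh.tell()
        buf = b""
        while pos > 0:
            step = min(chunk, pos)
            pos -= step
            self._fh.seek(pos)
            buf = self._fh.read(step) + buf
            if b"\n" in buf.rstrip(b"\n"):
                break
        return buf.rstrip(b"\n").rsplit(b"\n", 1)[-1].decode()

    def statistic(self, name):
        # Parse the trailing "name=value" line only on first request
        # (the format is an assumption made for this sketch).
        if name not in self._cache:
            key, _, value = self._last_line().partition("=")
            self._cache[key.strip()] = value.strip()
        return self._cache.get(name)


report = io.BytesIO(b"header\nhit 1\nhit 2\nY=42\n")
result = PullResult(report)
print(result.statistic("Y"))
```

A hit or HSP object could follow the same pattern over its own slice of 
the file, which is where the need for random access comes from.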

> The reason I'm being so fussy about this is that a primary motivation for 
> a shockingly-fast parser is shockingly large datasets that we keep only as 
> compressed files, uncompressing them en route to the parser; thus your 
> simple "I'll just copy the stream to tempfile and proceed as normal" 
> solution is not so trivial.

Right, that's helpful. I'll keep that in mind.

> Here's a compromise: assume that users won't need random access to their 
> results, only sequential; also, provide a new parameter to the searchIO 
> constructor to specify the desired access mode as random; then, if the 
> input stream is not seekable (which is testable), you can perform your 
> memory/file caching.  If get_result(X) is called without the access mode 
> being set to random on an unseekable stream, throw an (informative) error.

I currently have a -piped_behaviour argument that accepts 'memory' or 
'temp_file'. How about a third (non-default) option of 'linear' to avoid 
any attempt at a seek and just use the data as it is piped? The trouble 
is that you'd need to virtually implement the methods of a parser module 
twice, once where the methods can seek, and once where they can't. Or 
maybe not; I'll have to try and see if some sane compromise 
implementation is possible.
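
As a sketch of how the three options might fit together (again in Python 
as a neutral illustration, not BioPerl code): prepare_input() and 
get_result() are hypothetical names, and the 'linear' branch simply marks 
the handle as unseekable so that a later random-access request fails with 
an informative error, as suggested above.

```python
import io
import shutil
import tempfile


def prepare_input(stream, piped_behaviour="memory"):
    """Return (handle, can_seek) for a binary input stream."""
    if stream.seekable():
        return stream, True                      # ordinary file: use as is
    if piped_behaviour == "memory":
        return io.BytesIO(stream.read()), True   # cache the pipe in memory
    if piped_behaviour == "temp_file":
        tmp = tempfile.TemporaryFile()
        shutil.copyfileobj(stream, tmp)          # cache the pipe on disk
        tmp.seek(0)
        return tmp, True
    if piped_behaviour == "linear":
        return stream, False                     # never attempt a seek
    raise ValueError("unknown piped_behaviour: %r" % piped_behaviour)


def get_result(handle, can_seek, offset):
    """Random access to the line starting at a known byte offset."""
    if not can_seek:
        raise RuntimeError(
            "random access requested on an unseekable stream; "
            "use piped_behaviour 'memory' or 'temp_file'"
        )
    handle.seek(offset)
    return handle.readline()
```

This only covers the dispatch; whether the parser methods themselves can 
share one implementation across the seeking and non-seeking cases is 
still the open question.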
