[Bioperl-l] SearchIO speed up
cjfields at uiuc.edu
Fri Aug 11 12:33:52 EDT 2006
If we go the route of flexibility (so one could use full-blown objects,
hashes, lazy parsing, etc.), maybe we should start by having custom Result*,
Hit*, and HSP* Bio::Search objects returned via the Handler. This
would allow you to commit everything and get people testing it on various
OSes. You could also develop a custom handler, but that isn't absolutely
necessary (see below).
The various Handlers are apparently set up to let one create a
custom Factory for each Search object type (such as BLAST*). These are
added to the Handler upon instantiation or by using register_factory(). The
modified Handler can then be attached using SearchIO's attach_EventHandler().
So I guess one could do something like this:
use Bio::SearchIO;
use Bio::Factory::ObjectFactory;
use Bio::SearchIO::SearchResultEventBuilder;

# The Lazy* classes and the 'lazyblast' format are hypothetical names
# for the proposed lazy objects and parser.
my $resfac = Bio::Factory::ObjectFactory->new(
    -type      => 'Bio::Search::Result::LazyResult',
    -interface => 'Bio::Search::Result::ResultI');
my $hitfac = Bio::Factory::ObjectFactory->new(
    -type      => 'Bio::Search::Hit::LazyHit',
    -interface => 'Bio::Search::Hit::HitI');
my $hspfac = Bio::Factory::ObjectFactory->new(
    -type      => 'Bio::Search::HSP::LazyHSP',
    -interface => 'Bio::Search::HSP::HSPI');
my $handler = Bio::SearchIO::SearchResultEventBuilder->new(
    -result_factory => $resfac,
    -hit_factory    => $hitfac,
    -hsp_factory    => $hspfac);
my $parser = Bio::SearchIO->new(-file   => $file,
                                -format => 'lazyblast');
# swap in the modified handler before parsing
$parser->attach_EventHandler($handler);
# proceed with parsing...
Of course I haven't tried this out... ;>
It would be nice to have a parameter that lets one pass in a modified handler
when the SearchIO object is instantiated. Oh well...
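To make the "lazy" part concrete, here is a minimal sketch in plain Perl with
no BioPerl dependency. Everything in it is an assumption for illustration:
the package name My::LazyHSP, the accessors, and the simplified
"Score = ... bits, Expect = ..." line it looks for. It is not an existing
BioPerl class and does not implement HSPI; it only shows the idea of holding
the raw text and parsing it the first time a value is asked for.

```perl
use strict;
use warnings;

package My::LazyHSP;    # hypothetical class, not part of BioPerl

# Keep only the raw report text; nothing is parsed at construction time.
sub new {
    my ($class, $raw) = @_;
    return bless { raw => $raw, parsed => undef, parse_count => 0 }, $class;
}

# Parse on first use and cache the result for later calls.
sub _parse {
    my ($self) = @_;
    return $self->{parsed} if $self->{parsed};
    $self->{parse_count}++;
    my %data;
    ($data{bits})   = $self->{raw} =~ /Score\s*=\s*([\d.]+)\s*bits/;
    ($data{evalue}) = $self->{raw} =~ /Expect\s*=\s*([\d.eE+-]+)/;
    return $self->{parsed} = \%data;
}

# Accessors trigger the parse only when they are actually called.
sub bits   { $_[0]->_parse->{bits} }
sub evalue { $_[0]->_parse->{evalue} }

package main;

my $hsp = My::LazyHSP->new(" Score = 87.4 bits (215), Expect = 2e-18\n");
print "bits: ", $hsp->bits, " evalue: ", $hsp->evalue, "\n";
```

A factory configured with such a class would hand back cheap objects, and the
parsing cost would only be paid for the hits and HSPs someone actually inspects.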
Most users neither know about nor use the various handlers or the Search
objects, which is a shame. Maybe the HOWTO needs to be written more
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Sendu Bala
> Sent: Thursday, August 10, 2006 5:29 PM
> To: bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] SearchIO speed up
> aaron.j.mackey at gsk.com wrote:
> >> As I understand your description, this is exactly what I do. My "chunks"
> >> are the hashes that are normally used to create a new Hit/HSP object.
> >> The initial parse of the data file results in a small number of objects
> >> (Results) that contain all the data: HSP data nested in Hit data nested
> >> in the Result objects. When you actually want to do something with a
> >> certain hit or HSP it becomes an object, allowing you to call its
> >> methods like normal.
> >> Or are you suggesting something that would be even better than that? If
> >> so, please elucidate! :)
> > So the only laziness you invoke is the object instantiation (but you've
> > already done all the parsing).
> > My proposal involves the "chunks" being unparsed, raw text "blobs", that
> > are essentially blessed into a package that does the parsing only when
> > necessary (and even then, might choose different parsing strategies, depending
> > on what's been asked for). Thus a potentially large amount of parsing and
> > storage is skipped. Additionally, you now have the option of not even
> > storing the blobs in memory, just file seek pointers (requiring temp.
> > storage for streaming pipe data sources), and thus can process very large
> > reports without consuming memory (currently a problem).
> Thanks, I might try out something along those lines. The problem I see
> is with piped input; I wouldn't want to require temp. storage because
> the user may deliberately be trying to gain speed by doing as little
> disc io as possible. Then you'd have to special-case it; pointers if we
> have a file on disc, stored-in-memory if piped. Maybe that special-case
> wouldn't be so bad.
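
The special case discussed above (file offsets when the report is a file on
disc, in-memory text when it is piped) could look roughly like the sketch
below. The function names index_chunks and chunk_text, the chunk hash layout,
and the use of a /^>/ line to mark the start of a chunk are all assumptions
made for the example, not anything that exists in SearchIO. It also assumes a
byte-oriented filehandle with no CRLF or encoding layers, so that line lengths
match file offsets.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: split a report into chunks that start at lines
# matching $start_re. For a regular file only offset/length pairs are kept;
# for anything else (e.g. a pipe) the chunk text itself is kept in memory.
sub index_chunks {
    my ($fh, $start_re) = @_;
    my $seekable = (-f $fh) ? 1 : 0;    # regular file vs. pipe or similar
    my (@chunks, $current);
    while (1) {
        my $pos  = $seekable ? tell($fh) : undef;
        my $line = <$fh>;
        last unless defined $line;
        if ($line =~ $start_re) {
            $current = $seekable ? { offset => $pos, length => 0 }
                                 : { text => '' };
            push @chunks, $current;
        }
        next unless $current;           # skip anything before the first chunk
        if ($seekable) { $current->{length} += length $line }
        else           { $current->{text}   .= $line }
    }
    return \@chunks;
}

# Fetch the raw text of one chunk, seeking back into the file if needed.
sub chunk_text {
    my ($fh, $chunk) = @_;
    return $chunk->{text} if exists $chunk->{text};
    seek($fh, $chunk->{offset}, 0) or die "seek failed: $!";
    read($fh, my $buf, $chunk->{length});
    return $buf;
}

# Usage with a file on disc: only offsets are held until a chunk is wanted.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "header\n>hit1\nHSP a\n>hit2\nHSP b\nHSP c\n";
close $out or die "close failed: $!";

open my $in, '<', $path or die "open failed: $!";
my $chunks = index_chunks($in, qr/^>/);
print chunk_text($in, $chunks->[1]);
close $in;
```

The caller never needs to know which storage was used, since chunk_text hides
the difference; the piped case simply costs memory instead of disc reads.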