[Bioperl-l] SearchIO speed up

Amir Karger akarger at CGR.Harvard.edu
Fri Aug 11 09:06:06 EDT 2006

Let me add my voice to the adulation here. IMO, the two main reasons
Bioperl hasn't achieved world domination are (a) it's so huge that it's
hard to find what you want, which the HOWTOs help with, and (b) it's so
darn slow. Speedup is most definitely a Good Thing, and I'm sure that
the vast majority of BLAST hits are ignored in the vast majority of
cases, where you're just looking for hits where some criterion meets a
certain threshold or something. It's unlikely that people want the full
alignment for all 100k or whatever hits. (This is why I just use blast
-m8: no parser required, and all you lose is the alignment.)
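
To illustrate the "no parser required" point, here is a minimal Python sketch of threshold filtering on tabular output. It assumes the standard 12 tab-separated columns of -m8 output; the function name `filter_m8`, the field labels in `M8_FIELDS`, and the default cutoff are made up for this example, not part of BLAST or Bioperl.

```python
# Labels for the 12 columns of BLAST -m8 tabular output (names are ours).
M8_FIELDS = (
    "query", "subject", "pct_identity", "aln_length", "mismatches",
    "gap_opens", "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score",
)


def filter_m8(lines, max_evalue=1e-5):
    """Yield hits (as dicts of strings) whose e-value is <= max_evalue.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Blank lines, comment lines and malformed rows are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != len(M8_FIELDS):
            continue  # not a 12-column hit row
        hit = dict(zip(M8_FIELDS, cols))
        # Only the field being tested is converted; the rest stay as text.
        if float(hit["evalue"]) <= max_evalue:
            yield hit
```

Because the function is a generator and converts only the one column it tests, a file with 100k hits is streamed line by line and nothing is built for the hits that get thrown away.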

Anyway, in your spare time, maybe you could do similar speedups for other
pieces of Bioperl? My personal favorite would be the GenBank/EMBL
parsers. The fungal genome ORF files I'm working with are only 20M or
so, but working with them in Bioperl takes so much longer than working
with the 6M FASTA files for other genomes without Bioperl. I have to
imagine most of the time goes into creating objects for the gazillion
tags, 90% of which I never peek at.

I know, you folks are busy, and I should be volunteering to do it
myself. But you can at least consider it a user request.

- Amir Karger
Research Computing
Bauer Center for Genomics Research
Harvard University

> -----Original Message-----
> From: aaron.j.mackey at gsk.com [mailto:aaron.j.mackey at gsk.com] 
> Sent: Thursday, August 10, 2006 1:40 PM
> To: Sendu Bala
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] SearchIO speed up
> > ...Except I need to know if the community considers the speed problem
> > solved or not. More radical changes will make SearchIO even faster, e.g.
> > Chris Fields and Jason (if I interpret the Project priority list item
> > correctly) have suggested an end to individual Hit and HSP objects,
> > which become just data members of a Result-like object. Ideally I don't
> > want to go down that route because we lose quite a bit of OO power;
>
> As already mentioned, a lazy-evaluation approach would also work.
> Jason and I did once talk about an entirely new parsing/object-building
> framework, based on nested grammars; in essence, the "top-level" parser
> simply "chunks" the input into blobs of (minimally parsed) text that
> correspond to the top-level result object.  This chunk/blob is the input
> to the next-level parser for Hits, which in turn has chunks for HSPs.
> Note that the Result/Hit/HSP "chunks" are "fat", i.e. they *are* the same
> Generic*I-implementing objects we're already using.  Thus, if HSPs are
> never interrogated, they're never parsed; as soon as one is interrogated,
> it gets parsed, and so on.  In such an environment, you can imagine
> flyweight objects that are built very quickly/easily (recall that many
> previous analyses of BioPerl speed problems found them related not so
> much to parsing as to heavy-weight object creation).
>
> I happen to have such a nested parser lying around for
> Bio::SearchIO::fasta.pm, but it also uses an Inline::C, yacc-generated C
> parser backend (yet another experiment in trying to get SearchIO to run
> faster), so it really isn't ready for prime time (being entirely
> untested, and probably not even finished).
>
> -Aaron
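
The nested, parse-on-demand idea described in the quoted message can be sketched roughly as follows. This is a toy illustration in Python, not Aaron's parser and not the SearchIO API: the report format (a ">" line per hit, one "key=value" line per HSP) and the names `LazyHit`, `LazyHSP`, `chunk_hits` and `parse_count` are all invented for the example.

```python
class LazyHSP:
    """Holds the raw text of one HSP; fields are parsed on first access."""

    parse_count = 0  # demo only: how many HSPs have actually been parsed

    def __init__(self, raw):
        self._raw = raw
        self._fields = None

    def _parse(self):
        if self._fields is None:
            LazyHSP.parse_count += 1
            # Toy format: whitespace-separated key=value pairs.
            self._fields = dict(item.split("=", 1) for item in self._raw.split())
        return self._fields

    @property
    def score(self):
        return float(self._parse()["score"])


class LazyHit:
    """Holds the raw text of one hit; HSP chunks are split off on demand."""

    def __init__(self, raw):
        self._raw = raw
        self._hsps = None

    @property
    def name(self):
        # Cheap: only the header line is examined.
        return self._raw.split("\n", 1)[0][1:].strip()

    @property
    def hsps(self):
        if self._hsps is None:
            body = self._raw.split("\n")[1:]
            self._hsps = [LazyHSP(line) for line in body if line.strip()]
        return self._hsps


def chunk_hits(text):
    """Top-level pass: cut the report into per-hit blobs without parsing them."""
    chunk = []
    for line in text.splitlines():
        if line.startswith(">"):
            if chunk:
                yield LazyHit("\n".join(chunk))
            chunk = [line]
        elif chunk:
            chunk.append(line)
    if chunk:
        yield LazyHit("\n".join(chunk))
```

Each level does only enough work to find the boundaries of the next level's blobs, so a caller that reads hit names and never touches `hsps` pays nothing for HSP parsing.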
