[Bioperl-l] Re: [Bioclusters] BioPerl and memory handling

Jason Stajich jason.stajich at duke.edu
Tue Nov 30 08:46:19 EST 2004

That's true - it does create a lot of objects for all the components of 
the report.  When you have 2000 hits it needs to build quite a few 
objects, and it builds them all for a single result.  Steve had a lazy 
parser implementation in at one point, but that was more for speed when 
you didn't want to actually see the HSP details for every hit.

I second Ian's comment: I use the tabular output from BLAST when 
dealing with large datasets.  SearchIO is intended to give you access 
to all the data in the report, so there is an overhead in that.
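
As a rough sketch of the tabular approach (not code from this thread, 
and not part of BioPerl): assuming the standard 12-column tabular BLAST 
output (blastall -m 8, or -m 9 with '#' comment lines), the file can be 
read one line at a time so that memory use stays flat no matter how 
many hits a query has.  The names each_tabular_hsp and summarize are 
made up for this illustration.

```perl
use strict;
use warnings;

# Column order of the standard 12-column tabular BLAST output.
my @COLUMNS = qw(query subject pct_identity aln_length mismatches gap_opens
                 q_start q_end s_start s_end evalue bit_score);

# Read tabular BLAST lines from a filehandle one at a time and call
# $callback with a hash ref for each HSP line.  Nothing is kept between
# lines, so memory use does not grow with the number of hits.
sub each_tabular_hsp {
    my ($fh, $callback) = @_;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*(?:#|$)/;    # skip comment and blank lines
        my @fields = split /\t/, $line;
        next unless @fields == @COLUMNS;   # skip malformed lines
        my %hsp;
        @hsp{@COLUMNS} = @fields;
        $callback->(\%hsp);
    }
    return;
}

# Hypothetical example: count HSP lines per query and remember only the
# best bit score seen for each query.
sub summarize {
    my ($fh) = @_;
    my (%count, %best);
    each_tabular_hsp($fh, sub {
        my ($hsp) = @_;
        my $q = $hsp->{query};
        $count{$q}++;
        $best{$q} = $hsp->{bit_score}
            if !defined $best{$q} || $hsp->{bit_score} > $best{$q};
    });
    return (\%count, \%best);
}

# Usage (the file name is a placeholder):
#   open my $fh, '<', 'report.tab' or die "Cannot open report: $!";
#   my ($count, $best) = summarize($fh);
```

Only the per-query summary is retained here; anything that needs the 
alignments themselves still has to go back to the full report.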

There are a couple of workarounds depending on what kind of data you 
want. We designed SearchIO to be a modular system which separates 
parsing the data from instantiating objects by throwing events (like 
SAX) and having a listener build objects from these events.  One can 
instantiate a different listener which builds simpler objects or throws 
away the data you don't want.  At some point I hope we can build some 
light-weight Result/Hit/HSP objects and a listener which creates these 
instead of full-fledged bioperl objects.  You can build your own 
listener object - SearchResultEventBuilder and FastHitEventBuilder are 
2 implementations and you can specify the type of Result/Hit/HSP 
objects that are created by the listeners.  It might be easiest to 
create some lightweight Hit and HSP objects and have  
SearchResultEventBuilder create these instead of the default 
full-fledged ones.  At some point though, if you are getting 5-10k hits 
I don't think the parser is going to play nice as it wasn't really 
engineered with this extreme case in mind.
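
To make the parser/listener split concrete, here is a toy sketch of the 
pattern in plain Perl.  It is not the SearchIO API: the input format 
("HIT name" / "HSP score" lines), the class names and the event method 
names are all invented for illustration.  The point is only that the 
same parser can drive a listener that keeps everything or one that 
drops the HSP data.

```perl
use strict;
use warnings;

# Toy parser: emits events to a listener and builds no objects itself.
package ToyParser;

sub new {
    my ($class, $listener) = @_;
    return bless { listener => $listener }, $class;
}

sub parse {
    my ($self, $fh) = @_;
    my $listener = $self->{listener};
    my $in_hit   = 0;
    $listener->start_result;
    while (my $line = <$fh>) {
        chomp $line;
        if (my ($name) = $line =~ /^HIT\s+(\S+)/) {
            $listener->end_hit if $in_hit;    # close the previous hit
            $listener->start_hit($name);
            $in_hit = 1;
        }
        elsif ($in_hit and my ($score) = $line =~ /^HSP\s+(\S+)/) {
            $listener->hsp($score);
        }
    }
    $listener->end_hit if $in_hit;
    return $listener->end_result;             # whatever the listener built
}

# Listener that keeps every HSP score for every hit.
package FullListener;

sub new {
    my ($class) = @_;
    return bless { hits => [] }, $class;
}
sub start_result { my ($self) = @_; $self->{hits} = []; return }
sub start_hit {
    my ($self, $name) = @_;
    $self->{cur} = { name => $name, hsps => [] };
    return;
}
sub hsp     { my ($self, $score) = @_; push @{ $self->{cur}{hsps} }, $score; return }
sub end_hit { my ($self) = @_; push @{ $self->{hits} }, delete $self->{cur}; return }
sub end_result { my ($self) = @_; return $self->{hits} }

# Listener that records hit names only and throws HSP events away.
package HitOnlyListener;
our @ISA = ('FullListener');

sub start_hit {
    my ($self, $name) = @_;
    $self->{cur} = { name => $name };
    return;
}
sub hsp { return }                            # discard HSP data

package main;

# Usage (the file name is a placeholder):
#   open my $fh, '<', 'report.txt' or die "Cannot open report: $!";
#   my $hits = ToyParser->new(HitOnlyListener->new)->parse($fh);
```

With HitOnlyListener the per-hit cost is a single small hash, which is 
the kind of saving a lightweight listener is meant to give.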

Now the whole parser/listener design assumes that you want to process 
all the data for a result before moving on to the next one - at least 
from the listener's standpoint this means you have to store all the 
data you just got from the parser - whether this is in memory, or 
potentially stored in a tempfile/temp dbfile would be up to the 

Here is an example of how you can provide a different listener - 
FastHitEventBuilder just throws away the HSPs and only builds Result 
and Hit objects.

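  use Bio::SearchIO;
  use Bio::SearchIO::FastHitEventBuilder;

  # $format (e.g. 'blast') and $file are set by the caller
  my $searchio = Bio::SearchIO->new(-format => $format, -file => $file);

  # replace the default listener with one that skips HSPs
  $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);

  while( my $r = $searchio->next_result ) {
    while( my $h = $r->next_hit ) {
      # note that Hits will NOT have HSPs
    }
  }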

On Nov 30, 2004, at 5:59 AM, Michael Maibaum wrote:

> On Tue, Nov 30, 2004 at 01:24:24AM -0800, Steve Chervitz wrote:
>> Regarding SearchIO memory usage, I don't think this has been an issue
>> before, so I wonder if there is something about the installation or
>> specific usage of it that is leading to memory hogging. I've run it
>> over large numbers of reports without noticing troubles. It would be
>> useful to see a sample report + script using SearchIO that leads to
>> the memory troubles, so we can try to reproduce it.
> FWIW - I at least didn't have a problem parsing many thousands of 
> results in a stream with SearchIO - I had a problem with parsing 
> certain specific result sets. Essentially anything with about 2000 
> hits and alignments (or more) for a single query would kill a Linux 
> box with 1 gig of RAM (it would thrash VM to death). These would run 
> on an Opteron 16 gig box and used >8 gig of RAM in some cases.
> As far as I can see the majority of the memory was then returned when 
> BioPerl moved on to the next record. The issue is that it takes a 
> rather large amount of RAM for an individual record and I assumed 
> (rightly or wrongly) that BioPerl slurps up the entire record and 
> builds the objects representing it as a whole, hence the large RAM 
> usage. It may be that the objects to represent 2000+ hits are just 
> very (unreasonably?) large.
> Michael
Jason Stajich
jason.stajich at duke.edu
