[Bioperl-l] Bio::Assembly bug/feature?

Chris Fields cjfields at uiuc.edu
Mon Jul 23 11:41:35 EDT 2007


To all:

I think I have found a major problem with Bio::Assembly; this was  
first noticed on Mac OS X in relation to bug 2320 and  
Bio::Assembly::IO.  I am uncertain whether this is meant to be a  
feature or a bug but it certainly needs to be documented or fixed as  
it leads to subtle errors.  I also can't see the advantage of this  
approach, but maybe I can be enlightened?  Either way, I think it's  
worth a discussion for those willing to follow.  I'll file it as a
bug later if needed.

A bit of background: each instance of a Bio::Assembly::Contig has a  
Bio::SeqFeature::Collection instance attached to it; each  
Bio::SeqFeature::Collection itself has a tied DB_File handle attached  
which remains open during the lifetime of the Bio::SF::Collection  
object.  When using Bio::Assembly one adds the various Contig objects  
to a Bio::Assembly::Scaffold.  So, for instance, if one had ~1000  
Contigs in a Scaffold, one would also have ~1000 open tied db  
handles, one per Contig instance.  So far, so good.

Unfortunately, when adding a ton of Contig objects to a  
Bio::Assembly::Scaffold one can run into a host of system-dependent  
issues based on resource usage limits (as one might expect).  This  
script:

------------------------------
use strict;
use warnings;
use Bio::Assembly::Scaffold;
use Bio::Assembly::Contig;
use Bio::SeqFeature::Generic;

my $scaffold = Bio::Assembly::Scaffold->new();

# Each new Contig carries its own SF::Collection, and so its own tied handle
for my $id (1..15000) {
    print "Contig #$id\n";
    my $contig = Bio::Assembly::Contig->new(-id => $id);
    my $feat = Bio::SeqFeature::Generic->new(-start  => 1,
                                             -end    => 10,
                                             -strand => 1);
    $contig->add_features([$feat]);
    $scaffold->add_contig($contig);
}
------------------------------

may fail on Mac OS X once the process reaches the maximum number of
open file descriptors (on UNIX'y systems, this is 'ulimit -n'); the
call to tie the DB_File handle in SF::Collection fails silently, so
the first later attempt to use the handle gives the following:

...
Contig #251
Contig #252
Contig #253
Contig #254
Can't call method "put" on an undefined value at /Users/cjfields/src/bioperl-live/Bio/SeqFeature/Collection.pm line 225.
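
For anyone following along, here is a minimal sketch of what checking the
tie looks like.  This is not the actual Collection.pm code: open_btree()
and the file names are made up for illustration.  The point is only that
tie returns a false value on failure, so the error can be reported where
it happens instead of surfacing later as a call on an undefined value.

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);

# Hypothetical helper (not part of BioPerl): tie a BTREE index and die
# right away if the tie fails, e.g. because no file descriptors are left.
sub open_btree {
    my ($file) = @_;
    my %index;
    my $db = tie %index, 'DB_File', $file, O_RDWR | O_CREAT, 0640, $DB_BTREE
        or die "Could not tie DB_File index '$file': $!\n";
    return ($db, \%index);
}

my $dir = tempdir(CLEANUP => 1);
my ($db, $index) = open_btree("$dir/index.db");

# put() is safe to call here because $db is known to be defined
$db->put('contig1', 'feature data');
print "stored: $index->{contig1}\n";
```

With a check like this, the failure above would carry the system error
text from $! rather than a message about "put".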

I have added an exception to catch this.  On Mac OS X you can  
increase the file descriptor limit using ulimit, at least to a  
certain point.  However, when testing this out on dev.open-bio.org  
(Linux) the 'tie' sometimes fails (and the exception pops up), but it  
isn't dependent on 'ulimit -n'.  This is what happens more often:

...
Contig #10567
Contig #10568
Contig #10569
Contig #10570
Out of memory!

Sometimes followed by a seg fault.  Ick!

Any ideas? For instance, should we set this up so that one  
SF::Collection is used for all the Contigs (since each one has a  
unique ID anyway)?  Leave as is and document/track the issue as a  
bug?  Both?
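
To make the first option concrete, here is a rough sketch of the idea
using a bare DB_File hash rather than SF::Collection itself.  The key
layout ("contig ID:feature number") and the string values are assumptions
made up for the example; a real change would have to store the feature
objects the way Collection already does.  It only shows that one tied
handle can serve many contigs when keys are prefixed with the contig ID.

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# One tied BTREE for the whole scaffold: a single open handle in total
my %store;
tie %store, 'DB_File', "$dir/features.db", O_RDWR | O_CREAT, 0640, $DB_BTREE
    or die "Could not tie shared index: $!\n";

# Hypothetical key layout: "<contig id>:<feature number>"
for my $id (1 .. 15000) {
    $store{"$id:1"} = "1..10:1";    # start..end:strand, as plain text
}

# Features of a single contig are found again through the key prefix
my $total = keys %store;
my $count = grep { /^42:/ } keys %store;
print "contigs stored: $total, features for contig 42: $count\n";

untie %store;
```

The loop covers the same 15000 contigs as the script above, but the
number of open handles no longer grows with the number of contigs.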

chris

