[Bioperl-l] bp_bulk_load_gff.pl speed

Dustin Cram dustin.cram at gmail.com
Thu Jul 15 17:30:23 EDT 2004

Well, I think I've traced my problem to a bug in
Bio::DB::GFF->_split_gff2_group that only existed for a while in CVS.
I had assumed that release 1.4 was installed at our site, but it turns
out that it was a CVS checkout from shortly after the 1.4 release.  The
revision of Bio::DB::GFF.pm with the problem is 1.105 (maybe others too).

It looks to me like $self->preferred_groups is being appended to with
("Sequence","Transcript") on every call of the method, so as time goes
by the array gets huge, with just those elements repeated over and
over.  That is why only my non-transcript features had problems - a
transcript matches near the front of the array, but for anything else
the entire, ever-growing array was searched unsuccessfully for each
feature, so each feature took longer than the one before it.
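
To make the effect concrete, here is a small toy model of what I think was
happening.  It is only a sketch in Python, not the actual Bio::DB::GFF code:
the class and function names are made up for illustration, and the
case-insensitive match is an assumption on my part (based on lowercase
"transcript" being fast).  It counts list comparisons rather than measuring
time.

```python
class GroupSplitter:
    """Toy model of the behaviour described above; not the BioPerl code."""

    def __init__(self, buggy):
        self.buggy = buggy
        self.preferred_groups = []
        self.comparisons = 0

    def split_group(self, group_class):
        # Buggy variant: the defaults are appended on every call.
        # Fixed variant: the defaults are added only once.
        if self.buggy or not self.preferred_groups:
            self.preferred_groups += ["Sequence", "Transcript"]
        wanted = group_class.lower()  # assumed case-insensitive match
        for rank, name in enumerate(self.preferred_groups):
            self.comparisons += 1
            if name.lower() == wanted:
                return rank
        return None  # no preferred group matched


def count_comparisons(buggy, group_class, n_features):
    """Total comparisons made while processing n_features features."""
    splitter = GroupSplitter(buggy)
    for _ in range(n_features):
        splitter.split_group(group_class)
    return splitter.comparisons


if __name__ == "__main__":
    for buggy in (True, False):
        for group_class in ("transcript", "gene"):
            total = count_comparisons(buggy, group_class, 1000)
            print(buggy, group_class, total)
```

In the buggy variant a "transcript" feature always matches at the second
element, so the cost per feature stays constant, while a "gene" feature
walks the whole list, which is two elements longer on every call, so the
total work grows quadratically with the number of features.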

I've grabbed the latest CVS and it seems to work fine.  Although I
haven't tried the 1.4 release, I think it should work too.  If this isn't
the problem for other folk, then I guess they're still just crazy :).



On Thu, 15 Jul 2004 16:36:36 -0400, Scott Cain <cain at cshl.edu> wrote:
> Dustin,
> Besides Aaron, a few other people have complained about this, and yes, I
> had written them off as crazy :-)
> Since I can't reproduce this problem, I'll have to ask you: is the
> problem that the files are not being written to /usr/tmp (or wherever)
> as quickly as before, or is it that, after the files are done being
> written, they aren't loaded into mysql as quickly?  Not that I have a
> solution to either problem, but the first is presumably a perl problem
> and the second a mysql problem.  If it were the latter (which I kind of
> doubt), you could get around it by using a real database, like
> PostgreSQL.
> Scott
> On Thu, 2004-07-15 at 13:45, bioperl-l-request at portal.open-bio.org
> wrote:
> >
> > I recently started using Bio::DB::GFF, beginning by using
> > bp_bulk_load_gff.pl to load a simple but large gff2 file.  This file
> > consisted only of transcripts and their subfeatures, so the group
> > class of all features was "transcript".  The files loaded with no
> > problem and I was able to write a few successful test scripts.
> >
> > Now I have added new features (genes) to the gff file, and I
> > attempted to load the new file exactly as before with
> > bp_bulk_load_gff.pl, but now it takes _much_ longer to load, and takes
> > more time the more features are added (the first 5K features take
> > about 30 seconds, the next 5K features take nearly 2 minutes, and so
> > on).  It took over an hour to load 50K features, at which point I stopped
> > it.
> >
> > I've played around with the gff file a bit and found that anything
> > that doesn't have a  group class of "transcript" has this problem, for
> > example if I 'sed s/transcript/foo/g'  the original file it's slow,
> > and if I 'sed s/gene/transcript/g' the new file it's fast.  I have
> > manually verified that the MySQL database is empty before each attempt
> > and even wiped the tmp directory before each attempt.
> >
> > Any ideas why non-transcript features take so long?
> >
> > Thanks,
> >
> > Dustin Cram
> --
> ------------------------------------------------------------------------
> Scott Cain, Ph. D.                                         cain at cshl.org
> GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
> Cold Spring Harbor Laboratory
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
