[Bioperl-l] bp_bulk_load_gff.pl speed
dustin.cram at gmail.com
Wed Jul 14 19:51:21 EDT 2004
Heh, I was sure I had to be missing something obvious too. Glad to see
someone else has noticed this.
I'll have to wait till I go in to work tomorrow to check exact
versions, but MySQL is 3.23.x, Perl is 5.8.x, and the OS is Red Hat 9.
On Wed, 14 Jul 2004 19:10:39 -0400, Aaron J. Mackey
<amackey at pcbi.upenn.edu> wrote:
> Aha, I'm *not* crazy! I've experienced exactly this same behavior (I
> ended up "solving" it by batch loading in blocks of 500, which
> worked fine until my database grew so large that the initial group
> loading got too slow).
> What's your mysql version, perl version (usemymalloc?), and OS? I
> think this is a perl hash/memory issue, but I'd love to solve it now
> that I know it's not just something stupid I'm doing wrong.
> On Jul 14, 2004, at 6:22 PM, Dustin Cram wrote:
> > I recently started using Bio::DB::GFF, beginning by using
> > bp_bulk_load_gff.pl to load a simple but large gff2 file. This file
> > consisted only of transcripts and their subfeatures, so the group
> > class of all features was "transcript". The files loaded with no
> > problem and I was able to write a few successful test scripts.
> > Now I have added new features (genes) to the gff file, and I
> > attempted to load the new file exactly as before with
> > bp_bulk_load_gff.pl, but now it takes _much_ longer to load, and takes
> > more time the more features are added (the first 5K features take
> > about 30 seconds, the next 5K features take nearly 2 minutes, and so
> > on). It took over an hour to load 50K features, at which point I
> > stopped it.
> > I've played around with the gff file a bit and found that anything
> > that doesn't have a group class of "transcript" has this problem, for
> > example if I 'sed s/transcript/foo/g' the original file it's slow,
> > and if I 'sed s/gene/transcript/g' the new file it's fast. I have
> > manually verified that the MySQL database is empty before each attempt
> > and even wiped the tmp directory before each attempt.
> > Any ideas why non-transcript features take so long?
> > Thanks,
> > Dustin Cram
> Aaron J. Mackey, Ph.D.
> Dept. of Biology, Goddard 212
> University of Pennsylvania email: amackey at pcbi.upenn.edu
> 415 S. University Avenue office: 215-898-1205
> Philadelphia, PA 19104-6017 fax: 215-746-6697
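
To see which group classes a GFF2 file contains before loading it, and to
repeat the sed experiment from the original message on a small scale,
something like the sketch below may help. It assumes tab-separated GFF2
with the group class as the first word of column 9. The function name
count_group_classes and the three sample lines are made up for
illustration; they are not part of BioPerl or of the files discussed in
this thread.

```shell
#!/bin/sh
# Hypothetical helper: tally the group classes in a GFF2 file.
# Assumes the class is the first whitespace-separated word of column 9.
count_group_classes() {
    awk -F'\t' '
        /^#/ || NF < 9 { next }            # skip comments and short lines
        { split($9, g, " "); n[g[1]]++ }   # g[1] is the group class
        END { for (c in n) print c, n[c] }
    ' "$1" | sort
}

# Made-up three-line sample: one gene plus two exons of one transcript.
sample=$(mktemp)
printf 'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene "g1"\n'        >  "$sample"
printf 'chr1\tsrc\texon\t1\t50\t.\t+\t.\ttranscript "t1"\n'   >> "$sample"
printf 'chr1\tsrc\texon\t60\t100\t.\t+\t.\ttranscript "t1"\n' >> "$sample"

# Class counts before the rename.
count_group_classes "$sample"

# Same substitution as in the thread. Note that it rewrites every
# occurrence of "gene" on the line, not only the one in column 9.
sed 's/gene/transcript/g' "$sample" > "$sample.renamed"

# Class counts after the rename.
count_group_classes "$sample.renamed"
```

Running the counts on the real file before and after a rename confirms
that the only thing changing between the fast and slow loads is the set of
group classes.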