[Bioperl-l] Next-gen modules

Chris Fields cjfields at illinois.edu
Wed Jun 17 14:40:05 EDT 2009

On Jun 17, 2009, at 1:09 PM, Tristan Lefebure wrote:

> Thanks both for the light.
> That probably means that the place bioperl will take in the
> handling of the next-gen sequencing raw data (i.e. reads) is
> very limited, nope? (at least until bioperl6). A single GA2
> solexa lane generates about 9 million reads, and I would
> really not call that a big project...

I don't think it's impossible.  If you parse any very long list of  
sequences in order it will be very slow, yes, but if they were indexed  
or loaded into a DB, lookups would of course be orders of magnitude  
faster.
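
To make the indexing idea concrete, here is a rough sketch in Python. It is
not how Bio::Index::Fastq is implemented; the function names and the tiny
in-memory input are made up for illustration, and it assumes strict 4-line
FASTQ records with no wrapped sequence lines. One pass records the byte
offset of each record, and later lookups seek straight to that offset
instead of re-reading the file.

```python
import io


def build_fastq_index(handle):
    """Map each read ID to the byte offset of its '@' header line.

    Assumes strict 4-line FASTQ records (hypothetical helper, not a
    BioPerl API).
    """
    index = {}
    while True:
        offset = handle.tell()
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        read_id = header[1:].split()[0].decode("ascii")
        index[read_id] = offset
        for _ in range(3):  # skip sequence, '+' line and qualities
            handle.readline()
    return index


def fetch_read(handle, index, read_id):
    """Seek directly to one record instead of scanning the whole file."""
    handle.seek(index[read_id])
    header, seq, _plus, qual = (
        handle.readline().rstrip(b"\n") for _ in range(4)
    )
    return header[1:].decode("ascii"), seq.decode("ascii"), qual.decode("ascii")


# Tiny in-memory stand-in for a real FASTQ file opened in binary mode.
data = io.BytesIO(b"@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\n!!!!\n")
idx = build_fastq_index(data)
record = fetch_read(data, idx, "r2")
```

With millions of reads the index itself would need to live on disk (a DBM
file or a small database) rather than in a dict, but the lookup pattern is
the same.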

We already have Perl-based indexing for fastq (Bio::Index::Fastq), so  
maybe something could be built on top of that. I haven't looked, but we  
could also wrap C/C++-based parsers. BioLib, for instance, has  
bindings to io_lib, so maybe that could be (ab)used in some way.

> BTW, is there a simple way to see object instantiation and
> inheritance, as well as time consumption for each, when one
> calls next_seq() (or any other method)?
> -Tristan

As a simple benchmark, at one point all feature tag information was  
converted into Bio::Annotations.  I reverted that behavior to be  
simple tag/value again and had a pretty decent bump:


Also, I tried reimplementing some parsers as generic 'event'-based  
driver/handler pairs and they were slightly faster, the key roadblock  
being instantiation again.  If I didn't create Features/Annotations I  
saw a significant speedup.  That's not entirely unexpected, as  
SeqFeatures also contain Locations (which can in turn contain  
sub-Locations) and (until recently) tag-based Bio::Annotation objects  
by default.  Annotations are collected in an Annotation::Collection  
and can, I believe, contain other objects (Ontology terms, etc).
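
The driver/handler split described above can be sketched roughly as follows.
This is a Python illustration only, not the BioPerl parser code: the
tab-separated input format, `drive`, `CountingHandler`, `BuildingHandler` and
`Feature` are all invented for the example. The driver only tokenizes and
emits events; whether any objects get built is entirely up to the handler.

```python
def drive(lines, handler):
    """Driver: split 'key<TAB>value' lines and emit one event per line."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        handler(key, value)


class CountingHandler:
    """Cheap handler: tallies events and builds no per-feature objects."""

    def __init__(self):
        self.counts = {}

    def __call__(self, key, value):
        self.counts[key] = self.counts.get(key, 0) + 1


class Feature:
    """Stand-in for a heavy feature object (hypothetical)."""

    def __init__(self, kind, location):
        self.kind = kind
        self.location = location


class BuildingHandler:
    """Expensive handler: instantiates one Feature per 'feature' event."""

    def __init__(self):
        self.features = []

    def __call__(self, key, value):
        if key == "feature":
            kind, location = value.split(" ", 1)
            self.features.append(Feature(kind, location))


lines = ["seq\tACGT", "feature\tgene 1..4", "feature\tCDS 2..4"]

counter = CountingHandler()
drive(lines, counter)

builder = BuildingHandler()
drive(lines, builder)
```

The same driver serves both handlers, so a caller who only needs counts or
raw sequence never pays for feature objects.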

The overall lesson is, if you don't have very heavy objects being  
created the overhead is actually quite small; it's only when you  
greedily instantiate everything that you run into problems.
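
One way to avoid greedy instantiation is to keep the raw tag/value pairs and
build the heavier objects only when someone asks for them. A minimal Python
sketch of that pattern, with made-up class names (`Annotation`,
`LazyFeature`) that do not correspond to BioPerl classes, and assuming
Python 3.8+ for `functools.cached_property`:

```python
from functools import cached_property


class Annotation:
    """Stand-in for a heavy annotation object (hypothetical)."""

    created = 0  # class-wide counter so the instantiation cost is visible

    def __init__(self, tag, value):
        Annotation.created += 1
        self.tag = tag
        self.value = value


class LazyFeature:
    """Holds cheap (tag, value) pairs; builds Annotations on demand."""

    def __init__(self, raw_tags):
        self.raw_tags = raw_tags

    @cached_property
    def annotations(self):
        # Runs on first access only; the result is cached on the instance.
        return [Annotation(tag, value) for tag, value in self.raw_tags]


features = [
    LazyFeature([("gene", f"g{i}"), ("note", "demo")]) for i in range(1000)
]
created_before = Annotation.created  # no Annotation objects built yet
first = features[0].annotations      # pay only for the feature we inspect
created_after = Annotation.created
```

Here a thousand features are parsed, but only the two annotations of the one
feature actually inspected are ever instantiated.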

