[Bioperl-l] Next-gen modules
cjfields at illinois.edu
Wed Jun 17 14:40:05 EDT 2009
On Jun 17, 2009, at 1:09 PM, Tristan Lefebure wrote:
> Thanks both for the light.
> That probably means that the place bioperl will take in the
> handling of the next-gen sequencing raw data (i.e. reads) is
> very limited, nope? (at least until bioperl6). A single GA2
> solexa lane generates about 9 million reads, and I would
> really not call that a big project...
I don't think it's impossible. Parsing any very long list of sequences
in order will be very slow, yes, but if the reads were indexed or loaded
into a DB, lookups would of course be orders of magnitude faster.
We already have perl-based indexing for fastq (Bio::Index::Fastq), so
maybe something could be built on top of that. I haven't looked, but we
could also wrap C/C++-based parsers. BioLib, for instance,
has bindings to io_lib, so maybe that could be (ab)used in some way.
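To illustrate why an index helps, here is a minimal sketch of byte-offset indexing. It is written in Python purely as a language-neutral illustration and is not how Bio::Index::Fastq is implemented. The function names (build_fastq_index, fetch_record) and the sample reads are made up, and it assumes strict 4-line FASTQ records with no wrapped sequence lines.

```python
import io


def build_fastq_index(handle):
    """Map each read ID to the byte offset where its record starts.

    Assumes strict 4-line FASTQ records (no wrapped sequence lines).
    """
    index = {}
    while True:
        offset = handle.tell()
        header = handle.readline()
        if not header:
            break  # end of file
        # Skip the sequence, '+' separator, and quality lines.
        for _ in range(3):
            handle.readline()
        # The ID is the first whitespace-delimited token after '@'.
        read_id = header[1:].split()[0].decode("ascii")
        index[read_id] = offset
    return index


def fetch_record(handle, index, read_id):
    """Seek straight to one record instead of scanning the whole file."""
    handle.seek(index[read_id])
    return [handle.readline().rstrip(b"\n").decode("ascii") for _ in range(4)]


# Tiny in-memory stand-in for a FASTQ file (made-up reads).
data = b"@read1\nACGT\n+\nIIII\n@read2 extra\nTTGA\n+\nHHHH\n"
fh = io.BytesIO(data)
idx = build_fastq_index(fh)
record = fetch_record(fh, idx, "read2")
```

The file is scanned once to build the index; after that each lookup costs one seek plus four line reads, regardless of how many reads the file holds.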
> BTW, is there a simple way to see object instantiation and
> inheritance, as well as time consumption for each, when one
> calls next_seq() (or any other method)?
As a simple benchmark, at one point all feature tag information was
converted into Bio::Annotations. I reverted that behavior to simple
tag/value pairs again and got a pretty decent speedup:
Also, I tried reimplementing some parsers as a generic 'event'-based
driver/handler design and they were slightly faster, the key roadblock
again being object instantiation. If I didn't create Features/Annotations I saw a
significant speedup. That's not entirely unexpected, as SeqFeatures
also contain Locations (which in turn can contain subLocations) and
(until recently) tag-based Bio::Annotation by default. Annotations
are collected in an Annotation::Collection and can contain other
objects I believe (Ontology terms, etc).
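A rough sketch of the driver/handler split, in Python as a language-neutral illustration. This is not the BioPerl code: the line format, the Feature class and both handler classes are invented for the example. The driver only tokenizes and reports events; whether heavy objects get built is entirely up to the handler.

```python
class Feature:
    """Stand-in for a heavy feature object (hypothetical, not Bio::SeqFeature)."""

    def __init__(self, kind, start, end, tags):
        self.kind = kind
        self.start = start
        self.end = end
        self.tags = tags


def parse(lines, handler):
    """Driver: tokenizes each line and reports it as an event.

    It never builds Feature objects itself; that is the handler's choice.
    """
    for line in lines:
        kind, start, end, *tag_fields = line.split()
        tags = dict(field.split("=", 1) for field in tag_fields)
        handler.on_feature(kind, int(start), int(end), tags)


class ObjectBuildingHandler:
    """Builds one Feature per event (the expensive path)."""

    def __init__(self):
        self.features = []

    def on_feature(self, kind, start, end, tags):
        self.features.append(Feature(kind, start, end, tags))


class CountingHandler:
    """Only tallies feature types; creates no Feature objects."""

    def __init__(self):
        self.counts = {}

    def on_feature(self, kind, start, end, tags):
        self.counts[kind] = self.counts.get(kind, 0) + 1


# Made-up input in an invented "kind start end tag=value ..." format.
lines = [
    "gene 1 900 name=abc",
    "CDS 10 800 product=xyz note=demo",
    "gene 950 1200 name=def",
]

builder = ObjectBuildingHandler()
parse(lines, builder)

counter = CountingHandler()
parse(lines, counter)
```

Both runs use the same driver; only the second avoids per-feature object construction, which is where the speedup described above would come from.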
The overall lesson is, if you don't have very heavy objects being
created the overhead is actually quite small; it's only when you
greedily instantiate everything that you run into problems.
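One way to avoid greedy instantiation is to keep the cheap raw data and build the heavy objects only on first access. A minimal sketch, again in Python as a language-neutral illustration; Location and LazyRecord are hypothetical classes for this example, not BioPerl ones, and the construction counter exists only to make the saving visible.

```python
class Location:
    """Stand-in for a heavy object; counts how many were constructed."""

    built = 0

    def __init__(self, start, end):
        Location.built += 1
        self.start = start
        self.end = end


class LazyRecord:
    """Holds raw (start, end) tuples and builds Locations only on demand."""

    def __init__(self, seq_id, raw_locations):
        self.seq_id = seq_id
        self._raw_locations = raw_locations  # cheap tuples
        self._locations = None  # built on first access

    @property
    def locations(self):
        if self._locations is None:
            self._locations = [Location(s, e) for s, e in self._raw_locations]
        return self._locations


# A thousand records, each with two raw locations.
records = [LazyRecord(f"seq{i}", [(1, 10), (20, 30)]) for i in range(1000)]
built_before = Location.built  # nothing constructed yet

first_start = records[0].locations[0].start  # triggers construction for one record
built_after = Location.built
```

Only the one record that was actually inspected pays for its Location objects; the other 999 stay as plain tuples.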