[Bioperl-l] Next-gen modules
e.stupka at ucl.ac.uk
Wed Jun 17 16:06:35 EDT 2009
Interesting that you mention the database issue. We found that for
specific memory/CPU-intensive tasks we have also switched to using
databases. For example, after many years of loyal use of
disconnected_ranges we switched to a simple SQL implementation of it,
because of the large performance gains it gave us. Similarly, in
Ensembl as well as in the old days of bioperl-db, we opted for doing
subseq within SQL.
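To make the disconnected_ranges idea concrete, here is a rough sketch of
how merging overlapping ranges can be pushed into SQL. It is not our
actual implementation: the table name, column names and the function
name are made up for illustration, it is written in Python with SQLite
only to keep it self-contained, and it assumes an SQLite build with
window functions (3.25 or later). Ranges are treated as 1-based and
inclusive, so 1-5 and 5-8 merge while 1-5 and 6-10 stay separate.

```python
import sqlite3

# Merge overlapping ranges with a "gaps and islands" query.
# Table and column names (ranges, start_pos, end_pos) are hypothetical.
MERGE_SQL = """
WITH ordered AS (
    -- For each range, find the furthest end seen among earlier ranges.
    SELECT start_pos, end_pos,
           MAX(end_pos) OVER (
               ORDER BY start_pos, end_pos
               ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
           ) AS prev_max_end
    FROM (SELECT DISTINCT start_pos, end_pos FROM ranges)
),
flagged AS (
    -- A range opens a new island if it starts after everything before it.
    SELECT start_pos, end_pos,
           CASE WHEN prev_max_end IS NULL OR start_pos > prev_max_end
                THEN 1 ELSE 0 END AS is_new
    FROM ordered
),
grouped AS (
    -- A running count of island starts gives each range its island id.
    SELECT start_pos, end_pos,
           SUM(is_new) OVER (
               ORDER BY start_pos, end_pos
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS grp
    FROM flagged
)
SELECT MIN(start_pos), MAX(end_pos)
FROM grouped
GROUP BY grp
ORDER BY 1
"""


def disconnected_ranges_sql(ranges):
    """Return the merged, non-overlapping (start, end) tuples for `ranges`."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute(
            "CREATE TABLE ranges (start_pos INTEGER NOT NULL, "
            "end_pos INTEGER NOT NULL)"
        )
        conn.executemany("INSERT INTO ranges VALUES (?, ?)", ranges)
        return conn.execute(MERGE_SQL).fetchall()
    finally:
        conn.close()
```

Calling disconnected_ranges_sql([(1, 5), (3, 8), (10, 12), (12, 15)])
should give [(1, 8), (10, 15)]. The point is only that the sort and the
merge happen inside the database engine, with no per-range objects.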
Some lean way of SQL'izing specific components could be less
"disruptive" than avoiding object creation and provide significant
gains in performance. Could be set as an optional flag, and could use
temporary ad hoc SQL databases?
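As a rough illustration of what a temporary ad hoc database could look
like for subseq, here is a minimal sketch. Everything in it is an
assumption for the sake of the example: the schema, the function names
and the use of Python with an in-memory SQLite database are mine, not
anything that exists in BioPerl or bioperl-db. Coordinates are 1-based
and inclusive, as in subseq.

```python
import sqlite3


def open_temp_seq_store():
    """Create a throwaway in-memory store (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seq (seq_id TEXT PRIMARY KEY, residues TEXT NOT NULL)"
    )
    return conn


def add_seq(conn, seq_id, residues):
    """Load one sequence into the temporary store."""
    conn.execute(
        "INSERT INTO seq (seq_id, residues) VALUES (?, ?)",
        (seq_id, residues),
    )


def subseq(conn, seq_id, start, end):
    """Fetch residues start..end (1-based, inclusive) using SQL substr()."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    row = conn.execute(
        # SQLite's substr(X, Y, Z) takes a 1-based start and a length.
        "SELECT substr(residues, ?, ?) FROM seq WHERE seq_id = ?",
        (start, end - start + 1, seq_id),
    ).fetchone()
    if row is None:
        raise KeyError(seq_id)
    return row[0]
```

With "ACGTACGTAC" stored as read1, subseq(conn, "read1", 3, 6) returns
"GTAC". Only the requested slice comes back from the database, so the
caller never has to build an object around the full sequence.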
Still, the priority now is to make SeqIO compliant with all those
formats; then we can worry about performance :)
On 17 Jun 2009, at 20:30, Chris Fields wrote:
> On Jun 17, 2009, at 1:20 PM, Sendu Bala wrote:
>> Tristan Lefebure wrote:
>>> Regarding next-gen sequences and bioperl, following my experience,
>>> another issue is bioperl speed. For example, if you want to trim
>>> bad quality bases at ends of 1E6 Solexa reads using
>>> Bio::SeqIO::fastq and some methods in Bio::Seq::Quality, well,
>>> you've got to be patient (but maybe I missed some shortcuts...).
>> This is my concern as well. Or, rather, is there actually a
>> significant set of users out there who are dealing with next-gen
>> sequencing and would consider using BioPerl for their work?
>> I'm working with all the 1000-genomes data at the Sanger, and we at
>> least are probably never going to use BioPerl for the work.
> Are you using pure perl or (gasp) something else? ;>
> Judging by the feedback there is definitely a set of users who
> would like to integrate nextgen into bioperl somehow, probably to
> take advantage of other aspects of bioperl.
>>> A pure perl solution will be between 100 and 1000x faster... Would
>>> it be possible to have an ultra-light quality object with few
>>> simple methods for next-gen reads?
>> The fastq parser itself already seems pretty fast. The way to get
>> the speedup is to not create any Bio::Seq* objects but just return
>> the data directly. At that point it's not taking much advantage of
>> BioPerl. But certainly it could be done...
> I suppose the best way to assess what needs to be done is come up
> with a set of 'use cases' specifying what users want so we can
> design around them, otherwise we're shooting in the dark.
> I'm personally wondering if this could be done as a sequence
> database, something similar in theme to Lincoln's SeqFeature::Store,
> but sequence only, and returns quality objects in a similar manner
> (ala Storable)? Not sure whether that's feasible, but it appears
> at least scalable.
Senior Lecturer, Bioinformatics
UCL Cancer Institute
Paul O' Gorman Building
University College London
Office (UCL): +44 207 679 6493
Office (ICMS): +44 0207 8822374
Mobile: +44 7597 566 194
Mobile (Italy): +39 338 8448801