[Bioperl-l] Next-gen modules

Elia Stupka e.stupka at ucl.ac.uk
Wed Jun 17 16:06:35 EDT 2009

Interesting that you mention the database issue. We found that for  
specific memory/CPU-intensive tasks we also switched to using databases. For  
example, after many years of loyal use of disconnected_ranges we  
switched to a simple SQL implementation of it, because of the large  
performance gains it gave us. Similarly, in Ensembl as well as  
in the old days of bioperl-db, we opted for doing subseq within SQL  
where possible.
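
To make the disconnected_ranges point concrete, here is a minimal sketch in
Python with SQLite. It is not the implementation referred to above, only an
illustration. It assumes that disconnected_ranges means collapsing overlapping
ranges into the minimal non-overlapping set, with 1-based inclusive
coordinates. The table `ranges`, its columns, and the function `merge_ranges`
are hypothetical names made up for this example.

```python
import sqlite3

# A merged block starts at a range_start that no other range covers, and
# ends at a range_end that no other range covers. Each block start is then
# paired with the nearest block end at or after it.
MERGE_SQL = """
SELECT s.range_start, MIN(e.range_end)
FROM (
    SELECT DISTINCT a.range_start AS range_start FROM ranges a
    WHERE NOT EXISTS (
        SELECT 1 FROM ranges b
        WHERE b.range_start < a.range_start AND b.range_end >= a.range_start
    )
) AS s
JOIN (
    SELECT DISTINCT a.range_end AS range_end FROM ranges a
    WHERE NOT EXISTS (
        SELECT 1 FROM ranges b
        WHERE b.range_end > a.range_end AND b.range_start <= a.range_end
    )
) AS e ON e.range_end >= s.range_start
GROUP BY s.range_start
ORDER BY s.range_start
"""


def merge_ranges(ranges):
    """Merge overlapping (start, end) pairs inside a throwaway SQLite database."""
    conn = sqlite3.connect(":memory:")  # temporary, nothing touches disk
    try:
        conn.execute(
            "CREATE TABLE ranges ("
            "range_start INTEGER NOT NULL, range_end INTEGER NOT NULL)"
        )
        conn.executemany("INSERT INTO ranges VALUES (?, ?)", ranges)
        return conn.execute(MERGE_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(merge_ranges([(1, 10), (5, 20), (30, 40), (35, 38), (50, 60)]))
```

With the sample input, the three overlapping groups collapse to (1, 20),
(30, 40) and (50, 60). Ranges that merely touch, such as (1, 5) and (6, 10),
are left separate in this sketch.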

Some lean way of SQL'izing specific components could be less  
"disruptive" than avoiding object creation and still provide significant  
gains in performance. Perhaps it could be set as an optional flag and use  
temporary ad hoc SQL databases?
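
As a rough sketch of what a temporary ad hoc database could look like, the
Python example below loads sequences into an in-memory SQLite database and
does the subseq slicing inside SQL. Everything here is an assumption made for
illustration: the table `seqs`, the helpers `make_temp_store` and `subseq`,
and the choice of SQLite are not part of BioPerl, Ensembl or bioperl-db.

```python
import sqlite3


def make_temp_store(records):
    """Load (seq_id, residues) pairs into a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seqs (seq_id TEXT PRIMARY KEY, residues TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO seqs VALUES (?, ?)", records)
    return conn


def subseq(conn, seq_id, start, end):
    """Return residues start..end (1-based, inclusive), sliced by the database.

    SQLite's substr() is 1-based, so the coordinates pass through unchanged
    and only the requested slice is handed back to the caller.
    """
    row = conn.execute(
        "SELECT substr(residues, ?, ?) FROM seqs WHERE seq_id = ?",
        (start, end - start + 1, seq_id),
    ).fetchone()
    if row is None:
        raise KeyError(seq_id)
    return row[0]


if __name__ == "__main__":
    store = make_temp_store([("demo_seq", "ACGTACGTAC")])
    print(subseq(store, "demo_seq", 3, 6))
    store.close()
```

An optional flag, as suggested above, could then simply decide whether a
store like this is created or whether the usual in-memory objects are used.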

Still, the priority now is to make SeqIO compliant with all those formats;  
then we can worry about performance :)


On 17 Jun 2009, at 20:30, Chris Fields wrote:

> On Jun 17, 2009, at 1:20 PM, Sendu Bala wrote:
>> Tristan Lefebure wrote:
>>> Hello,
>>> Regarding next-gen sequences and bioperl, following my experience,  
>>> another issue is bioperl speed. For example, if you want to trim  
>>> bad quality bases at ends of 1E6 Solexa reads using  
>>> Bio::SeqIO::fastq and some methods in Bio::Seq::Quality, well,  
>>> you've got to be patient (but maybe I missed some shortcuts...).
>> This is my concern as well. Or, rather, is there actually a  
>> significant set of users out there who are dealing with next-gen  
>> sequencing and would consider using BioPerl for their work?
>> I'm working with all the 1000-genomes data at the Sanger, and we at  
>> least are probably never going to use BioPerl for the work.
> Are you using pure perl or (gasp) something else?  ;>
> Judging by the feedback there is definitely a set of users who  
> would like to integrate nextgen into bioperl somehow, probably to  
> take advantage of other aspects of bioperl.
>>> A pure perl solution will be between 100 and 1000x faster... Would  
>>> it be possible to have an ultra-light quality object with few  
>>> simple methods for next-gen reads?
>> The fastq parser itself already seems pretty fast. The way to get  
>> the speedup is to not create any Bio::Seq* objects but just return  
>> the data directly. At that point it's not taking much advantage of  
>> BioPerl. But certainly it could be done...
> I suppose the best way to assess what needs to be done is come up  
> with a set of 'use cases' specifying what users want so we can  
> design around them, otherwise we're shooting in the dark.
> I'm personally wondering if this could be done as a sequence  
> database, something similar in theme to Lincoln's SeqFeature::Store,  
> but sequence only, and returns quality objects in a similar manner  
> (ala Storable)?  Not sure whether that's feasible, but it appears  
> at least scalable.
> chris

Senior Lecturer, Bioinformatics
UCL Cancer Institute
Paul O' Gorman Building
University College London
Gower Street

Office (UCL): +44 207 679 6493
Office (ICMS): +44 0207 8822374

Mobile: +44 7597 566 194
Mobile (Italy): +39 338 8448801
