[Bioperl-l] Next-gen modules
cjfields at illinois.edu
Tue Jun 30 13:46:27 EDT 2009
On Jun 30, 2009, at 6:28 AM, Giles Weaver wrote:
> I'm developing a transcriptomics database for use with next-gen
> data, and
> have found processing the raw data to be a big hurdle.
> I'm a bit late in responding to this thread, so most issues have
> been discussed. One thing that hasn't been mentioned is removal of
> adapters from raw Illumina sequence. This is a PITA, and I'm not aware of any
> developed and documented open source software for removal of
> adapters (and
> poor quality sequence) from Illumina reads.
> My current Illumina sequence processing pipeline is an unholy mix of
> biopython, bioperl, pure perl, emboss and bowtie. Biopython for
> converting the Illumina fastq to Sanger fastq, bioperl to read the quality
> values, pure
> perl to trim the poor quality sequence from each read, and bioperl
> emboss to remove the adapter sequence. I'm aware that the pipeline has
> bugs and would like to simplify it, but at least it does work...
My local bioperl is working with FASTQ parsing of Sanger and Illumina
(but not solexa yet). I'll commit what I have today, and we should be
able to add in solexa soon. We'll also need to add in write_seq
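
For anyone unsure why solexa needs separate handling: the three FASTQ variants differ only in how the quality line is encoded. Sanger uses Phred scores at ASCII offset 33, Illumina 1.3+ uses Phred scores at offset 64, and Solexa uses log-odds scores at offset 64 that have to be converted to Phred. Below is a minimal sketch of that decoding, written in Python purely for illustration. The function names are hypothetical and this is not the Bio::SeqIO::fastq code.

```python
import math

def phred_scores(quality, variant="sanger"):
    """Decode a FASTQ quality string into a list of integer Phred scores."""
    if variant == "sanger":       # Phred scores, ASCII offset 33
        return [ord(c) - 33 for c in quality]
    if variant == "illumina":     # Illumina 1.3+: Phred scores, ASCII offset 64
        return [ord(c) - 64 for c in quality]
    if variant == "solexa":       # Solexa: log-odds scores, ASCII offset 64
        return [round(10 * math.log10(10 ** ((ord(c) - 64) / 10) + 1))
                for c in quality]
    raise ValueError("unknown FASTQ variant: %s" % variant)

def to_sanger(quality, variant):
    """Re-encode a quality string as Sanger FASTQ (scores capped at 93)."""
    return "".join(chr(min(q, 93) + 33) for q in phred_scores(quality, variant))
```

The Solexa-to-Phred step is lossy at the low end because of rounding, so a round trip back to Solexa scores will not always reproduce the original string.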
> Ideally I'd like to replace as much of the pipeline as possible with
> bioperl/bioperl-run, but this isn't currently possible due to both a
> lack of
> features and poor performance. I'm sure the features will come with time,
> but the performance is more of a concern to me. I wonder if
> Bio::Moose might
> be used to alleviate some of the performance issues? Might next-gen
> be an ideal guinea pig for Bio::Moose?
We should get FASTQ working in core first then optimize on speed (as
Elia previously pointed out). We can do that within the actual SeqIO
parser using a few simple tricks. For instance my local
Bio::SeqIO::fastq has a reconfigured next_seq to call an iterator that
returns raw processed data as a simple hash ref; users have access to
that method, so if one wanted they could retrieve the raw data
directly, or pass it through a filter that only creates seq instances
one wants on the fly (that would be where your quality checks, adaptor
modification, etc. fit in).
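
To make the idea concrete, here is a rough sketch of the pattern in Python (for illustration only, not the actual Bio::SeqIO::fastq code). It assumes strict four-line FASTQ records. `raw_records`, `Read`, and `next_reads` are hypothetical names: the first yields plain dicts, and the second stands in for an expensive sequence object that is only built for records the filter accepts.

```python
import io

def raw_records(handle):
    """Yield each four-line FASTQ record as a plain dict (no objects built)."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield {"id": header.rstrip("\n")[1:], "seq": seq, "qual": qual}

class Read:
    """Hypothetical stand-in for a heavyweight sequence object."""
    def __init__(self, record):
        self.id = record["id"]
        self.seq = record["seq"]
        self.qual = record["qual"]

def next_reads(handle, keep=lambda rec: True):
    """Build Read objects only for raw records accepted by `keep`."""
    for rec in raw_records(handle):
        if keep(rec):
            yield Read(rec)

# Usage: only reads of at least 4 bases are ever turned into objects.
fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC\n+\nII\n")
long_reads = list(next_reads(fastq, keep=lambda rec: len(rec["seq"]) >= 4))
```

Callers who want maximum speed can loop over `raw_records` directly and skip object creation altogether.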
In the end it might be best to wrap a C/C++-based solution for speed. As
mentioned previously, a C-based parser exists from the Sanger Centre that
we could incorporate in some fashion, but I would like it to be able
to report back file position for fast indexing. The code is fairly
simple, so it shouldn't be too hard to incorporate that somehow.
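
What "report back file position" buys us can be sketched as follows, again in Python purely for illustration and unrelated to the Sanger parser's real interface. The parser records the byte offset at which each record starts, and a later lookup seeks straight to it. The sketch assumes strict four-line records and a handle opened in binary mode; `index_fastq` and `fetch` are hypothetical names.

```python
import io

def index_fastq(handle):
    """Map each read ID to the byte offset where its record starts."""
    index = {}
    while True:
        offset = handle.tell()          # position before reading the header
        header = handle.readline()
        if not header:
            break
        read_id = header[1:].split()[0].decode("ascii")
        index[read_id] = offset
        for _ in range(3):              # skip sequence, '+', and quality lines
            handle.readline()
    return index

def fetch(handle, index, read_id):
    """Seek to a record by ID and return its four lines as strings."""
    handle.seek(index[read_id])
    return [handle.readline().rstrip(b"\n").decode("ascii") for _ in range(4)]

# Usage with an in-memory file; a real file would be opened with mode "rb".
data = io.BytesIO(b"@r1\nACGT\n+\nIIII\n@r2 extra\nGG\n+\nII\n")
idx = index_fastq(data)
record = fetch(data, idx, "r2")
```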
Just so there is no confusion, Bio::Moose is an attempt to both lay
out plans for perl6 and deal with inheritance issues within bioperl
now. It's still in very early development and may not see a release
until Dec. at the very earliest; it will be an alpha release then, and
likely won't have every major class represented at that point. It's
also not intended to be backwards-compatible with bioperl core. It
may help, but that's not an absolute certainty. As for bioperl6, it
will be pre-alpha until perl6 spec reaches a stable draft and we have
an active implementation.
> For my purposes the tools that I would love to see supported in
> bioperl/bioperl-run are:
> - next-gen sequence quality parsing (to output phred scores)
> - sequence quality based trimming
> - sequencing adapter removal
> - filtering based on sequence complexity (repeats, entropy etc)
> - bioperl-run modules for bowtie etc.
> Obviously all of these need to be fast!
> I'd love to muck in, but I doubt I'll contribute much before
> Bio::Moose/bioperl6, as the (bio)perl object system gives me
One can only read a file so fast (even with a highly optimized C/C++
based parser), but I don't think that will be the limiting factor as
much as object instantiation.
> Regarding trimming bad quality bases (see comments from Tristan
> from Solexa/Illumina reads, I did find a mixed pure/bioperl solution
> to be
> much faster than a primarily bioperl based implementation. I found
> Bio::Seq->subseq(a,b) and Bio::Seq->subqual(a,b) to be far too slow. My
> current code trims ~1300 sequences/second, including unzipping the
> raw data
> and converting it to sanger fastq with biopython. Processing an entire
> sequencing run with the whole pipeline takes in the region of 6-12h.
Right, hence coming up with a 'pre-filter' for raw data (hash refs)
prior to object instantiation to speed things up. This will be a bit
easier with Bio::Moose as we can introspect attributes via the meta
class, but this will be a while yet.
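
As a rough idea of what such a pre-filter might look like for quality trimming, here is a sketch in Python (illustration only, not BioPerl code). It works on raw records held as plain dicts with `id`, `seq` and `qual` keys, trims the 3' end back to the last base at or above a threshold, and drops reads that end up too short. The trimming rule, the Sanger offset of 33, and the default thresholds are all assumptions picked for the example, not anything agreed in this thread.

```python
def trim_low_quality(record, min_q=20, offset=33):
    """Trim the 3' end of a raw record back to the last base with Q >= min_q."""
    qual = record["qual"]
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return {"id": record["id"], "seq": record["seq"][:end], "qual": qual[:end]}

def prefilter(records, min_q=20, min_len=20):
    """Yield trimmed raw records, skipping reads shorter than min_len."""
    for rec in records:
        trimmed = trim_low_quality(rec, min_q)
        if len(trimmed["seq"]) >= min_len:
            yield trimmed

# Usage: '#' encodes Q2 in Sanger FASTQ, so the last two bases are trimmed.
raw = [{"id": "r1", "seq": "ACGTAC", "qual": "IIII##"}]
kept = list(prefilter(raw, min_q=20, min_len=4))
```

Since only strings are sliced here, the per-read cost stays well below that of building a sequence object and then calling methods on it.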
> Hope this looooong post was of interest to someone!
It's always good to hear about such issues and what one expects.