Bioperl: Bio::Tools::Blast vs. Bio::GSC::Tool::Blast
Mon, 8 Jun 1998 12:18:05 -0500
I just checked out Steve's Bio::Tools::Blast and thought I'd make a comparison
to my Bio::GSC effort. First off, to answer Ewan's questions: there is lots of
overlap between Steve's and my Blast modules and yes, we talked briefly about
merging our efforts, but nothing came of it. I'd really like to collaborate
with others, but it isn't easy to do.
Steve passes named parameters in an anonymous hash, whereas I have a strict
order of arguments. Is there any reason to prefer one over the other? Should
all Bioperl modules have a consistent syntax? I make liberal use of constants.pm
for parameters. This has the downside of polluting the namespace of an importing
package, but it gives compile-time name-checking. Any thoughts on this?
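
To make the two calling styles concrete, here is a minimal sketch. The function names (parse_named, parse_positional), the -file and -signif keys, and the MAX_P constant are all made up for illustration; none of this is the actual interface of Bio::Tools::Blast or Bio::GSC::Tool::Blast.

```perl
use strict;
use warnings;

# A constant used as a parameter value. Under "use strict", misspelling
# it (say MAX_PP) is caught at compile time rather than at run time.
use constant MAX_P => 1e-5;

# Hypothetical: named parameters passed in an anonymous hash.
# Order does not matter and optional parameters can simply be left out.
sub parse_named {
    my ($args) = @_;
    my $file   = $args->{-file};
    my $signif = exists $args->{-signif} ? $args->{-signif} : MAX_P;
    return "$file:$signif";
}

# Hypothetical: strict positional order.
# Shorter to call, but the caller has to remember which slot is which.
sub parse_positional {
    my ($file, $signif) = @_;
    $signif = MAX_P unless defined $signif;
    return "$file:$signif";
}

print parse_named({ -signif => 0.01, -file => 'report.blast' }), "\n";
print parse_positional('report.blast', 0.01), "\n";
```

Both calls do the same thing here. The difference shows up once a function has five or six optional parameters.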
Steve's documentation looks much more comprehensive than mine. Very nice. Lots of examples too.
Philosophy (for lack of a better term)
Me first. I have tried to make Bio::GSC::Feature an abstract class to encompass
any sequence analysis tool that creates features. A feature has a begin and
an end and possibly many other attributes. So a gene prediction algorithm
would have exon prediction features. But some features are complex: a gene
feature may contain exon features. In a BLAST report, the individual HSPs are
sequence features that are combined in a subject sequence feature. So one can
ask for the begin of an HSP by $hsp->getBegin or the begin of a subject by
$sbjct->getBegin. Some programs actually spit out both simple and complex
features (e.g., Genscan makes gene predictions, which contain exons, and also
reports suboptimal exons not contained in any gene structure).
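
The nesting idea can be sketched in a few lines of Perl. This is a stand-in class, not Bio::GSC::Feature itself: the package name My::Feature, the constructor arguments, and getEnd are assumptions, and so is the rule that a complex feature reports the extremes of the features it contains. Only the getBegin method name comes from the description above.

```perl
use strict;
use warnings;

package My::Feature;   # hypothetical stand-in for illustration

sub new {
    my ($class, %arg) = @_;
    my $self = {
        begin => $arg{begin},
        end   => $arg{end},
        parts => $arg{parts} || [],   # contained features, if any
    };
    return bless $self, $class;
}

# A simple feature reports its own begin; a complex feature reports
# the smallest begin among the features it contains (assumed rule).
sub getBegin {
    my ($self) = @_;
    return $self->{begin} unless @{ $self->{parts} };
    my ($min) = sort { $a <=> $b } map { $_->getBegin } @{ $self->{parts} };
    return $min;
}

# Same idea for the end coordinate, taking the largest value.
sub getEnd {
    my ($self) = @_;
    return $self->{end} unless @{ $self->{parts} };
    my ($max) = sort { $b <=> $a } map { $_->getEnd } @{ $self->{parts} };
    return $max;
}

package main;

# Two HSPs combined into one subject feature.
my @hsps = (
    My::Feature->new(begin => 120, end => 300),
    My::Feature->new(begin => 40,  end => 95),
);
my $sbjct = My::Feature->new(parts => \@hsps);

print $hsps[0]->getBegin, "\n";                        # 120
print $sbjct->getBegin, "-", $sbjct->getEnd, "\n";     # 40-300
```

The same two methods work at either level, which is what lets a gene hold exons and a subject hold HSPs without special cases in the calling code.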
Steve looks to have favored depth over breadth. His HSP, Sbjct, and Blast
objects are specific to BLAST (or, with little modification, probably some
other pairwise alignment algorithm). I'm not sure whether his modules
offer more functionality than mine, but I suspect they do because his objects
aren't as general as mine. For example, if you want to know the gap penalty
in a report, you can get it from his, but not mine. You can also run BLAST
locally or remotely from his modules (very cool).
We have both decided that one should be able to filter a report as it is read
to preserve memory and cut down on parsing time. Both modules offer a flexible
way of doing this. In Steve's, you can filter on P or pass in a function ref
for just about anything. In mine, you pass in a filtering statement that gets
massaged and eval'd. For example, you can pass in something like this:
"Length < 1000 and (P < 1e-5 or Percent > 98)"
Because features can be nested, one can filter at multiple levels (e.g., you can
filter HSPs separately from Sbjcts).
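
Here is a rough sketch of how a statement like the one above could be massaged and eval'd into a function ref. It is not the actual Bio::GSC code: make_filter, the hash-based features, and the rule that capitalized words are attribute names are all assumptions. The statement is tokenized first so that only attribute names, numbers, comparisons, parentheses and and/or/not ever reach eval.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a filter statement into a code ref
# that takes one feature (a hash ref here) and returns true or false.
sub make_filter {
    my ($statement) = @_;
    my $number = qr/\d+(?:\.\d+)?(?:[eE][-+]?\d+)?/;
    my @code;
    pos($statement) = 0;
    # Consume one token at a time; \G keeps each match anchored where
    # the previous one stopped, and /c keeps pos() when matching fails.
    while ($statement =~ /\G\s*(?:([A-Z]\w*)|(and|or|not|<=|>=|==|!=|<|>|\(|\)|$number))/gc) {
        # Capitalized words become hash lookups on the feature;
        # every other token is passed through as ordinary Perl.
        push @code, defined $1 ? "\$_[0]{$1}" : $2;
    }
    $statement =~ /\G\s*\z/gc
        or die "cannot parse filter near position " . pos($statement) . "\n";
    my $sub = eval 'sub { ' . join(' ', @code) . ' }'
        or die "bad filter: $@";
    return $sub;
}

my $keep = make_filter("Length < 1000 and (P < 1e-5 or Percent > 98)");

my @hsps = (
    { Length => 250,  P => 1e-20, Percent => 80 },   # short and significant
    { Length => 250,  P => 0.5,   Percent => 99 },   # short and near-identical
    { Length => 5000, P => 1e-50, Percent => 99 },   # too long
);

my @kept = grep { $keep->($_) } @hsps;
print scalar(@kept), " of ", scalar(@hsps), " HSPs kept\n";
```

Since the result is an ordinary code ref, a filter compiled this way is applied in the same way as a function ref written by hand.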
I've tried to make Bio::GSC::Tool as general as possible. One can parse BLAST
reports or anything else supported (like RepeatMasker or Genscan). The filtering
functions can be used on any tool. For example, one can read in a RepeatMasker
report and grab only the Alu elements (or the reverse). In addition, there are
functions for comparing one feature to another. So one can ask where my
BLAST hits coincide with my Genscan exons. Also, it is easy to compare the outputs of
similar programs to determine which one is working better, or if different
parameters or versions created different outputs. By comparing features within
a report, one gets clustering. We use this at the GSC to automatically report
only the best ESTs with unique topologies.
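
As an illustration of feature comparison and clustering, here is a minimal sketch that uses plain coordinate overlap. The real comparison functions in Bio::GSC::Tool are not shown in this message and are surely richer than this (the EST topology clustering in particular); the names overlaps and cluster, the hash layout, and the overlap rule are assumptions.

```perl
use strict;
use warnings;

# Two features overlap if neither ends before the other begins.
sub overlaps {
    my ($x, $y) = @_;
    return $x->{begin} <= $y->{end} && $y->{begin} <= $x->{end};
}

# Simple clustering: sort by begin, then keep extending the current
# cluster while the next feature begins at or before the cluster's end.
sub cluster {
    my @sorted = sort { $a->{begin} <=> $b->{begin} } @_;
    my @clusters;
    my $cluster_end;
    for my $f (@sorted) {
        if (@clusters && $f->{begin} <= $cluster_end) {
            push @{ $clusters[-1] }, $f;
            $cluster_end = $f->{end} if $f->{end} > $cluster_end;
        } else {
            push @clusters, [$f];
            $cluster_end = $f->{end};
        }
    }
    return @clusters;
}

# Made-up coordinates standing in for Genscan exons and BLAST hits.
my @exons = (
    { name => 'exon1', begin => 100, end => 200 },
    { name => 'exon2', begin => 500, end => 650 },
);
my @hits = (
    { name => 'hitA', begin => 150, end => 260 },
    { name => 'hitB', begin => 300, end => 400 },
    { name => 'hitC', begin => 600, end => 700 },
);

# Comparing features across two reports.
for my $hit (@hits) {
    my @matched = map { $_->{name} } grep { overlaps($hit, $_) } @exons;
    print "$hit->{name}: ", (@matched ? "@matched" : "no exon"), "\n";
}

# Comparing features within one pooled set gives clusters.
my @clusters = cluster(@hits, @exons);
print scalar(@clusters), " clusters\n";
```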
I have intentionally separated the running and processing of tools, but I may
integrate them as Steve has. It seems a natural thing to do.
Steve, more comments?