[Bioperl-l] Proposed bioperl module for local running of the NCBI standalone blast package

David Block dblock@gene.pbi.nrc.ca
Mon, 16 Oct 2000 09:42:19 -0600 (CST)

On a related note, check http://bioinfo.pbi.nrc.ca/dblock/wiki for
ParallelBlast, a project we have been working on to dynamically split the
database and blast hits on a cluster (32 nodes right now).  Pretty much
linear speedup, and automatic parsing by the server.  It's amazing, and it
will be open source soon, as well.

I'll keep you posted.


On Mon, 16 Oct 2000, Hilmar Lapp wrote:

> Peter Schattner wrote:
> > 
> > So instead I propose to write a relatively "light weight"  bioperl
> > wrapper module for running the NCBI standalone blast package.  Its
> > format would be similar to that of the Clustalw.pm module. I believe its
> > approach would also be similar to that of Jeff Chang's biopython
> > NCBIStandalone.py module (thanks to Brad Chapman for bringing this
> > module to my attention).
> > 
> > The syntax of the proposed module would involve creating a local blast
> > "factory object". The constructor would be passed the name of the blast
> > method and database to be used, the desired method for parsing the blast
> > report (Blast or BPlite) and an optional array of (non-default)
> > parameters to be used by the factory, eg:
> > 
> > @params = ('method' => 'blastn', 'database' => 'ecoli.nt', 'outformat' => 'BPlite');
> > $factory = Bio::Tools::StandAloneBlast->new(@params);
> > 
> Sounds good to me, and is certainly useful. We (and certainly a lot of
> others :) are already calling the stand-alone BLAST from within Perl as a
> system call, but your proposal is certainly much more transparent and
> re-usable, and I like the factory idea. 
> Basically, I have one comment. It would be very helpful if such a module
> could also support running stand-alone BLASTs in parallel, e.g., if
> you've got a multi-processor machine. I know that the current NCBI BLAST
> supports multi-threading, but on well-equipped machines it often scales
> better to run multiple processes. So, the idea is then that I can pass an
> array of Seq objects and these will be run in parallel, returning an
> array of BPlite or Blast.pm objects. At the low level, there may be a
> memory issue if the array is a few hundred or a few thousand seqs long (which
> it is for us). So, instead of returning a full array of result objects,
> one may consider a callback invoked for each finished report.
> 	Hilmar
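
The callback idea above could be sketched roughly as follows. This is only
a minimal illustration, not part of any existing bioperl module: the
run_parallel routine, its arguments, and the stand-in worker are all
hypothetical names made up for the example. It assumes a Unix-like system
where fork() is available, and it has each child write its report to a
temporary file so the parent never holds more than one finished report at
a time.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use POSIX ();

# Hypothetical helper: run $worker->($job, $outfile) for every job, with at
# most $max_procs child processes alive at once. As each child exits, the
# parent calls $callback->($job, $outfile, $status) for that one report.
sub run_parallel {
    my ($jobs, $max_procs, $worker, $callback) = @_;
    my $dir   = tempdir(CLEANUP => 1);   # removed when the parent exits
    my @queue = @$jobs;
    my %running;                         # pid => [job, outfile]
    my $n = 0;

    while (@queue or %running) {
        # Start new children while there is spare capacity and work left.
        while (@queue and scalar(keys %running) < $max_procs) {
            my $job     = shift @queue;
            my $outfile = "$dir/report_" . $n++ . ".out";
            my $pid     = fork();
            die "fork failed: $!" unless defined $pid;
            if ($pid == 0) {
                # Child: do the work, then leave without running the
                # parent's END blocks or temp-dir cleanup.
                my $ok = eval { $worker->($job, $outfile); 1 };
                POSIX::_exit($ok ? 0 : 1);
            }
            $running{$pid} = [$job, $outfile];
        }
        # Block until any child finishes, then hand over its report.
        my $pid = wait();
        last if $pid == -1;
        my $status = $? >> 8;
        my $info = delete $running{$pid} or next;
        $callback->($info->[0], $info->[1], $status);
    }
    return;
}

# Stand-in worker (an assumption for the demo). A real one might instead run
#   system('blastall', '-p', 'blastn', '-d', 'ecoli.nt',
#          '-i', $query_file, '-o', $outfile);
my $worker = sub {
    my ($job, $outfile) = @_;
    open my $fh, '>', $outfile or die "cannot write $outfile: $!";
    print {$fh} "report for $job\n";
    close $fh or die "close failed: $!";
};

my %seen;
run_parallel([qw(seq1 seq2 seq3 seq4 seq5)], 2, $worker, sub {
    my ($job, $outfile, $status) = @_;
    open my $fh, '<', $outfile or die "cannot read $outfile: $!";
    chomp(my $line = <$fh>);
    close $fh;
    $seen{$job} = $line;   # a real callback would parse the report here
    unlink $outfile;       # discard the report once it has been handled
});
print "$_ => $seen{$_}\n" for sort keys %seen;
```

In real use the callback is where a BPlite or Blast.pm parser would be
pointed at the finished report file, and the job list would hold Seq
objects (or query file names) rather than plain strings.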

David Block
Plant Biotechnology Institute
National Research Council of Canada
Saskatoon, Saskatchewan