[Bioperl-l] Bio::Index::Blast bug
cjfields at illinois.edu
Wed Mar 10 10:19:42 EST 2010
On Mar 10, 2010, at 8:40 AM, Peter wrote:
> On Wed, Mar 10, 2010 at 2:27 PM, Chris Fields wrote:
>> On Mar 10, 2010, at 4:35 AM, Peter wrote:
>>> On Wed, Mar 10, 2010 at 8:20 AM, Till Bayer wrote:
>>>> Hi all!
>>>> I tried to use Bio::Index::Blast, but always got the first hit back, no
>>>> matter what ID I used. The reason is that the Blast indexer seems to use
>>>> 'BLAST' as a record separator in all cases, except for RPS-BLAST.
>>>> I think however that for the current versions of blastall and blast+
>>>> 'Query=' should be used.
>>> That fits with changes I had to make in Biopython for breaking
>>> up the plain text BLAST output into each query. For a while only
>>> the RPS-BLAST report omitted the "header" (the BLAST line
>>> and the journal references users should cite) between records,
>>> but now all the NCBI BLAST tools do this - forcing us to look
>>> for the Query= line.
>>> i.e. I can't comment on the BioPerl change itself, but your
>>> reasoning about the BLAST output makes sense.
>> One side-effect of this is we will be missing the search
>> algorithm and a few small odds and ends from all but
>> the first report; this trickles down into how we properly
>> deal with HSP coordinates, but we can probably wrangle
>> some magic there to get things working for the most part.
> Yeah - I had similar issues with the Biopython plain
> text BLAST parser. The hack/magic I used was to
> cache the header text from the first record and then
> re-insert it on subsequent records. Nasty, but works.
Right, but here's the side-effect: unless that data is somehow stored when indexing, it will not be caught if one starts an IO stream at any point past the BLAST header (in other words, all but the first report). We could, in effect, store that as meta information somehow (I think Index may have some meta storage), or just parse it prior to initiating the stream and pass the information into the IO object.
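To make the idea concrete, here is a rough Python sketch (not the actual Bio::Index::Blast or Biopython code) of indexing on 'Query=' while keeping the header as meta information. The function names index_blast_text and fetch_report and the toy report text are made up for illustration, and it assumes the whole report fits in a string and that the first token after 'Query=' is a unique ID.

```python
def index_blast_text(text):
    """Index a multi-query plain-text BLAST report held in a string.

    Returns (header, offsets): header is everything before the first
    'Query=' line; offsets maps each query ID to a (start, end) slice.
    """
    starts = []  # (query_id, character offset of the 'Query=' line)
    pos = 0
    for line in text.splitlines(keepends=True):
        if line.startswith("Query="):
            fields = line[len("Query="):].split()
            query_id = fields[0] if fields else ""
            starts.append((query_id, pos))
        pos += len(line)

    # The header is stored once, alongside the offsets, as index metadata.
    header = text[:starts[0][1]] if starts else text

    offsets = {}
    for i, (query_id, start) in enumerate(starts):
        # A record runs until the next 'Query=' line or the end of the text.
        end = starts[i + 1][1] if i + 1 < len(starts) else len(text)
        offsets[query_id] = (start, end)
    return header, offsets


def fetch_report(text, header, offsets, query_id):
    """Return one record with the cached header re-attached in front."""
    start, end = offsets[query_id]
    return header + text[start:end]


# Toy report text, invented for this example (not real BLAST output).
sample = (
    "BLASTP 2.2.22+\n"
    "\n"
    "Query= seq1 first protein\n"
    "hits for seq1\n"
    "\n"
    "Query= seq2 second protein\n"
    "hits for seq2\n"
)

header, offsets = index_blast_text(sample)
report = fetch_report(sample, header, offsets, "seq2")
```

Here the record fetched for seq2 starts with the cached header, so a parser handed that text would still see the algorithm line. A real index would store file offsets rather than slicing a string, and would also need to deal with the statistics block that follows the last query.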
>> This is similar to how XML format is currently dealt with
>> (and another reason this format is the easiest to support,
>> as it doesn't change based on NCBI's whims).
> They may have changed a few things here too - watch out.
>> Do we have example reports with multiple queries from
>> BLAST+ available? It would be invaluable for the projects;
>> if not I can probably generate a few locally.
> I've got one example in Biopython's unit tests,
Okay, I will start working on tests, etc.