[Bioperl-l] Help parsing PSI-BLAST XML reports

Chris Fields cjfields at uiuc.edu
Thu Apr 5 00:14:46 EDT 2007


On Apr 4, 2007, at 8:34 PM, Torsten Seemann wrote:

> Dear all,
>
> I have been migrating all our BLAST infrastructure to use the XML
> output mode, the "blastpgp -m 7" option, referred to as 'blastxml' format
> in Bioperl. I had never used SearchIO to parse a PSI-BLAST XML report
> before, and encountered some issues I hope you can help me with:
>
> 1. When loading with Bio::SearchIO(-format=>'blastxml') I get back a
> Bio::Search::Result::GenericResult object. This means I cannot use
> the PSI-BLAST functions like iterations() and psiblast() provided by
> Bio::Search::Result::BlastResult. I'm guessing this is because the
> XML output reports itself as a plain BLASTP output:
> <BlastOutput_program>blastp</BlastOutput_program>
>
> How do I determine if it is a PSI-BLAST report?

I don't think you can do that easily, though I haven't tried myself.
If I remember correctly, there isn't a substantial difference between
regular BLAST XML output and PSI-BLAST XML output.  We could add a
parameter to the parser to treat the report as PSI-BLAST.
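
As a rough illustration of what such a check could look like at the XML
level (outside BioPerl), here is a minimal Python sketch using only the
standard library.  It assumes that a PSI-BLAST report shows up as more
than one <Iteration> element all describing the same query, which is a
heuristic and not something the DTD guarantees.  The function name
looks_like_psiblast is made up for this example.

```python
import xml.etree.ElementTree as ET


def looks_like_psiblast(xml_text):
    """Heuristic guess: several iterations that all share one query.

    Assumption (not guaranteed by the NCBI DTD): a multi-query report
    carries a different Iteration_query-def per iteration, while a
    PSI-BLAST report repeats the same query or omits the element.
    """
    root = ET.fromstring(xml_text)
    iterations = root.findall("BlastOutput_iterations/Iteration")
    # Collect the distinct per-iteration query descriptions, if present.
    query_defs = {it.findtext("Iteration_query-def") for it in iterations}
    query_defs.discard(None)
    return len(iterations) > 1 and len(query_defs) <= 1
```

A PSI-BLAST run that converged after a single round would not be
detected this way, and a multi-query run with identical deflines would
be misclassified, so an explicit parser parameter is still the more
reliable option.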

> 2. Usually a PSI-BLAST report has multiple Iterations. The XML output
> has <Iteration> tags but it took me a while to figure out that these
> get mapped to Bio::Search::Result objects accessible via
> Bio::SearchIO->next_result().
>
> Is this the proper way to process the iterations?

The problem is in the way that NCBI now outputs multiple-query BLAST
XML reports, which apparently changed sometime in the last year
without notice.  This was also a problem with other Bio* parsers (I
remember seeing something about it on the BioPython list).  Previously,
multi-query BLAST requests were output as single XML reports
concatenated together, each with its own XML declaration, etc.  Now
the queries are treated like iterations (query 1 = iteration 1,
query 2 = iteration 2, etc.), all in one long BLAST report.  There's an
example of one in the SearchIO tests, which I added to CVS in Jan-Feb,
post-1.5.2.  The current parser handles both the old and new cases.
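
To make the two layouts concrete, here is a small Python sketch
(standard library only; it is not how the BioPerl parser does it).  It
splits the input at each XML declaration, which covers the old
concatenated style, and then counts the <Iteration> elements in each
resulting document, which covers the new style.  The helper names
split_reports and iteration_counts are invented for the example.

```python
import re
import xml.etree.ElementTree as ET


def split_reports(text):
    """Split concatenated XML documents at each XML declaration."""
    # Zero-width split: every chunk keeps its own "<?xml" declaration.
    chunks = re.split(r"(?=<\?xml\s)", text)
    return [c for c in chunks if c.strip()]


def iteration_counts(text):
    """Return the number of <Iteration> elements in each document."""
    counts = []
    for chunk in split_reports(text):
        root = ET.fromstring(chunk)
        counts.append(len(root.findall("BlastOutput_iterations/Iteration")))
    return counts
```

With old-style output, two queries give two documents with one
iteration each; with new-style output, they give one document with two
iterations.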

The current behavior of the parser is to parse everything up front,
building up the ResultI objects and then returning them one by one upon
next_result(), which is horrible on memory if you have tons of XML to
wade through.  I will probably change that to carve the data up into
report-sized chunks of XML and parse them piecemeal, but I haven't
had time to work on it yet.
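
For comparison, the piecemeal idea can be sketched in Python with the
standard library's incremental parser.  This is only an illustration of
the general technique under my own assumptions, not the planned BioPerl
change: it handles one <Iteration> at a time and discards its contents
before reading on.  The generator name iter_iterations and the fields
in the returned dict are made up for the example.

```python
import xml.etree.ElementTree as ET


def iter_iterations(source):
    """Yield one summary dict per <Iteration>, one at a time.

    `source` is a filename or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Iteration":
            continue
        yield {
            "iter_num": int(elem.findtext("Iteration_iter-num")),
            "query_def": elem.findtext("Iteration_query-def"),
            "hit_ids": [
                hit.findtext("Hit_id")
                for hit in elem.iterfind("Iteration_hits/Hit")
            ],
        }
        # Drop the children of the finished iteration so hits do not
        # pile up in memory; the emptied element itself stays in the tree.
        elem.clear()
```

Memory use then scales with the largest single iteration and not with
the whole report.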

> 3. I also notice that only the first result (iteration) has the
> query_name set. Subsequent ones are empty:
> RESULT 1 Bio::Search::Result::GenericResult, algorithm= BLASTP,
> query=MyProtein , db=uniprot_sprot
> RESULT 2 Bio::Search::Result::GenericResult, algorithm= BLASTP, query=
> , db=uniprot_sprot
>
> Is this a bug or expected?

If you are using 1.5.2, then there is a bug related to that which was
fixed in CVS a few months back (related to the multi-query issue
above).  If that isn't the cause, let me know.

> I'm guessing a lot of these problems are simply due to limitations of
> the NCBI BLAST XML DTD?
>
> --Torsten

To tell the truth I'm not sure.  One would think they could add some  
designation to the report for PSI-BLAST!

chris

