[Bioperl-l] coloring of HSPs in blast panel
cjfields at uiuc.edu
Wed Nov 28 13:43:58 EST 2007
On Nov 26, 2007, at 7:41 PM, Steve Chervitz wrote:
> Good catch. You're on track here with one exception: WU blast and NCBI
> blast behave differently in what they report in the hit table: WU
> blast puts the raw score in the table not the bit score as NCBI blast
> does (see example below for reference). WU blast also swaps their
> location in the HSP header relative to how NCBI reports it. It would
> be good to verify that the blast parser isn't befuddled by this. A
> quick look at SearchIO::blast and it appears that data from the hit
> table is always getting stored as score, not bits for WU blast. Not
> sure if the HSP section data are parsed correctly. I'd recommend
> looking into these things when you do your fixes.
What I have now after commits is:
GenericHit - use the best HSP when possible for bits, score/raw_score,
significance. When there is no HSP, construct a minimal Hit object
using hit table data (WUBLAST maps the score to raw_score, NCBI BLAST
maps to bits(), both map evalue/pvalue to significance). HSP mapping
seems to be correct.
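
As a rough illustration of the mapping described above, here is a Python sketch (not the actual Perl in GenericHit). The function name, the dict keys, and the `flavor` argument are hypothetical, and "best HSP" is assumed here to mean the one with the highest raw score:

```python
def hit_scores(hsps, table_score, table_sig, flavor):
    """Return bits, raw_score and significance for a hit.

    hsps:        list of dicts with 'bits', 'score', 'significance' keys
                 (hypothetical layout, for illustration only)
    table_score: the score column from the hit table line
    table_sig:   the evalue/pvalue column from the hit table line
    flavor:      'wu' or 'ncbi' (hypothetical label for the report type)
    """
    if hsps:
        # Prefer the best HSP when one is available (assumed: highest raw score).
        best = max(hsps, key=lambda h: h["score"])
        return {
            "bits": best["bits"],
            "raw_score": best["score"],
            "significance": best["significance"],
        }
    if flavor == "wu":
        # WU-BLAST hit tables carry the raw score, so bits stays undefined.
        return {"bits": None, "raw_score": table_score, "significance": table_sig}
    # NCBI BLAST hit tables carry the bit score, so raw_score stays undefined.
    return {"bits": table_score, "raw_score": None, "significance": table_sig}
```

The point of the sketch is only that a minimal hit built from the hit table fills in one of the two score fields, depending on the report flavor.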
One issue that has popped up: GenericHit::significance preferentially
uses the best HSP, but GenericHSP::significance prefers evalues over
pvalues. Both Expect and P now appear to be parsed for WU-BLAST HSPs
(so the evalue is reported); this apparently wasn't always the case, if
I read the GenericHit docs correctly. As NCBI BLAST doesn't report
pvalues, we could change that so it preferentially returns a pvalue if
present, falling back to an evalue. This would match what is found in
the hit table more closely and resemble what is documented for the
method (for significance(), WU-BLAST gets pvalues, NCBI BLAST gets
evalues).
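
The proposed lookup order could be sketched like this, in Python rather than the real Perl; the function and argument names are hypothetical:

```python
def hsp_significance(evalue=None, pvalue=None):
    """Proposed order: return the pvalue if present, else fall back to the evalue."""
    if pvalue is not None:
        return pvalue
    return evalue
```

Under this order a WU-BLAST HSP (which has both values) would report its pvalue, while an NCBI BLAST HSP (evalue only) would report its evalue.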
> So in the end, WU blast HSPs that are built from the hit table should
> report a value for raw_score and punt on bits, but NCBI HSPs so
> constructed should do the opposite. The downside to this arrangement
> is that code that works for NCBI blast hits will need modification to
> work for WU blast hits, but that is just the nature of the data. It
> shouldn't be an issue for the majority of users that stick with one
> flavor of blast and don't switch back and forth, or for users that get
> their HSP scoring data from HSP sections rather than relying on the
> hit table.
In general I get my data from the HSPs, so this shouldn't be a
significant issue (bad pun). I did find that changing Hit objects to
use HSP data exposed issues with the test data: hit table raw/bit
scores were rounded from the HSP score data or vice versa since all
data came from the hit table, so tests flunked.
I think changing the way minimal hit objects report data (particularly
for NCBI BLAST) will lead to a lot of confusion unless we clarify
warnings when one or the other is missing (as you also indicated).
I'm working on that now.
> Ideally, the HSP object would know whether it was NCBI or WU-based and
> issue an informative warning when attempting to access data it doesn't
> have. One solution might be for the parser to put a 'WU-' in front of
> the algorithm name for WU blast reports, so it would then be available
> for the contained hit/hsp objects. This could break anything dependent
> on algorithm name, so it would need some testing.
I can probably work around that as noted above, unless you think it's
warranted to add a 'WU' designation (the version info in the Result
object has 'WashU' attached, so one could feasibly use that to
distinguish the two report types).
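
A minimal sketch of that check, in Python for illustration only. The helper name is hypothetical, and the only thing assumed about the version string is that WU-BLAST reports contain 'WashU' somewhere in it:

```python
def is_wu_blast(version):
    """Guess the report flavor from the version info stored on the result."""
    # Case-insensitive substring test; tolerates a missing version string.
    return "washu" in (version or "").lower()
```

This avoids touching the algorithm name, so anything that depends on the name keeps working.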
Anyway, I'm committing my first batch of fixes; the significance test
will fail for at least a day until I can look into it more.