[Bioperl-l] Bio::SearchIO::hmmer hsp behaviour

Sendu Bala bix at sendu.me.uk
Wed Jun 28 19:00:11 EDT 2006

Chris Fields wrote:
>> Sendu Bala wrote:
>> In any case, this is extremely counter-intuitive, especially given
>> that next_domain is a synonym of next_hsp. I think either the
>> synonym relationship remains and hits have multiple hsps (and there
>> is only one hit per model)

> The model (hit-like) table scores are retained and can be retrieved
> via $model->significance and the individual domain (hsp-like) evalues
> via $model->evalue.

I know, see my earlier post.

> The reason you don't get all the individual domain evalues is that
> only five alignments are returned by default.  You might try changing
> the 'A' parameter to see if you can get more alignments; that may 
> work around the problem of missing domains for now.

[I'm using my own data, not the OP's]
No, I have all the alignments: 'A' isn't a problem. And I can get all
the domains. The problem is I have to check multiple different hits to
find them all.

> You'll note that the Model/Domain results returned are not based on 
> top score but what looks like the position of the domain in the
> sequence (seq-t in the last table); that's what is stated in the
> hmmpfam docs.
> Well, that and SearchIO is set up as a SAX-like parser, so I believe 
> it processes the model-domain alignments as the file is parsed.

Yes, this is the problem. The parser does the obvious thing, but in my 
view it does not do the correct thing.

> Model/domain pairs really aren't Hits/HSPs by definition, like the
> CVS commit from Jason states.  The way Pfam is set up, you actually
> have your query(ies) scanned using a database of Pfam domains (HMM's,
> built from protein alignments for various protein families), hence
> the alignment in the report is not a HSP since HSPs come from
> pairwise sequence alignments.  An HSP is a pair of sequences which,
> when aligned, meet or exceed a maximal cutoff.  The hmmpfam report
> has alignments of the sequence and the consensus for the alignment
> the HMM is based on (not another sequence, so not an HSP).

But this is just semantics. It doesn't /matter/ that its not really 
truly a sequence that's being aligned. The parser needs to present to 
the user the information in the file. As we see in the OP's example, it 
simply fails to do this because the parser isn't model-centric while the 
file it is parsing /is/.

And in any case, your argument doesn't hold because even the current 
parser /does/ store domains in hsp objects! It just only stores one hsp 
per hit, repeatedly, which is nonsensical.

[to avoid confusion, in the following the use of 'model' is in the 
programming sense, whilst 'Model' refers to the things generated by hmmer]

The correct model to describe the file being parsed is one that is able 
provide to the user all the available results for all Models that hit a 
query sequence, even when there are no alignments in the file. To make 
this fit the SearchIO scheme, we must have one hit per Model. The hit 
has hsps which are the domains. This perfectly matches the information 
in the file. It matches something like a Blast, where you have one hit 
per database sequence/query sequence combo.

A hit could end up with no hsps (no domains), but we may not even care. 
Sometimes you really do just want to know if a particular model hit at 
all, and with what evalue/score. The current parsing model isn't 
guaranteed to tell you this even when you can read it yourself in the 
file being parsed.

You can guess at the intent of the original authors, I think, just by 
looking at those method synonyms. next_hit == next_model. next_hsp == 
next_domain. This makes perfect sense. This is the way to correctly 
model the information in the file. The problem is that next_model 
doesn't give you the next Model (because each Model has multiple hits), 
and next_domain doesn't give you the next domain (because each hit only 
has one domain).

> I think the reasoning for keeping single model-domain pairs is that
> you should consider each domain's location in the sequence as well as
> the number of times they appear, regardless of whether they belong to
> the same model or not.  One protein could have three ATP-binding
> domains and another two, and they could be located in different
> positions on the sequence.  But where they are on the sequence in
> relation to other domains and to each other (i.e. positional
> information) is just as important, maybe more so, than how many times
> that domain appears.

Well, that's for the user to decide. But the way the results are 
presented needs to make sense. If blast results came back with all hsps 
listed out in sequence position order, would you have multiple hits per 
database sequence each with one hsp? No, because the meaning is 
completely wrong. The 'hit' is the collection of alignments of a 
particular database sequence hitting a query sequence. The alignments 
are stored in a bunch of hsps. It is absurd to have more than one hit 
object for a database+query sequence combo, because then we have 
multiple hit objects duplicating the exact same information, and 'hit' 
no longer has any meaning - it is a collection of /some/ of the 
alignments? Yet this is exactly what we have with hmmpfam result parsing.

More information about the Bioperl-l mailing list