[Bioperl-l] Bio::SearchIO::hmmer hsp behaviour

Chris Fields cjfields at uiuc.edu
Wed Jun 28 18:14:09 EDT 2006

> Sendu Bala wrote:
> [ from thread Bio::SearchIO - Accessing Model parameters (score, evalue,
> description) ]
> [ concerning hmmpfam output ]
> > I have another problem (or the same one as you? I'm can't tell...) in
> > that I can only get a single result, hit and hsp from my hmmpfam file!
> > It is doing my head in, but I might be doing something wrong so will
> > look into it further before posting a bug report.
> I was just doing something wrong, but...
> Revision 1.27 of Bio::SearchIO::hmmer did 'Change HMMER parser to report
> a single HSP per Hit so domains with multiple alignments get separate
> Hits (more FASTA like) since they aren't really HSPs'
> Strangely 1.25 (Bioperl 1.4) seems to behave like that already.
> In any case, this is extremely counter-intuitive, especially given that
> next_domain is a synonym of next_hsp. I think either the synonym
> relationship remains and hits have multiple hsps (and there is only one
> hit per model), or next_domain goes off and finds the hsp that is the
> next domain of the current model. But that would be incredibly broken in
> the current model since it would be found in a different hit object...
> What hmmpfam does is take a database of models which can be thought of
> as database sequences. Then it aligns each one against your query
> sequences. A model could align in multiple locations along a query
> sequence. Each one of these locations is called a domain of the model. A
> user of hmmpfam is model-centric (wants to know which models are on his
> query), and so you want to know all about how well the model did in one
> go. So you should be able to get the results for a model ($hit =
> $result->next_model), get overall info about it ($hit->score etc.), then
> get more detailed information about each domain of it (while ($hsp =
> $hit->next_domain) {...}). But right now you only get one domain and you
> have to go searching through all your other hits to find a hit with the
> same ->name() as your model of interest to get the next domain of your
> model.
> In my view this is less than ideal. What do people think? Should it be
> changed?

The model (hit-like) table scores are retained and can be retrieved via
$model->significance and the individual domain (hsp-like) evalues via
$model->evalue.  The reason you don't get all the individual domain evalues
is that only five alignments are returned by default.  You might try
changing the 'A' parameter to see if you can get more alignments; that may
work around the problem of missing domains for now.  You'll note that the
Model/Domain results returned are not based on top score but what looks like
the position of the domain in the sequence (seq-t in the last table); that's
what is stated in the hmmpfam docs.  Anyway, I tried this loop with the
reports Selvi sent and it works, but only for the ones that return

my $result_count = 1;
while ( my $result = $searchio->next_result() ) {
  print "Result $result_count : ",$result->query_name,"\n";
  print "Result models: ",$result->num_hits,"\n";
  while (my $model = $result->next_hit) {
	print "\tModel : ",$model->name,"\n";
	print "\tSignif: ",$model->significance,"\n";
	while (my $domain = $model->next_hsp) {
	  print "\t\tDomain : ",$domain->name,"\n";
	  print "\t\tEvalue : ",$domain->evalue,"\n";

>From the HMMER docs: "Say you have a new sequence that, according to a BLAST
analysis, shows a slew of hits to receptor tyrosine kinases. Before you
decide to call your sequence an RTK homologue, you suspiciously recall that
RTK's are, like many proteins, composed of multiple functional domains, and
these domains are often found promiscuously in proteins with a wide variety
of functions. Is your sequence really an RTK? Or is it a novel sequence that
just happens to have a protein kinase catalytic domain or fibronectin type
III domain?"

Model/domain pairs really aren't Hits/HSPs by definition, like the CVS
commit from Jason states.  The way Pfam is set up, you actually have your
query(ies) scanned using a database of Pfam domains (HMM's, built from
protein alignments for various protein families), hence the alignment in the
report is not a HSP since HSPs come from pairwise sequence alignments.  An
HSP is a pair of sequences which, when aligned, meet or exceed a maximal
cutoff.  The hmmpfam report has alignments of the sequence and the consensus
for the alignment the HMM is based on (not another sequence, so not an HSP).
This is also the same reason you can't get alignments from
Bio::Search::HSP::HMMERHSP objects since the model 'sequence' isn't a true
sequence but a consensus of sequences, so it's 'inappropriate' to use that
as an actual alignment.  Bad Bioperl user!  Bad!

I think the reasoning for keeping single model-domain pairs is that you
should consider each domain's location in the sequence as well as the number
of times they appear, regardless of whether they belong to the same model or
not.  One protein could have three ATP-binding domains and another two, and
they could be located in different positions on the sequence.  But where
they are on the sequence in relation to other domains and to each other
(i.e. positional information) is just as important, maybe more so, than how
many times that domain appears.  

Well, that and SearchIO is set up as a SAX-like parser, so I believe it
processes the model-domain alignments as the file is parsed.

My 2c: there should be a way to get all model-domain pairs in the "parsed
for domains" table (which is like a list of HSPs).  Seems the last few w/o
alignments are not retained; this may be the way the parser is set up.  I
would try getting the handler to return just evalues and similar stuff for
those and leave out sequence/alignment info, if that's possible.  Not sure
how this is handled with BLAST reports where there are more hits reported
than alignments...

> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

More information about the Bioperl-l mailing list