[Bioperl-l] Bio::SearchIO::hmmer hsp behaviour

Chris Fields cjfields at uiuc.edu
Thu Jun 29 01:20:10 EDT 2006

> I know, see my earlier post.
> [I'm using my own data, not the OP's]

Sorry, I was typing that one up over a three-hour period in between
experiments, so I didn't go back and check everything before I sent it.
Pretty much the entire file Selvi sent me (and the entire group, grrr) shows
that the domains in the domain table are not completely parsed, and the
number of reported hits correlates with the number of alignments present.
In other words, only five or less hits are reported based on the alignments
and the default max alignments reported per result is five.  I figured out
that it is a bug and plan on submitting it to Bugzilla.

What you are talking about and what Selvi describes are two separate issues.
I dealt with Selvi's for the moment; let's deal with yours.

> > Well, that and SearchIO is set up as a SAX-like parser, so I believe
> Yes, this is the problem. The parser does the obvious thing, but in my
> view it does not do the correct thing.

Yes, and that's your opinion.  To tell the truth I'm quite neutral on this;
I'm trying to reason along the lines the contributors for the module
intended.  The fact of the matter is the parser is set up to do it this way,
and it was set up this way by others (not you or I); modifying it to suit
one's personal wants and needs is not our job here.  I don't have issues
while I'm running it so I really don't see what the problem is, well,
besides the reported bug I found along with Selvi's help.

My view on all this before I quit for the night:

I'm really don't want to get into what I consider nit-picky issues (the
'semantics' you mention; it's a simple difference in opinion and a small one
at that).  We can agree to disagree, whatever.  The issue immediately at
hand, what I consider the most important, is that Selvi has uncovered a bug
with the code, as is.  But I'm going to vent here a bit.  It's late, I'm
tired, and this whole thing irks me.  It irks me a great deal. 

Personally, I don't think right now is the time to think about refactoring
this particular module, esp. since I find it essentially works.  I believe
that energy is better spent elsewhere, such as SeqIO::genbank/swiss/embl for
instance, or refactoring SearchIO::blast etc to use hashes instead of
objects to speed things up.  Or creating something yourself.  Or doing what
you currently are doing (Bio::Map).  In other words, areas where use is
high, code is aging, and refactoring is more productive.

I'll add that I'm not trying to dissuade you from trying to build your own
variation of a SearchIO HMMER parser; by all means go ahead.  The above is
how I feel.  You can build your own parser to do what you want; you can even
base it off the current SearchIO HMMER parser and see if you can set it up
to give you the results you want, using a different handler and so on.  Just
don't break the API or modify the current code based strictly on what your
opinion of how it should work is.  It was probably set up this way for a
particular reason.

According to the SearchIO HOWTO the intent for SearchIO was to 'genericize'
parsing reports with 'similar' styles, like BLAST, FASTA, HMMER, and so on.
The most prevalently parsed reports, by a long stretch, are BLAST reports,
which is what the system is based on: 


So the SearchIO system is based on the >assumption< that these reports can
be divi'd up with the data mapped into categories (Results, Hits, HSPs), so
similar objects should be able to handle them.  Domain data are currently
stored in HSP objects (HMMERHSP), but that's nothing more than a convenient
way to store HMMER report data in my opinion; the alignment matches,
strictly speaking, are not HSP's.  You could rename HMMERHit HMMERModel and
HMMERHsp HMMERDomain, but they would still, if they fit into SearchIO and
used the current event handlers, implement HitI/HSPI by inheriting from
GenericHit/GenericHSP.  Ergo, any easy way you go about it here, HMMERHit
is-a HitI and HMMERHsp is-a HSPI.  You could probably work around it by
building the 'correct' object hierarchy by setting up your own handler and
SearchIO plugin, but that risks changing API.  And, really, if you decide to
go down that path, consider what Jason is talking about when he mentions
using "under-the-hood" hashes.

> A hit could end up with no hsps (no domains), but we may not even care.
> Sometimes you really do just want to know if a particular model hit at
> all, and with what evalue/score. The current parsing model isn't
> guaranteed to tell you this even when you can read it yourself in the
> file being parsed.

For every model (hit) you should have a corresponding domain (HSP) or more
depending on your view of how the parser works, even if the domain (HSP) is
only present in the table and not in an alignment.  You shouldn't have
models w/o domains from your query (hits w/o hsps); that doesn't make any
sense.  If hmmpfam output has this then it's a serious issue, but, again,
that doesn't make sense.  All that information is in the tables in the
hmmpfam output; you can even build objects w/o alignments present (-A0)
straight from the tables.

If you wanted to know whether a particular model hit at all, grab all the
model objects ($result->models) and run through them to see if your expected
model (Annexin, Phosphoribosyl, or whatever) is there using a map/grep
block, regex, or whatever; you could autovivicate a hash or similar data
structure indicating that a particular sequence has x domains of y type.  Or
iterate through them like you would for a BLAST report.  I don't see what's
difficult about this; I do it for BLAST sequences, SeqFeatures, and many
other BioPerl objects all the time!  Yes, it can be slow; that's an issue
with object instantiation and Perl and there is no easy way around it
besides refactoring the SearchIO parsers/eventhandlers to send back hashes,
as Jason has suggested.

> You can guess at the intent of the original authors, I think, just by
> looking at those method synonyms. next_hit == next_model. next_hsp ==
> next_domain. This makes perfect sense. This is the way to correctly
> model the information in the file. The problem is that next_model
> doesn't give you the next Model (because each Model has multiple hits),
> and next_domain doesn't give you the next domain (because each hit only
> has one domain).


> Well, that's for the user to decide. But the way the results are
> presented needs to make sense. If blast results came back with all hsps
> listed out in sequence position order, would you have multiple hits per
> database sequence each with one hsp? No, because the meaning is
> completely wrong. The 'hit' is the collection of alignments of a
> particular database sequence hitting a query sequence. The alignments
> are stored in a bunch of hsps. It is absurd to have more than one hit
> object for a database+query sequence combo, because then we have
> multiple hit objects duplicating the exact same information, and 'hit'
> no longer has any meaning - it is a collection of /some/ of the
> alignments? Yet this is exactly what we have with hmmpfam result parsing.

The problem is that the module is geared to parse the output as simply as
possible, so it does it by sequence order, just like the output.  And, as
is, it makes sense to me why Eddy and Co. set it that way, not that I
completely agree with it.  Hmmpfam output is designed for annotating
sequences using Pfam HMM's, so the results are hard-coded to appear in
sequence order, not based on score or evalue.  That's the way it is; not
necessarily the best way IMHO (I would have a way to sort by evalue or model
myself as an option), but it's the only way that's currently available.
Yes, each Model can match more than one domain on a query sequence.  Again,
that this is the 'correct way' to set up this parser is your opinion; if you
want, design your own SearchIO parser.  Like I said, I don't have a problem
with using this module myself.  And I'm a bit reticent to spend the energy
overhaulin' this module when I could spend my time working on something else
I consider more constructive (or destructive, depending on your view).  

And, frankly, it's not up to the user when using code they didn't create.
You have to deal with it.  Or code something yourself to do things the way
you want.  You have the power to do that; most bioperl users don't simply
b/c they probably don't understand the class structure and OO nature of
Bioperl.  It's just a matter of where you want to spend your energy: dealing
with something that interests you or fixing other's people's broken code.


More information about the Bioperl-l mailing list