[Bioperl-l] Bio::Tools::HMMER refactoring

Hilmar Lapp hlapp@gmx.net
Sun, 17 Dec 2000 22:37:55 -0800

Ewan Birney wrote:
> in the Bio::Tools::HMMER::Domain object because it needs the feature2
> object created. SimilarityPair seems to make sure feature1 is created -
> why not do the same thing for feature2? (hilmar?)

There's no problem in changing SimilarityPair such that the existence of
feature2 is also ensured. The reason I didn't implement it initially is
that I tend to safeguard only those things I'm sure need it (i.e., don't
secure yourself from your own bugs popping

> I don't 100% understand what I should be doing with SeqFeatureAnalysisI
> here. I have to implement parse

I'm almost sure you don't want to implement this. I still need to
discuss with Jason whether Bio::SeqFeatureAnalysis suffices in the core
as implementing class, but probably it does.

A parser almost certainly doesn't want to implement SeqFeatureAnalysis,
but if the result of its parsing is SeqFeatureI objects (as is the case
for Tools::HMMER), it probably should try to implement
Bio::SeqAnalysisParserI. So, that's probably what you should implement.
The easiest way to do so might be inheriting from
Bio::Tools::AnalysisParser. SeqAnalysisParserI requires a method
parse(), too, but Tools::AnalysisParser already does most of the job
here. Check its documentation; it's not that poor, as I just realized ...

> Is it ok to
>    (a) assume that -input is always a filehandle (ie, I can go <$input>)?
>    (b) ignore everything else?

Neither is ok. A good starting point to understand what's required is
probably Bio::Tools::AnalysisParser::parse(), as mentioned above.

> Then I have to implement next_feature (surely next_seq_feature or
> next_SeqFeature would have been a better name....)
> I really want to implement next_feature on what I return from parse()
> because in HMMER, I need to read the whole damn file before I can return a
> properly parsed seqfeature (don't ask...)

That's not a problem (unless reading the whole file is a problem). I
realize that in fact you were talking about SeqAnalysisParserI ... 

The return type of parse() is really void. One purpose of the method is
to reset the state of the parser, so that multiple inputs can be
specified one after another. That's exactly what AnalysisParser::parse()
primarily does, together with _initialize_state(), so most likely you
want to override this method if you decide to inherit from
Tools::AnalysisParser; see Tools::Genscan.pm as an example.
next_feature() is required to return one feature at a time, but these
can obviously be taken from an array built on parsing the file.
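
A minimal sketch of that contract, in plain Perl with no BioPerl
dependency. Everything in it is an assumption made for illustration: the
class SketchParser, the helper _parse_all(), the one-line-per-feature
input format, and the plain hashes standing in for SeqFeatureI objects.
It is not the API of Bio::Tools::AnalysisParser. Here parse() only
records the input and resets state, and the whole input is read into an
array on the first call to next_feature().

```perl
use strict;
use warnings;

# Hypothetical stand-in class; NOT Bio::Tools::AnalysisParser.
package SketchParser;

sub new {
    my ($class) = @_;
    return bless { fh => undef, features => undef }, $class;
}

# parse() returns nothing useful. It records the input and resets state,
# so one parser object can be reused for several inputs.
sub parse {
    my ($self, %args) = @_;
    $self->{fh}       = $args{-input};   # assumed here to be a filehandle
    $self->{features} = undef;           # forget features from earlier input
    return;
}

# Read the whole input once. Each non-blank line "name start end" becomes
# a plain hash (an assumed stand-in for a real feature object).
sub _parse_all {
    my ($self) = @_;
    my $fh = $self->{fh};
    my @features;
    while (my $line = <$fh>) {
        next unless $line =~ /\S/;
        my ($name, $start, $end) = split ' ', $line;
        push @features, { name => $name, start => $start, end => $end };
    }
    $self->{features} = \@features;
    return;
}

# Hand out one feature per call; undef once the array is exhausted.
sub next_feature {
    my ($self) = @_;
    die "call parse() first\n" unless defined $self->{fh};
    $self->_parse_all unless defined $self->{features};
    return shift @{ $self->{features} };
}

package main;

# Two in-memory inputs fed to the same parser object in turn.
my @inputs = ("domA 1 50\ndomB 60 120\n", "domC 5 30\n");
my $parser = SketchParser->new;
for my $text (@inputs) {
    open my $fh, '<', \$text or die "cannot open in-memory input: $!";
    $parser->parse(-input => $fh);
    while (my $feat = $parser->next_feature) {
        print "$feat->{name}: $feat->{start}-$feat->{end}\n";
    }
}
```

Parsing could just as well be done inside parse() itself; the caller
sees the same behaviour either way, because next_feature() may only be
called after parse().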

It is up to the implementor when the file is parsed. You could do it in
your implementation of parse(), since the user is required to call that
method before being able to retrieve features by calling next_feature().
The classes I wrote follow Tools::AnalysisParser, meaning that parse()
mainly re-initializes, and every call to next_feature() parses the next
chunk of data from input. In Genscan.pm the first call to next_feature
triggers parsing of the whole prediction section (but not the predicted

> (second issue is that I really have a Set of SimilarityPair objects, but
> that also is another matter).

If they are really somewhat independent pairs, you can return one at a
time when next_feature() is called. If they instead make up one feature,
they should probably be encapsulated anyway. Maybe I don't

> I am not 100% on this interface. Who uses it and is this the best way to
> do things here?

If you're talking about SeqAnalysisParserI, it is presently implemented
by Tools::AnalysisParser and therefore by all classes inheriting from
it: Genscan.pm, MZEF.pm, and ESTScan.pm. BPlite.pm should be migrated to
it, too.

The whole idea Jason and I had in mind for SeqAnalysisParserI and
SeqFeatureProducerI is the ability to implement very generic programs
for annotating sequences with features. The scope is methods and parsers
that really produce something fitting SeqFeatureI. The concept is that a
generic program that has a sequence and a parser object implementing
SeqAnalysisParserI can obtain features from the parser and add these to
the sequence object, which can then, for instance, be submitted to a
module making the annotations persistent in a database.

SeqFeatureProducerI is the driver part, similar to the SeqIO system:
given a method (by name), it returns an object implementing
SeqAnalysisParserI. So, for implementing SeqFeatureProducerI there are
two mechanisms we can follow: for each new parser module, add code to a
single driver Bio::SeqFeatureProducer to make it recognize the module,
or add a simple module named after the method (à la SeqIO). Presently
Jason suggests following the first approach for simplicity, and I tend
to agree. The overall point is that you do not have to change or add
anything to your generic program, and it would still accommodate any new
method. You just specify input and method (and update bioperl :-).
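
The first approach (one driver that knows every parser) could look
roughly like the sketch below. All names in it are hypothetical and
chosen for illustration only: ToyParser, ToyParser::Tab,
ToyParser::Dash, ToyProducer, get_parser(), the -method argument and
annotate() are not the Bio::SeqFeatureProducer API, sequences and
features are plain hashes, and -input is an array of lines just to keep
the example short.

```perl
use strict;
use warnings;

# Toy base class with the parse()/next_feature() contract (hypothetical).
package ToyParser;
sub new { my ($class) = @_; return bless { queue => [] }, $class }
sub parse {
    my ($self, %args) = @_;
    # -input is an array ref of lines here, purely to keep the sketch short
    $self->{queue} = [ map { $self->_line_to_feature($_) } @{ $args{-input} } ];
    return;
}
sub next_feature { my ($self) = @_; return shift @{ $self->{queue} } }

# Two toy "methods", each with its own input format.
package ToyParser::Tab;
our @ISA = ('ToyParser');
sub _line_to_feature {
    my ($self, $line) = @_;
    my ($start, $end) = split /\t/, $line;
    return { start => $start, end => $end, source => 'tab' };
}

package ToyParser::Dash;
our @ISA = ('ToyParser');
sub _line_to_feature {
    my ($self, $line) = @_;
    my ($start, $end) = split /-/, $line;
    return { start => $start, end => $end, source => 'dash' };
}

# The single driver: the only place that maps a method name to a class.
package ToyProducer;
my %PARSER_FOR = (tab => 'ToyParser::Tab', dash => 'ToyParser::Dash');
sub get_parser {
    my ($class, %args) = @_;
    my $parser_class = $PARSER_FOR{ lc $args{-method} }
        or die "unknown method '$args{-method}'\n";
    my $parser = $parser_class->new;
    $parser->parse(-input => $args{-input});
    return $parser;
}

package main;

# The generic program: it never names a concrete parser class.
sub annotate {
    my ($seq, $method, $input) = @_;
    my $parser = ToyProducer->get_parser(-method => $method, -input => $input);
    while (my $feat = $parser->next_feature) {
        push @{ $seq->{features} }, $feat;
    }
    return $seq;
}

my $seq = { id => 'seq1', features => [] };
annotate($seq, 'tab',  ["1\t10", "20\t30"]);
annotate($seq, 'dash', ["40-55"]);
print scalar(@{ $seq->{features} }), " features on $seq->{id}\n";
```

Supporting a new method then means adding one parser class and one entry
in the driver's table; annotate() stays untouched.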

I realize that SeqFeatureProducer doesn't exactly follow what I just
said ... :o| Jason and I need a few more thoughts here I guess.


Hilmar Lapp                                email: hlapp@gmx.net
GNF, San Diego, Ca. 92122                  phone: +1 858 812 1757