[Bioperl-l] Hilmar and Ewan debate SeqFeatures some more...

Ewan Birney birney@ebi.ac.uk
Fri, 19 Jan 2001 08:45:58 +0000 (GMT)

Ok. Hilmar and I are now probably into the "code aesthetics" part of this
debate, which is definitely worth having, but someone, sometime, has to make
a decision.

I suggest that we keep bashing this out on the list for a couple more days
(please... other people... if you have a view, do chip in). If Hilmar and
I are still disagreeing on aesthetics I would like to nominate Jason to
tie-break on the way to go (is this ok with you Hilmar and Jason...?)

We have two points of contention:

(a) Explicit Location objects or not.

Hilmar suggests an explicit location object

   SeqFeatureI has-a LocationI

   LocationI is subclassed for Split (join statements) and Fuzzies

Benefits - (a) easy to mix and match implementations of locations to
different feature objects, and (b) if mixing and matching locations to
features is common, more realistic. Hilmar argues that is clearer as

Against - more objects, and in fact the majority of seqfeatures are little
more than the location and two extra strings.

For backwards compatibility, I think SeqFeatureI->start would *have* to be
delegated to SeqFeatureI->location->start - otherwise too much code will
break... (of course, this delegation could just be for a while as we move
code and people over to using "proper" locations)
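
To make the delegation idea concrete, here is a minimal sketch of a
feature that has-a location and forwards ->start/->end to it. All the
package names (Sketch::SimpleLocation, Sketch::Feature) are made up for
illustration and are not the real Bioperl classes or interfaces.

```perl
use strict;
use warnings;

# Hypothetical classes for illustration only; not the real Bioperl API.
{
    package Sketch::SimpleLocation;
    sub new {
        my ($class, %args) = @_;
        return bless { start => $args{start}, end => $args{end} }, $class;
    }
    sub start { $_[0]{start} }
    sub end   { $_[0]{end} }
}

{
    package Sketch::Feature;
    sub new {
        my ($class, %args) = @_;
        return bless {
            location    => $args{location},
            primary_tag => $args{primary_tag},
        }, $class;
    }
    sub location { $_[0]{location} }

    # Backwards-compatible accessors: old client code keeps calling
    # $feat->start, which is forwarded to the location object.
    sub start { $_[0]->location->start }
    sub end   { $_[0]->location->end }
}

my $feat = Sketch::Feature->new(
    primary_tag => 'exon',
    location    => Sketch::SimpleLocation->new(start => 10, end => 42),
);

# Old-style and new-style access give the same answer.
print $feat->start, "..", $feat->end, "\n";
print $feat->location->start, "..", $feat->location->end, "\n";
```

Old clients never see the location object, while new clients can ask
for ->location and work with it directly.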

People might be interested that I originally argued for an explicit
location object about 1 month ago. I don't now... 

I am suggesting that SeqFeatures do not have an explicit location object,
but we subclass SeqFeatures into Split, Simple and Fuzzy, all inheriting
from a common SeqFeature interface.

Benefits - (a) fewer objects, (b) only one place where the client gets the
information and (c) more backwardly compatible.

Effectively my main argument is that there will always be a pretty clear
cut relationship that "this type of SeqFeature" is always "this class of
location" so the splitting of the location away from the SeqFeature is
just suggesting a mix-and-match world which doesn't actually exist.
Simpler and stronger to go for the combined interface in my view.
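
As a rough sketch of the combined-interface alternative, the location
data can live inside each feature subclass, with every subclass
answering the same ->start/->end calls. The package names and the
sub_ranges representation (an array ref of [start, end] pairs) are
assumptions made up for this example, not existing Bioperl code.

```perl
use strict;
use warnings;

# Hypothetical classes for illustration only; not the real Bioperl API.
{
    package Sketch::FeatureBase;    # stands in for the common interface
    sub new {
        my ($class, %args) = @_;
        return bless { %args }, $class;
    }
    sub primary_tag { $_[0]{primary_tag} }
    sub start { die ref($_[0]) . " must implement start\n" }
    sub end   { die ref($_[0]) . " must implement end\n" }
}

{
    package Sketch::SimpleFeature;
    our @ISA = ('Sketch::FeatureBase');
    sub start { $_[0]{start} }
    sub end   { $_[0]{end} }
}

{
    package Sketch::SplitFeature;
    our @ISA = ('Sketch::FeatureBase');

    # Assumed representation: array ref of [start, end] pairs.
    sub sub_ranges { @{ $_[0]{sub_ranges} } }

    # Outermost edges of all the sub-ranges.
    sub start {
        my ($self) = @_;
        my ($min) = sort { $a <=> $b } map { $_->[0] } $self->sub_ranges;
        return $min;
    }
    sub end {
        my ($self) = @_;
        my ($max) = sort { $b <=> $a } map { $_->[1] } $self->sub_ranges;
        return $max;
    }
}

my @features = (
    Sketch::SimpleFeature->new(primary_tag => 'exon', start => 10, end => 42),
    Sketch::SplitFeature->new(
        primary_tag => 'CDS',
        sub_ranges  => [ [ 10, 20 ], [ 30, 45 ] ],
    ),
);

# The client treats every feature the same way.
printf "%s %d..%d\n", $_->primary_tag, $_->start, $_->end for @features;
```

The client loop needs no knowledge of which kind of feature it holds;
only a client that cares calls ->sub_ranges on the split one.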

(b) ->start ->end throwing exceptions or not.

Hilmar says that for at least Fuzzies and possibly Splits the client
should figure out by rooting around the object how to map these more
complex locations to a simple start,end. The interface should allow
exceptions to be thrown on ->start/->end indicating that the client should
be treating this seqfeature somehow differently...

Basically we pass the buck to the client.
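
A minimal sketch of what this looks like from the client side, assuming
a fuzzy feature whose ->start dies and which exposes min_start/max_start
accessors. Those accessor names, the package names and the fallback
policy are all hypothetical, chosen only to illustrate the pattern.

```perl
use strict;
use warnings;

# Hypothetical classes for illustration only; not the real Bioperl API.
{
    package Sketch::HardFeature;
    sub new   { my ($class, %args) = @_; return bless { %args }, $class }
    sub start { $_[0]{start} }
}

{
    package Sketch::FuzzyFeature;
    sub new       { my ($class, %args) = @_; return bless { %args }, $class }
    sub min_start { $_[0]{min_start} }
    sub max_start { $_[0]{max_start} }

    # Under this proposal the simple accessor refuses to guess.
    sub start { die "start is not well defined for a fuzzy feature\n" }
}

# Client-side policy: try the simple accessor first, and fall back to
# the smallest possible start if the feature throws.
sub client_start {
    my ($feat) = @_;
    my $start;
    my $ok = eval { $start = $feat->start; 1 };
    return $start if $ok;
    return $feat->min_start;
}

my @features = (
    Sketch::HardFeature->new(start => 10),
    Sketch::FuzzyFeature->new(min_start => 102, max_start => 110),
);
print client_start($_), "\n" for @features;
```

Every client that wants to survive a fuzzy feature needs some wrapper
like client_start, and each client is free to pick a different fallback.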

I say that the implementation objects have to provide a default mapping
of whatever ->start and ->end are. This means that clients can live in
this happy world of "I have well defined start/ends" if they so wish
without writing extra code. Smart clients are encouraged to root around in
the objects for their "real" interpretation of the fuzziness.

There are three reasons why I favour this:

   (a) Clients for dumping/drawing/manipulation have to treat large
numbers of sequence features as a pretty homogeneous mass. If we make
seqfeatures less homogeneous then every client is going to have to figure
out how to "homogenize" the seqfeatures - this will be different client to
client although for the main case they just want a "default way" of
handling them. We are encouraging a diversity of views when our clients
really want us to solve the problems for them.

   (b) as 99% of features are nice, well behaved "hard features", many
pieces of client code written with the bioperl libraries will just assume
->start,->end do not throw exceptions. When this piece of code is used by
another user with a fuzzy feature, there will be a rather deep exception
thrown by bioperl through the client code. I think both the user and the
client will, with some justification, blame bioperl for this, no matter how
much we say "you should have read the documentation and written 3
different subroutines". With exceptions, every time you go

   if( $one->start == $two->start ) 

gets replaced by

   if( &my_exact_function($one,$two) ) {



sub my_exact_function {
   my ($one, $two) = @_;

   # one of many if statements...

   if( $one->isa('Bio::FuzzyFeatureI') &&
       $two->isa('Bio::SimpleFeatureI') ) {



   (c) long experience with seqfeatures has made me claim that the
following rules are generally just what people want:

    - simple features - easy

    - join statements - ignore leading and trailing '<' '>' and take the
edge start/end points on the sequence you are looking at

    - fuzzy features - either skip or - if you have to draw/compare them,
take start/end as the min hard location mentioned and the maximum hard
location mentioned, regardless of the internal grammar.

I reckon bioperl would do better to implement the rules in (c) by default,
without preventing smart clients from making their own decisions.
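
The rules in (c) can be sketched in a few lines if we assume the
location arrives as a feature-table style string such as
"join(<10..20,30..>45)". The helper name default_bounds and the string
input are assumptions for illustration. The sketch simply collects every
number mentioned and takes the smallest and largest, so it would give
wrong answers for locations that reference another sequence by
accession.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: map any location string to a default
# (start, end) pair by taking the minimum and maximum hard position
# mentioned, ignoring '<', '>' and the rest of the grammar.
sub default_bounds {
    my ($location) = @_;
    my @positions = $location =~ /(\d+)/g;
    die "no positions found in '$location'\n" unless @positions;
    return ( min(@positions), max(@positions) );
}

for my $loc ( '10..42', 'join(<10..20,30..>45)', '(102.110)..(200.210)' ) {
    my ( $start, $end ) = default_bounds($loc);
    print "$loc => $start..$end\n";
}
```

A smart client can still parse the string (or inspect the object) itself
when the default answer is not good enough.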

Another long email, but I think it is worth knowing where we disagree...

Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420