[Bioperl-l] split seq feature and fuzzy feature proposal

Hilmar Lapp hlapp@gmx.net
Thu, 18 Jan 2001 12:34:24 -0800

Jason Stajich wrote:
> What you suggest above could be done as:
> Bio::SeqFeatureI ISA RangeI
> method : location
> desc   : Get/Set method
> args   : LocationI object
> returns: LocationI object
> method : start()
> desc   : start location of seqfeature
> sub start {
>         my($self) = @_;
>         return $self->location->start()
> }

Note that, as one of the few noticeable changes in the SeqFeatureI
API, this call should be allowed to throw an exception if
	1) the start location is uncertain
	2) the start location does not refer to the attached seq
	   (to be disputed)
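
A minimal sketch of what a throwing start() could look like. The
package names My::Location and My::Feature and the accessors seq_id()
and start_known() are made up for illustration; they are not existing
BioPerl API, just one possible shape under those assumptions.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a LocationI implementation.
package My::Location;
sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}
sub start       { $_[0]{start} }
sub seq_id      { $_[0]{seq_id} }        # name of the seq the location lies on
sub start_known { $_[0]{start_known} }   # false if the start is uncertain

# Hypothetical feature that delegates start() to its location.
package My::Feature;
use Carp qw(croak);
sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}
sub location { $_[0]{location} }
sub seqname  { $_[0]{seqname} }
sub start {
    my ($self) = @_;
    my $loc = $self->location;
    # Case 1: the start location is uncertain.
    croak "start location is uncertain" unless $loc->start_known;
    # Case 2 (the disputed one): the start lies on another sequence.
    croak "start location refers to another sequence"
        unless $loc->seq_id eq $self->seqname;
    return $loc->start;
}

package main;

my $feat = My::Feature->new(
    seqname  => 'AB000001',
    location => My::Location->new(
        start => 10, seq_id => 'AB000001', start_known => 1,
    ),
);
print $feat->start, "\n";
```

Callers that cannot rule out either case would wrap the call in
eval { } and fall back to whatever the location object itself offers.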

> ... similar for end ...
> Bio::LocationI ISA RangeI
> Bio::SplitLocationI ISA Bio::LocationI
> method: sub_SeqFeatures()
> desc  : method for obtaining list of sub Locations - they could be
>         SeqFeature::Exons, SeqFeature::Generic, or LocationI's?
> returns: list of LocationI or SeqFeatureI objects?

Yeah, that's the really hairy case. We probably should define
first what we would like to be able to do with compound locations.
This is a strong call for feedback: what do people out there using
the package intend to do with compound locations? E.g. if you draw
annotations, would you just draw the part referring to the
attached seq? Ensembl people, any experience/wishlists for this?

An obvious requirement is the ability to recover the original
GenEmbl location string, so all the information necessary should
be present.

A compound location is indeed something of a hybrid between a location
and a feature, because a sublocation clearly only makes sense if
you also know the sequence it refers to. The sequence can be
identified by its name (but then which name? the name in the
location line as given in GenBank?), or by an object reference?
The latter can be very expensive, because the sequence can be
quite long, and if there are many such sublocations, you
quickly eat up your memory. You could also construct the seq
object as sort of a dummy, without really holding the seq string.
Not really convincing. So why not the simple case: a
CompoundLocation has a method sub_Locations(). Each sublocation
has a method seqname() (or seq_id() or whatever you prefer), which
returns the same string as $feature->seqname() for subfeatures
lying on the same seq, and a different name for those referring to
other seqs. $feature->seq() for features with a compound location
throws an exception, unless all sublocations are on the same
(attached) sequence.
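
To make the simple case concrete, here is a rough sketch. All three
package names (My::SubLocation, My::CompoundLocation,
My::CompoundFeature) are hypothetical, and the accession strings are
only example data; the sketch assumes each sublocation carries nothing
but a range and the name of the seq it lies on.

```perl
use strict;
use warnings;

# Hypothetical sublocation: a range plus the name of the seq it refers to.
package My::SubLocation;
sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}
sub start   { $_[0]{start} }
sub end     { $_[0]{end} }
sub seqname { $_[0]{seqname} }

# Hypothetical compound location holding an ordered list of sublocations.
package My::CompoundLocation;
sub new {
    my ($class, @subs) = @_;
    return bless { subs => [@subs] }, $class;
}
sub sub_Locations { @{ $_[0]{subs} } }

# Hypothetical feature; seq() throws if any sublocation is on another seq.
package My::CompoundFeature;
use Carp qw(croak);
sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}
sub seqname  { $_[0]{seqname} }
sub location { $_[0]{location} }
sub seq {
    my ($self) = @_;
    for my $sub ($self->location->sub_Locations) {
        croak "sublocation refers to other seq " . $sub->seqname
            if $sub->seqname ne $self->seqname;
    }
    return $self->{seq};
}

package main;

my $loc = My::CompoundLocation->new(
    My::SubLocation->new(start => 1,   end => 100, seqname => 'AB000001'),
    My::SubLocation->new(start => 100, end => 202, seqname => 'J00194'),
);
my $feature = My::CompoundFeature->new(
    seqname => 'AB000001', seq => 'ACGT', location => $loc,
);

# Pick only the sublocations on the attached seq, e.g. for drawing.
my @local = grep { $_->seqname eq $feature->seqname } $loc->sub_Locations;
print scalar(@local), " sublocation(s) on ", $feature->seqname, "\n";

# seq() throws here because one sublocation is on another sequence.
my $seq = eval { $feature->seq };
print "seq() failed: $@" unless defined $seq;
```

Only names are stored for remote sequences, so no sequence strings
beyond the attached one need to be held in memory.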

Too simple?

> Bio::FuzzyLocationI ISA Bio::LocationI
> method: get_embl_fuzzy_string()
> desc  : possible method to return location as an embl string for a fuzzy
>        location
> returns: string

min_start()/max_start() etc. should also be included. In an
implementation, start() and end() are overridden and throw
exceptions depending on which end is uncertain (or at least they
should be expected to throw exceptions). An end is certain if
min_start() == max_start() (or min_end() == max_end(), resp.).
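
As a sketch of that contract: My::FuzzyLocation below is a
hypothetical class, not the actual Bio::FuzzyLocationI, and it assumes
both bounds of each end are always given as numbers (open-ended cases
are left out).

```perl
use strict;
use warnings;

# Hypothetical fuzzy location with explicit min/max bounds for each end.
package My::FuzzyLocation;
use Carp qw(croak);

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}
sub min_start { $_[0]{min_start} }
sub max_start { $_[0]{max_start} }
sub min_end   { $_[0]{min_end} }
sub max_end   { $_[0]{max_end} }

# An end is certain exactly when its min and max bounds coincide.
sub start {
    my ($self) = @_;
    croak "start is uncertain" unless $self->min_start == $self->max_start;
    return $self->min_start;
}
sub end {
    my ($self) = @_;
    croak "end is uncertain" unless $self->min_end == $self->max_end;
    return $self->min_end;
}

package main;

# Start lies somewhere in 5..10; end is exactly 200.
my $fuzzy = My::FuzzyLocation->new(
    min_start => 5, max_start => 10, min_end => 200, max_end => 200,
);
print "end: ", $fuzzy->end, "\n";

# start() throws, so fall back to the bounds.
my $start = eval { $fuzzy->start };
print "start is uncertain, somewhere in ",
    $fuzzy->min_start, "..", $fuzzy->max_start, "\n"
    unless defined $start;
```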

> Does this seem more agreeable - location is decoupled from SeqFeature, but
> we have to support backwards compatibility with SeqFeatureI ISA RangeI
> which means all SeqFeatures have a start/end...

I indeed like the decoupled approach much better.

Hilmar Lapp                                email: hlapp@gmx.net
GNF, San Diego, Ca. 92122                  phone: +1 858 812 1757