[Bioperl-l] Gene Interface discussion

Hilmar Lapp hlapp@gmx.net
Sun, 04 Feb 2001 23:25:44 -0800

Apart from BioCorba 0.2, this is the last big issue on the task
pile for 0.7. To refresh your background, Ewan and I discussed
this in a phone call in December last year, and after that Ewan
summarized the results, and added some things he dreamt up :-)

You can find Ewan's full proposal at

I cross-posted this again to ensembl-dev basically to make the
folks there aware that the issue is being taken on again.
Responses from bioperl folks should probably NOT be cross-posted
(but decide for yourself) ...

Ewan Birney wrote:
>   All interfaces in the Bio::SeqFeature:: namespace

There are 3 of them, so together with the implementations
Bio::SeqFeature may become a bit crowded. What do you think about
Bio::SeqFeature::Gene, or directly Bio::Gene?

I don't have a strong opinion here, though.

>   GeneStructureI - inherits from SeqFeatureI
>   (inherited methods, start,end,strand,seq,entire_seq,seqname,primary_tag,source_tag,
>    is_single_sequence, sub_SeqFeatures);
>   Notes: sub_SeqFeatures must delegate to ->transcripts.
>        : primary_tag must be 'genestructure'

And what about promotors() and poly_adenylation_sites()? These are
subfeatures, too. So, sub_SeqFeature() should rather merge them
all together, shouldn't it?
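
A minimal sketch of the merging idea, purely for illustration. The
class My::GeneStructure is hypothetical and is not BioPerl code; the
features are plain strings standing in for real feature objects, and
only the method names are taken from the proposal.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a GeneStructureI implementation.
package My::GeneStructure {
    sub new {
        my ($class, %args) = @_;
        my $self = {
            transcripts => $args{transcripts} || [],
            promotors   => $args{promotors}   || [],
            polya       => $args{polya}       || [],
        };
        return bless $self, $class;
    }

    # Each accessor returns a (possibly empty) list.
    sub transcripts { return @{ $_[0]->{transcripts} } }
    sub promotors   { return @{ $_[0]->{promotors} } }
    sub polya       { return @{ $_[0]->{polya} } }

    # Merge every kind of child feature instead of delegating
    # to transcripts() only.
    sub sub_SeqFeature {
        my ($self) = @_;
        return ($self->transcripts, $self->promotors, $self->polya);
    }
}

package main;

my $gs = My::GeneStructure->new(
    transcripts => ['transcript1', 'transcript2'],
    promotors   => ['promotor1'],
);
my @features = $gs->sub_SeqFeature;
print scalar(@features), " subfeatures\n";
```

With this layout a caller walking sub_SeqFeature() sees promoters
and poly-A sites as well as transcripts.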

> # GeneStructureI must implement this, even if it returns an empty list
> @promotors = $gs->promotors(); # could be empty
> @polya     = $gs->polya(); # could be empty

So how do these differ from the corresponding methods of
TranscriptI? Do they delegate to the first element of the array
returned by $gs->transcripts()?

>   TranscriptI - inherits from SeqFeatureI
>   (inherited methods, start,end,strand,seq,entire_seq,seqname,primary_tag,source_tag,
>    is_single_sequence, sub_SeqFeatures);
>   Transcript must have the following two methods
>   $transcript->cdna();    # returns a Bio::PrimarySeqI of the cDNA
>   $transcript->protein(); # returns a Bio::PrimarySeqI of the protein

protein() I think is trivial unless it's a predicted transcript
(and TranscriptI is not specific to predicted transcripts). What
is the particular reason to require it in the interface?

>    ExonI - inherits from SeqFeatureI, cannot be composite,
>            primary_tag must return one of 'exon' or 'cds' or 'utr'

What do you mean by 'cannot be composite'? The interface cannot forbid
it. Should the implementation refuse subfeatures not lying on the
same sequence and refuse a SplitLocation spread across more than
one sequence (or SplitLocations in general)?
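
One way an implementation (rather than the interface) could enforce
this is sketched below. My::SimpleExon is a hypothetical class, not
BioPerl code, and the method names sub_SeqFeature and
add_sub_SeqFeature are assumed here for illustration. The sketch
only covers refusing subfeatures; it does not address
SplitLocations.

```perl
use strict;
use warnings;

# Hypothetical exon implementation that refuses to become composite.
package My::SimpleExon {
    use Carp qw(croak);

    sub new {
        my ($class, %args) = @_;
        my $self = { start => $args{start}, end => $args{end} };
        return bless $self, $class;
    }

    # An exon never has subfeatures, so always return an empty list.
    sub sub_SeqFeature { return () }

    # Any attempt to attach a subfeature is rejected at run time.
    sub add_sub_SeqFeature {
        croak "an exon cannot be composite; subfeatures are not accepted";
    }
}

package main;

my $exon = My::SimpleExon->new(start => 10, end => 50);
my $ok   = eval { $exon->add_sub_SeqFeature('anything'); 1 };
my $err  = $@;
print "rejected: $err" unless $ok;
```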

> To Do list:
>    (a) discuss this proposal. Sane? Any more issues to be worked out?
>    I am not 100% on the exons('argument') style call.

I think that's fine. Otherwise you end up with methods for each
type of exon (initial, terminal, internal, ...).
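
To illustrate the trade-off, here is a minimal sketch of the
argument-style call. My::Exon and My::Transcript are hypothetical
classes, not BioPerl code, and the type() accessor is an assumption
made for this example.

```perl
use strict;
use warnings;

# Hypothetical exon carrying a type label.
package My::Exon {
    sub new {
        my ($class, %args) = @_;
        my $self = { type => $args{type}, start => $args{start} };
        return bless $self, $class;
    }
    sub type  { return $_[0]->{type} }
    sub start { return $_[0]->{start} }
}

# Hypothetical transcript holding a list of exons.
package My::Transcript {
    sub new {
        my ($class, @exons) = @_;
        return bless { exons => [@exons] }, $class;
    }

    # exons()          returns all exons
    # exons('initial') returns only the exons of that type
    sub exons {
        my ($self, $type) = @_;
        my @exons = @{ $self->{exons} };
        return @exons unless defined $type;
        return grep { $_->type eq $type } @exons;
    }
}

package main;

my $transcript = My::Transcript->new(
    My::Exon->new(type => 'initial',  start => 100),
    My::Exon->new(type => 'internal', start => 300),
    My::Exon->new(type => 'internal', start => 500),
    My::Exon->new(type => 'terminal', start => 700),
);
my @all      = $transcript->exons;
my @internal = $transcript->exons('internal');
print scalar(@all), " exons, ", scalar(@internal), " internal\n";
```

A single method with an optional argument keeps the interface small;
the alternative would be initial_exons(), internal_exons(),
terminal_exons(), and so on.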

>    The exon primary_tag is actually a hard thing to provide. Should
>    the primary_tag change depending on the argument - this is very
>    nasty for the implementation objects.
>    (b) figure out how to get these things in and out of
>        EMBL/GenBank format without loss of information

In general this would be a *very* good thing to have. But it also
means venturing into the semantics of GenBank features. If this
is to make it into 0.7, we'll have to extend the deadline.

How do people see the chances of success, and in which time frame?

Any takers?

>    (c) Ditto with GAME


> Implementations:
>     Hilmar/Ewan to do bioperl implementations
>     Hilmar to do bioperl parsing modules
>     Ewan/Hilmar to do the interfaces files

Interface & implementation is okay, and I'll take care of the gene
prediction parsers. The GenBank/EMBL gene feature needs a
braveheart who either has enough time or already has enough code,
or - probably best - both.

Hilmar Lapp                                email: hlapp@gmx.net
GNF, San Diego, Ca. 92122                  phone: +1 858 812 1757