[Bioperl-l] Gene Interface?

Hilmar Lapp hlapp@gmx.net
Mon, 04 Dec 2000 00:26:36 -0800

Ewan Birney wrote:
> So - I guess I am trying to open discussion of a gene interface, which
> probably will have to be cross-posted between ensembl-dev and bioperl-l
> (? do you agree hilmar?).

Sure I do.

A primary question triggered by the bioperl SeqFeature::GeneStructure
class is where a Gene and the properties specific to a gene will end up.
I made basically two suggestions at that time. The first was to have
Gene as its own module, utilizing an associated GeneStructure and
Bio::Seq, and of course adding things like transcript(s) etc. The second
was to simply extend GeneStructure by what was missing with respect to
the notion of a gene, which came down to transcripts. We came to some
agreement that extending GeneStructure would be the way to go.

As nothing has been done (code-wise) in this direction so far, this
issue has not matured since. So, comments are very welcome, and maybe
people have a third (or 4th ...) way of doing it.

> Let's map out some clear use cases for the generic gene interface:
>    - should be able to store transcript information
>      (one gene has multiple transcripts)

See above. This can be achieved either way quite simply. The only
question is how to model a transcript: simply an array of the right
exons in the right order, or a module inheriting from SeqFeatureI with
the right exons as subfeatures, or as a sequence, or something
different. Then there is also a predicted transcript for gene structures
arising from gene structure predictions. Do we want to treat this
separately (which is done now), or is it essentially a transcript like
any other?
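
To make the first two modelling options concrete, here is a minimal,
language-neutral sketch written in Python rather than Perl. Everything
in it is hypothetical: `Feature` only stands in for a SeqFeatureI-like
object, and neither function is part of bioperl or ensembl. It assumes
forward-strand exons with 1-based, inclusive coordinates.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """Hypothetical stand-in for a SeqFeatureI-like object."""
    start: int                      # 1-based, inclusive
    end: int                        # 1-based, inclusive
    strand: int = 1
    sub_features: List["Feature"] = field(default_factory=list)


def transcript_as_list(exons):
    """Option 1: a transcript is just the right exons in the right order."""
    # Assumption: forward strand, so ascending start is transcription order.
    return sorted(exons, key=lambda e: e.start)


def transcript_as_feature(exons):
    """Option 2: a transcript is itself a feature with exons as subfeatures."""
    ordered = sorted(exons, key=lambda e: e.start)
    return Feature(start=ordered[0].start, end=ordered[-1].end,
                   strand=ordered[0].strand, sub_features=ordered)


exons = [Feature(300, 400), Feature(100, 180)]
as_list = transcript_as_list(exons)
as_feature = transcript_as_feature(exons)
```

The second form carries its own location, so it can be handled like any
other feature; the first is lighter but leaves the transcript's extent
to be derived from its exons each time.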

>    - easy to get protein and cDNA sequences

The only thing missing here right now is annotating every exon with its
frame. This will be fixed.
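
As a rough illustration of why the per-exon frame matters, the sketch
below (Python, hypothetical names, not bioperl code) splices a cDNA from
exon coordinates and derives a frame for each exon. It assumes
forward-strand exons given in transcription order, with 1-based
inclusive coordinates and the coding region starting at the first base
of the first exon. "Frame" is taken here to mean the number of bases at
the start of an exon that complete a codon begun in the previous exon;
other conventions exist.

```python
def splice_cdna(genomic, exons):
    """Concatenate exon sequences; exons are (start, end) pairs."""
    return "".join(genomic[start - 1:end] for start, end in exons)


def exon_frames(exons):
    """Return one frame value (0, 1 or 2) per exon."""
    frames = []
    consumed = 0  # coding bases contributed by the preceding exons
    for start, end in exons:
        # Bases still needed to finish the codon left open so far.
        frames.append((3 - consumed % 3) % 3)
        consumed += end - start + 1
    return frames


# Toy example: two exons separated by a lower-case intron.
genomic = "ATGGC" + "gtaag" + "CTTTTAA"
exons = [(1, 5), (11, 17)]

cdna = splice_cdna(genomic, exons)
frames = exon_frames(exons)
```

In this toy case the first exon ends two bases into a codon, so the
second exon is assigned frame 1. With the frame stored on each exon, a
single exon can be translated without walking the whole transcript.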

>    - should be able to store exons as seqfeatures
>    ? should have slots for DBLinks/annotation (or do we want a higher
> collection interface for this? If so, how structured?)

I'd have derived classes implementing such capabilities.

>    - should not mandate an in memory implementation

Hmm. Which part and why? The reason I can see is using up too much
memory, which as far as I can imagine could almost only be caused by an
attached sequence being too big. So, it's only the sequence object that
should be able to swap itself to e.g. disk. I'm not sure whether I'm
missing something.
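
A minimal sketch of the idea that only the sequence object needs to live
outside memory, in Python with hypothetical names (`DiskBackedSeq` is
not an existing bioperl or ensembl class). It assumes the sequence is
stored as a plain file of raw residues with no header and no newlines.

```python
import os
import tempfile


class DiskBackedSeq:
    """Hypothetical sequence object that keeps its residues on disk."""

    def __init__(self, path):
        self.path = path

    def length(self):
        # One byte per residue under the stated file-layout assumption.
        return os.path.getsize(self.path)

    def subseq(self, start, end):
        """Return residues start..end (1-based, inclusive)."""
        with open(self.path, "rb") as fh:
            fh.seek(start - 1)
            return fh.read(end - start + 1).decode("ascii")


# Usage: write a toy sequence to a temporary file, then read one exon.
with tempfile.NamedTemporaryFile("w", suffix=".seq", delete=False) as tmp:
    tmp.write("ATGGCGTAAGCTTTTAA")

seq = DiskBackedSeq(tmp.name)
total = seq.length()
exon = seq.subseq(11, 17)
os.remove(tmp.name)
```

The gene, transcript and exon objects would hold only coordinates and a
reference to such an object, so their own memory footprint stays small.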

> Here are some issues that I think could be difficult to reconcile between
> bioperl and ensembl views:
>    - Ensembl genes and transcripts are NOT seqfeatures. The placement of
> an ensembl gene on a single coordinate system is held in something called
> "VirtualGene" (not a great name. It is a gene on a virtualcontig). Ensembl
> has a big win by allowing a gene to be built "across" coordinate systems,
> allowing the coordinate system to be by-and-large decoupled from the gene
> structure. Some "magic" is used for the places where the gene structure is
> highly dependent on the assembly.

Hmm. I guess they will stay features in bioperl. The question is then
whether this is prohibitive for the respective bioperl objects being
used in ensembl, and what we can/should do about it.

I'm not sure what's involved in all this, so I'm looking forward to
comments from the ensembl guys.

Hilmar Lapp                                email: hlapp@gmx.net
GNF, San Diego, Ca. 92122                  phone: +1 858 812 1757