[Bioperl-l] GeneStructure interfaces

Hilmar Lapp hilmarl@yahoo.com
Sun, 18 Feb 2001 12:42:45 -0800

"Alan Robinson (EBI)" wrote:
> 1) There are three 'GeneStructure' files- Two implementations
>    'Bio::SeqFeature::GeneStructure' and
>    'Bio::SeqFeature::Gene::GeneStructure' and an interface
>    'Bio::SeqFeature::Gene::GeneStructureI'
>    There are differences between the implementations (e.g. the 'Gene'
>    one has no cds() methods) and it appears that neither implements
>    the GeneStructureI interface?
>    Which is the definitive? Is one of them cruft?
>    Depending on the above, the next 5 comments may be invalid.

The 'definitive' stuff is supposed to sit in
Bio::SeqFeature::Gene::*. Bio::SeqFeature::GeneStructure was my
first cast, inspired by the need to come up with some sensible
objects that gene structure prediction parsers could return. So
that's cruft as of now; I just haven't removed the module yet.

> 2) The 'GeneStructureI' interface has no 'introns()' method.

That's right. I was unsure whether to put it in the interface,
too. Ewan wrote up the proposal and he didn't. If people think
it's essential, it's no problem to put it there, too.

> 3) The 'GeneStructureI' interface is documented as returning an array
>    of 'ExonI' objects from the 'utrs()' method, but both the
>    'GeneStructure' implementations are documented as returning arrays
>    of 'SeqFeatureI' objects.

This is a documentation bug, thanks for pointing this out.

I'll take this opportunity to elaborate a bit more on my ideas,
hoping that this triggers some more feedback (positive or
negative, I don't mind). My impression is that there is some
confusion about whether to call UTR an exon or not. UTR is clearly
transcribed, but is clearly not translated, so it's not expressed.
So in terms of expressed sequence, it's not an exon, but in terms
of building blocks of the mRNA used for translation it clearly is
an exon. Gene structure prediction tools won't normally predict
UTR, because its base composition is not significantly different
from that of intergenic regions (you can also have repeats there).

Now, the other question is what you want to get out of the
object (because things you don't need to query you don't need to
store). I understood Ewan's write-up to mean that you want both
the CDS and the whole biological transcript, which certainly
includes UTR (and possibly some promoter, but it appears I'm being
educated right now by Ensembl people).

So, my solution was to treat translated sequence and UTR equally
in the sense of transcription, and to require both as ExonI
objects, meaning that every ExonI object is transcribed. ExonI got
an additional method is_coding() with the obvious meaning.
$transcript->exons() will also return objects you added via
add_utr($utr), unless you say $transcript->exons('coding'). mrna()
will just pull out all exons, prefix the promoter (complaints all
over the place?), and append the polyA site. $transcript->cds()
pulls out all coding exons (and takes care of phase/frame
adjustments etc).
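
To make the intended behaviour concrete, here is a minimal sketch
of the idea as a toy model in Python. It is not the BioPerl code:
the Exon and Transcript classes, their fields, and the assumption
that exons are added in 5' to 3' order are all hypothetical, and
the phase/frame adjustment that cds() performs is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Exon:
    """Hypothetical stand-in for an ExonI object: always transcribed."""
    seq: str
    coding: bool = True

    def is_coding(self) -> bool:
        return self.coding


@dataclass
class Transcript:
    """Hypothetical transcript; exons are assumed to be added 5' to 3'."""
    _exons: List[Exon] = field(default_factory=list)
    promoter: str = ""
    poly_a: str = ""

    def add_exon(self, exon: Exon) -> None:
        self._exons.append(exon)

    def add_utr(self, utr: Exon) -> None:
        # A UTR is stored like any other exon, just flagged as non-coding.
        utr.coding = False
        self._exons.append(utr)

    def exons(self, kind: Optional[str] = None) -> List[Exon]:
        # No argument: everything transcribed, UTRs included.
        # 'coding': only the exons for which is_coding() is true.
        if kind == "coding":
            return [e for e in self._exons if e.is_coding()]
        return list(self._exons)

    def mrna(self) -> str:
        # Promoter, then all exons, then the polyA site.
        return self.promoter + "".join(e.seq for e in self.exons()) + self.poly_a

    def cds(self) -> str:
        # Coding exons only; phase/frame adjustment is omitted in this sketch.
        return "".join(e.seq for e in self.exons("coding"))


transcript = Transcript(promoter="tata", poly_a="aaaa")
transcript.add_utr(Exon("gg"))
transcript.add_exon(Exon("ATGAAA"))
transcript.add_exon(Exon("TTTTAA"))
transcript.add_utr(Exon("cc"))
print(len(transcript.exons()), len(transcript.exons("coding")))
print(transcript.mrna())
print(transcript.cds())
```

The point of the design is that a caller never has to merge two
separate lists: one exons() call yields the whole transcribed
structure, and the 'coding' filter narrows it down when needed.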

> 4) For the 'exons()' method; 'Bio::SeqFeature::GeneStructure' is
>    documented as returning an array of 'SeqFeatureI'
>    objects. 'GeneStructureI' and
>    'Bio::SeqFeature::Gene::GeneStructure' are documented as returning
>    an array of 'ExonI' objects.

Bio::SeqFeature::GeneStructure is obsolete.

> 5) The 'utrs()' method of 'Bio::SeqFeature::Gene::GeneStructure' has
>    optional arguments that are not documented in the 'GeneStructureI'
>    interface for specifying the type of UTR to be returned.

That's right. The same goes for
Bio::SeqFeature::Gene::TranscriptI. The interface does not
require an implementation to take arguments. There are good
reasons to require them, though, because 5' and 3' UTR are clearly
not interchangeable when constructing a transcript sequence.
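
A small sketch of why the optional argument matters, again as a
toy model in Python and not the BioPerl implementation. The
function names utrs() and transcript_seq() and the (type,
sequence) tuples are hypothetical; the type names 'utr5prime' and
'utr3prime' are the ones mentioned in this thread.

```python
from typing import List, Optional, Tuple

# Hypothetical representation of a UTR: (type, sequence).
Utr = Tuple[str, str]


def utrs(all_utrs: List[Utr], kind: Optional[str] = None) -> List[Utr]:
    """Return all UTRs, or only those of the requested type."""
    if kind is None:
        return list(all_utrs)
    return [u for u in all_utrs if u[0] == kind]


def transcript_seq(all_utrs: List[Utr], coding_seq: str) -> str:
    """Assemble 5' UTR + coding sequence + 3' UTR."""
    five = "".join(seq for _, seq in utrs(all_utrs, "utr5prime"))
    three = "".join(seq for _, seq in utrs(all_utrs, "utr3prime"))
    return five + coding_seq + three


# The 3' UTR is listed first on purpose: the order of storage does
# not matter once the caller can ask for each type separately.
stored = [("utr3prime", "cc"), ("utr5prime", "gg")]
print(transcript_seq(stored, "ATGTAA"))
```

Without the type argument the caller only gets an undifferentiated
list and has to work out for itself which UTR goes on which end.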

Feedback appreciated.

> 6) The 'Bio::SeqFeature::GeneStructure' implementation has an optional
>    argument for the 'cds()' method to specify if the returned CDS
>    should be corrected for the phase; however the 'cds()' methods of
>    other objects (e.g. Transcript, Exon and ExonI) specify that the
>    CDS returned must be in phase by the addition of N's at the
>    beginning.

That's right. Now TranscriptI and ExonI do this job, which is nice
because it allows for free exon shuffling. (As a consequence, an
exon *must* have the frame set, which required some reworks in

> 7) The 'Transcript' implementation includes as exon types in the sort
>    order 'utr5prime', 'utr3prime' and 'ployA'; whilst the 'Exon'
>    implementation only includes 'utr' as a valid exon type. Is there a
>    particular reason not to have the prime-ness included for the
>    'Exon' valid types?

The valid types in Exon are matched as case-insensitive regular
expressions, so they allow for prime-ness (but don't require it).
The main reason for checking the type in Exon is to keep you from
specifying a completely inappropriate primary_tag.
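
As an illustration of that kind of check, here is a sketch in
Python rather than the actual Exon code. The VALID_TYPES list and
the function name are assumptions made for the example; only 'utr'
is named in this thread as a valid type, and whether the real
patterns are anchored is not stated here.

```python
import re

# Assumed list of valid type patterns; hypothetical apart from 'utr'.
VALID_TYPES = ("exon", "utr")


def is_valid_exon_type(primary_tag: str) -> bool:
    """True if any valid-type pattern matches the tag, ignoring case."""
    return any(
        re.search(pattern, primary_tag, re.IGNORECASE)
        for pattern in VALID_TYPES
    )


for tag in ("utr", "UTR5prime", "utr3prime", "Exon", "promoter"):
    print(tag, is_valid_exon_type(tag))
```

Because the match in this sketch is unanchored, 'utr5prime' and
'utr3prime' pass without being listed, and so would any other tag
that merely contains one of the patterns.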

Hilmar Lapp                              email: hlapp@gmx.net
GNF, San Diego, Ca. 92122                phone: +1 858 812 1757