Arlin Stoltzfus
Wed, 20 Sep 2000 09:54:52 -0400

Ewan Birney wrote:
> > > Also, why are introns and exons top-level features of a sequence, when
> > > they are also (obviously) sub-features of a gene?
> > >
> This is an issue with GenBank/EMBL being mapped into a more interpretable
> format.
> GenBank/EMBL sometimes puts introns/exons separate from the CDS lines.
> Quite often they *disagree* with the CDS lines. What are we meant to do in
> these cases.

It may help to know that the information on the CDS line of a GenBank  
text file is not a description of the splicing process, but a "SeqLoc" or 
sequence location for the CDS feature.  This is why it starts and ends 
with start and stop codons, and not with the beginning and ending of 
the first and last exons.  Some published surveys of exon lengths are 
actually based on interpreting the first and last intervals in the CDS 
SeqLoc statements as exons, but they are not.  

Every feature in the feature table has a SeqLoc mapping it to the 
sequence, and a SeqLoc of "1..4" is the same as "join(1..3,4)" or 
"order(1..2,3..4)" or "join(1,2,3,4)" etc., because they all specify the 
same sequence location.  A CDS that results from -1 translational 
frameshifting after the 45th nucleotide might be specified by "join(1..45,
45..599)", so that the 45th nucleotide is included twice.  Some entries 
in GenBank actually use this entirely legitimate method.  
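
As a rough illustration of the equivalence described above, here is a 
minimal Python sketch that expands a location string into the list of 
positions it covers.  The function names (parse_location, positions) are 
hypothetical, and the parser is an assumption that handles only plain 
"a..b" ranges, single positions, and one level of join()/order(); it 
ignores complement(), "<" and ">" partial markers, and remote accessions.

```python
import re

def parse_location(loc):
    """Return the (start, end) intervals named by a simple location string.

    Hypothetical helper: supports only 'a..b', single positions, and one
    level of join(...) or order(...).
    """
    text = loc.strip()
    # Strip an optional join(...) or order(...) wrapper.
    m = re.fullmatch(r"(?:join|order)\((.*)\)", text)
    body = m.group(1) if m else text
    intervals = []
    for part in body.split(","):
        part = part.strip()
        if ".." in part:
            start, end = part.split("..")
        else:
            start = end = part  # a single position such as "4"
        intervals.append((int(start), int(end)))
    return intervals

def positions(loc):
    """Expand a location into the ordered list of 1-based positions it covers."""
    return [p for start, end in parse_location(loc)
            for p in range(start, end + 1)]

# The four spellings from the text all name the same positions.
same = [positions(s) for s in
        ("1..4", "join(1..3,4)", "order(1..2,3..4)", "join(1,2,3,4)")]

# The frameshift example: position 45 is listed twice.
shifted = positions("join(1..45,45..599)")
```

In this sketch every entry of "same" is the list 1, 2, 3, 4, while 
"shifted" contains position 45 twice, so the CDS is one nucleotide longer 
than the span 1..599 of the sequence it maps to.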

Also, it's not GenBank's decision whether the introns and exons 
appear explicitly in the feature table: the people who submit 
sequences typically annotate only the CDS, and do not annotate 
mRNA or intron features (usually they have no experimental evidence 
to do this anyway).  Programmers can interpret the CDS SeqLoc to 
get implicit information on splicing (I do it all the time), but this 
has its risks.  
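One way to read implicit splicing information out of a CDS SeqLoc is 
sketched below in Python.  This is only an illustration under stated 
assumptions, not a description of any particular tool: the function names 
are hypothetical, the intervals are assumed to be forward-strand (start, 
end) pairs in CDS order, and gaps between consecutive intervals are 
treated as candidate introns.  Two of the risks are visible in the code: 
only the internal intervals can be taken as whole exons, because the first 
and last intervals are bounded by the start and stop codons, and an 
overlapping join such as the frameshift case implies no intron at all.

```python
def implied_introns(intervals):
    """Return gaps between consecutive CDS intervals as candidate introns.

    Assumes forward-strand, 1-based, inclusive (start, end) pairs in CDS
    order.  Adjacent or overlapping intervals produce no gap.
    """
    introns = []
    for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns

def internal_intervals(intervals):
    """Return the intervals that can be read as complete exons.

    The first and last CDS intervals are dropped: they start or end at
    the start or stop codon, not at an exon boundary.
    """
    return intervals[1:-1]

# Hypothetical CDS location join(10..50,101..200,301..350).
cds = [(10, 50), (101, 200), (301, 350)]
introns = implied_introns(cds)        # gaps between the three intervals
exons = internal_intervals(cds)       # only the middle interval qualifies

# The frameshift location join(1..45,45..599) has no gap, hence no intron.
frameshift_introns = implied_introns([(1, 45), (45, 599)])
```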

