[Bioperl-l] Using frame info from GFF in getting aSeq->spliced_seq

Amir Karger akarger at CGR.Harvard.edu
Mon Dec 11 11:20:03 EST 2006

Chris Fields wrote:
> > Yes, I think. Scott Cain pointed out that GFF column 8 is the 
> > "phase", which I had never heard of before. My current, very 
> > limited, understanding is that sometimes you'll have an exon 
> > with, say, 31 bp, followed by an exon with 29 bp. When the 
> > intron gets spliced out, you eventually get an mRNA of 60 bp, 
> > which translates to a protein of 20 aa.
> > But the second exon has a phase of 1, not 0, because you 
> > can't just start translating at the first bp of the second 
> > exon and expect to get nice amino acids.
> I think the use of 'frame' here is meant relative to the DNA sequence
> (i.e. ORF searching, 6 frames) and the 'phase' is relative to the mRNA
> (i.e. translation, three frames).  At least I think that's what is meant!

I agree. By the way, I'd love a reference to a simple bio-explanation of
what's happening here. Google searches for "coding sequence phase" are
not all that relevant.
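
For what it's worth, here is a minimal sketch of the arithmetic behind the 31 bp / 29 bp example quoted above. It is plain Python rather than BioPerl, and the function names are made up for illustration. It assumes the GFF3 reading of "phase" (the number of bases to remove from the start of a feature to reach the first base of the next codon); under that reading the second exon comes out as phase 2. The "1" in the quoted example matches the other common convention, which counts the bases of the split codon carried over from the previous exon.

```python
def next_phase(length, phase=0):
    """GFF3-style phase of the CDS segment that follows this one (assumed convention)."""
    # Bases of this segment that lie at or after the first complete codon start.
    in_frame = length - phase
    # Bases left over after the last complete codon in this segment.
    leftover = in_frame % 3
    # The next segment has to skip the bases that finish the split codon.
    return (3 - leftover) % 3


def carried_over(length, phase=0):
    """Bases of an incomplete codon carried into the next segment (alternative convention)."""
    return (length - phase) % 3


exon_lengths = [31, 29]                 # the example from the thread
print(next_phase(exon_lengths[0]))      # 2: skip two bases of exon 2
print(carried_over(exon_lengths[0]))    # 1: one base of the codon sits in exon 1
print(sum(exon_lengths) // 3)           # 20 codons in the 60 bp spliced CDS
```

Either way the spliced sequence is the same 60 bp; the two conventions only differ in how they describe where the codon boundary falls inside the second exon.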

> > I'm still confused as to why you would have a phase in the 
> > first exon, though. Why not just say the CDS starts 1 or 2 bp 
> > later? (This is probably a bio question, not a bioperl 
> > question, but a quick Google didn't get me an answer. "Phase" 
> > isn't a very good search term.)
> It could be b/c the location coordinates delineate the exon coding
> boundary. It's conceivable the first exon in a sequence record is not
> the first exon of the mRNA (i.e. there may be one or more exons prior
> to or past the exon of interest that are in 'remote' sequence records).

That's certainly not the case here, because the files have the entire
genomes in them.

> Also, the ends of the location may be uncertain ('fuzzy'):
> join(complement(1009..>1260),complement(AF081827.1:<1..177))

Also not the case here. These locations aren't listed as fuzzy.

Any other thoughts?

> > I guess the real question here, which Jason alludes to, is whether
> > SeqFeature->spliced_seq ought to take into account the phase
> > information of the first exon. Right now, it doesn't, so when you call
> > SeqFeature->spliced_seq->translate, you get gibberish. Are there cases
> > where you would want spliced_seq to include the first bp or
> > two? Should there be an option to spliced_seq for whether you
> > want to take phase information into account?
> You can already pass the frame or an offset to
> PrimarySeqI::translate(). We could add a '-phase' argument for
> convenience which accepts 0,1,2.

But as Jason pointed out, you should find the problem earlier. What if I
want to get the RNA sequence that will become the protein? Then having a
phase arg to translate() doesn't help. Should there be a phase arg to

Which raises another bio question: at what point are the first 1 or 2 bp
dropped when you have a phase of 1 or 2? Do they appear in the mRNA? 
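
To make the two options in the spliced_seq question concrete, here is a small sketch in plain Python (not BioPerl; the function name and the 13 bp sequence are invented for illustration). It assumes the leading bases stay in the spliced sequence and are only skipped when reading codons, which is roughly what passing an offset at translation time would do. Whether that is the right biological answer is exactly the open question.

```python
def codons_from_spliced(spliced, phase=0):
    """Split a spliced CDS into complete codons, skipping `phase` leading bases."""
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1, or 2")
    # Drop the leading bases that belong to an incomplete codon.
    in_frame = spliced[phase:]
    # Ignore any trailing bases that do not fill a whole codon.
    usable = len(in_frame) - len(in_frame) % 3
    return [in_frame[i:i + 3] for i in range(0, usable, 3)]


spliced = "GATGGCCAAATAA"  # made-up 13 bp spliced sequence, first CDS phase 1
print(codons_from_spliced(spliced, phase=0))  # ['GAT', 'GGC', 'CAA', 'ATA']
print(codons_from_spliced(spliced, phase=1))  # ['ATG', 'GCC', 'AAA', 'TAA']
```

Ignoring the phase reads the wrong frame from the first base on, which is the "gibberish" case; applying it recovers the intended codons while the full 13 bp sequence is still available to anyone who wants the RNA.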

-Amir Karger
