[Bioperl-l] Using frame info from GFF in getting aSeq->spliced_seq
akarger at CGR.Harvard.edu
Mon Dec 11 11:20:03 EST 2006
Chris Fields wrote:
> > Yes, I think. Scott Cain pointed out that GFF column 8 is the
> > "phase", which I had never heard of before. My current, very
> > limited, understanding is that sometimes you'll have an exon
> > with, say, 31 bp, followed by an exon with 29 bp. When the
> > intron gets spliced out, you eventually get an mRNA of 60 bp,
> > which translates to a protein of 20 aa.
> > But the second exon has a phase of 1, not 0, because you
> > can't just start translating at the first bp of the second
> > exon and expect to get nice amino acids.
> I think the use of 'frame' here is meant relative to the DNA sequence
> (i.e. ORF searching, 6 frames) and the 'phase' is relative to the mRNA
> (i.e. translation, three frames). At least I think that's what is meant!
I agree. By the way, I'd love a reference to a simple bio-explanation of
what's happening here. Google searches for "coding sequence phase" are
not all that relevant.
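
To make the 31 bp / 29 bp example concrete for myself, here is a rough
sketch in plain Python (not BioPerl; the function name is made up). It
assumes the first exon starts on a codon boundary. It prints two numbers
per exon, because there seem to be two ways of counting: the GFF3 spec
defines phase as the number of bases to skip at the start of the feature
to reach the next full codon, while the other number is simply how far
into a codon the exon's first base falls. I'm not sure which one a given
file's column 8 uses, so treat this as an illustration only.

```python
def exon_phases(exon_lengths):
    """For each coding exon (5' to 3'), return (bases_to_skip, codon_offset).

    bases_to_skip: GFF3-style phase, bases to drop to reach the next codon.
    codon_offset:  0-based position within a codon of the exon's first base.
    Assumes the first exon begins exactly on a codon boundary.
    """
    phases = []
    consumed = 0  # coding bases contributed by earlier exons
    for length in exon_lengths:
        codon_offset = consumed % 3
        bases_to_skip = (3 - codon_offset) % 3
        phases.append((bases_to_skip, codon_offset))
        consumed += length
    return phases


print(exon_phases([31, 29]))  # [(0, 0), (2, 1)]
```

So for the second exon you get 2 under the "bases to skip" definition and
1 under the "offset into the codon" way of counting.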
> > I'm still confused as to why you would have a phase in the
> > first exon, though. Why not just say the CDS starts 1 or 2 bp
> > later? (This is probably a bio question, not a bioperl
> > question, but a quick Google didn't get me an answer. "Phase"
> > isn't a very good search term.)
> It could be b/c the location coordinates delineate the exon coding
> boundary. It's conceivable the first exon in a sequence record is not
> the first exon of the mRNA (i.e. there may be one or more exons prior
> to or past the exon of interest that are in 'remote' sequence records).
That's certainly not the case here, because the files have the entire
genomes in them.
> Also, the ends of the location may be uncertain ('fuzzy'):
Also not the case here. These locations aren't listed as fuzzy.
Any other thoughts?
> > I guess the real question here, which Jason alludes to, is whether
> > SeqFeature->spliced_seq ought to take into account the phase
> > information of the first exon. Right now, it doesn't, so when you call
> > SeqFeature->spliced_seq->translate, you get gibberish. Are there cases
> > where you would want spliced_seq to include the first bp or
> > two? Should there be an option to spliced_seq for whether you
> > want to take phase information into account?
> You can already pass the frame or an offset to
> We could add a '-phase' argument for
> convenience which accepts 0,1,2.
But as Jason pointed out, you should find the problem earlier. What if I
want to get the RNA sequence that will become the protein? Then having a
phase arg to translate() doesn't help. Should there be a phase arg to
Which raises another bio question: at what point are the first 1 or 2 bp
dropped when you have a phase of 1 or 2? Do they appear in the mRNA?
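
For what it's worth, here is the behavior I have in mind, sketched in
plain Python rather than BioPerl. The names spliced_cds and translate
below are hypothetical and are not the BioPerl methods. The sketch
assumes forward-strand exons given as strings, a GFF3-style phase (the
number of leading bases to skip), and the standard genetic code. It
trims at splice time, which is only one possible answer to the question
of where the trimming belongs.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def spliced_cds(exons, first_phase=0):
    """Join exon sequences and drop `first_phase` leading bases."""
    if first_phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1 or 2")
    return "".join(exons)[first_phase:]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


exons = ["GATGGCC", "AAATAA"]  # toy data; first exon has phase 1
print(translate(spliced_cds(exons)))                 # DGQI  (phase ignored)
print(translate(spliced_cds(exons, first_phase=1)))  # MAK*  (phase applied)
```

With the phase ignored you get the kind of gibberish I described; with
it applied, the reading frame lines up with the ATG.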