[Bioperl-l] Re: Bio::Tools::Genscan

Hilmar Lapp hlapp@gmx.net
Fri, 04 Aug 2000 12:52:41 +0200

James Gilbert wrote:
> >
> > The exception is that if the initial exon is missing from a prediction there
> > is obviously no way to know the phase of it, and consequently you don't know
> > the frame of the first base of the first exon (the frame of the first base of
> > the initial exon is always 0).
> We've been through the same problems in Ensembl.
> The only effective solution we found is to parse
> the peptide sequences at the bottom of the
> report.  Unfortunately, this would mean that you
> could no longer delay parsing the whole file.

I can still delay until the section containing the predicted sequences is
reached, which I already do, the rationale being that the sequences are the
major part of the result file.

> > To automatically select the right frame I have added a method correct_phase()
> > to GeneStructure, which tries to adjust the frame (and phase) of a given DNA
> > sequence by prepending 1 or 2 Ns (in lower-case) to it until the DNA sequence
> > does no longer yield intervening termination codons in the frame starting with
> > the first base.
> This won't necessarily get you the correct
> translation, because there may be more than one
> full-length open reading frame.
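
To make the ambiguity concrete, here is a minimal sketch in Python (an
illustration only, not the BioPerl correct_phase() code; the names
translate, has_internal_stop and open_paddings are made up for this
example). It assumes the standard genetic code and treats any codon
containing a padding n as X. It lists every padding of 0, 1 or 2 leading
n's that leaves the frame free of intervening stops; more than one hit
means the test alone cannot pick the right frame.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate from the first base; unknown codons (e.g. with n) become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def has_internal_stop(dna):
    """True if a stop codon occurs anywhere except as the last codon."""
    return "*" in translate(dna)[:-1]


def open_paddings(dna):
    """Paddings (number of leading n's) whose frame has no internal stop."""
    return [pad for pad in range(3) if not has_internal_stop("n" * pad + dna)]


if __name__ == "__main__":
    # A fragment whose first two bases belong to a truncated codon.
    print(open_paddings("TGATGGCCTAA"))  # [1, 2]
```

For the fragment TGATGGCCTAA, padding 0 is rejected (TGA is an internal
stop), but paddings 1 and 2 both pass, so the first padding that passes is
not necessarily the right one.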

You're right. If more than one ORF extends over the whole sequence, the
proposed method can no longer be guaranteed to yield the correct
translation. Even without the source sequence I could record the number of Ns
prepended by Genscan to the CDS prediction, provided the user gave -cds to

> I think there should be a frame or offset property
> on the object which shows where to start the
> translation.

Note that Genscan doesn't report the offset, or frame with respect to the CDS,
for every exon. The frame it does report refers to the whole sequence (at
least that is how I understand it). It also reports the phase of exons, and if
you look at the frame of an exon following a phase 0 exon, it can differ from
the frame of the phase 0 exon. (For those not familiar with this: phase is
defined as length mod 3, and hence, with frame referring to the CDS, the first
base of an exon following a phase 0 exon must be in the same frame as the
first base of the phase 0 exon.) The reason for the difference is that the
intervening intron is not phase 0, i.e. its length is not a multiple of 3.
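
A small sketch of that arithmetic, in Python and with invented exon
lengths (the function names phases and cds_frames are hypothetical, and
nothing here reads Genscan output). It uses phase as defined above, length
mod 3, and takes the frame of an exon's first base relative to the CDS to
be the total length of the preceding exons mod 3.

```python
def phases(exon_lengths):
    """Phase of each exon, defined as exon length mod 3."""
    return [length % 3 for length in exon_lengths]


def cds_frames(exon_lengths):
    """Frame (0, 1 or 2) of the first base of each exon, relative to the CDS.

    The first base of the initial exon is frame 0; each later exon starts
    at the total length of the exons before it, taken mod 3.
    """
    frames = []
    offset = 0
    for length in exon_lengths:
        frames.append(offset % 3)
        offset += length
    return frames


if __name__ == "__main__":
    lengths = [120, 100, 62]    # made-up exon lengths, in CDS order
    print(phases(lengths))      # [0, 1, 2]
    print(cds_frames(lengths))  # [0, 0, 1]
```

The second exon follows a phase 0 exon and so starts in the same CDS frame
as that exon (both 0), whatever the length of the intron between them.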

Another indication that the reported frame refers to the whole sequence is
the frame of initial exons. This is supposed to be zero with respect to the
CDS, but in fact it is often reported as non-zero.

> The phase or frame numbers which genscan gives
> don't make sense for reverse strand transcripts.

I have checked two reverse-strand predictions and translating the CDS as
defined by the exon coordinates gave the correct results (i.e., the same as
the prediction). 
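
For anyone wanting to repeat the check outside BioPerl, the extraction
step can be sketched as follows in Python. This is an illustration under
stated assumptions, not what test-genscan.pl does: the function names are
made up, and exon coordinates are assumed to be 1-based, inclusive, given
on the forward strand with start <= end (coordinates taken from a report
may need normalising first).

```python
# Complement table; n/N stay as they are.
COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def revcomp(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def extract_cds(genomic, exons, strand):
    """Join exon subsequences into a CDS.

    exons  -- iterable of (start, end), 1-based inclusive, start <= end,
              in forward-strand coordinates (an assumption of this sketch)
    strand -- "+" or "-"
    """
    pieces = [genomic[start - 1:end] for start, end in sorted(exons)]
    cds = "".join(pieces)
    # For a reverse-strand gene, reverse-complementing the joined
    # forward-strand pieces yields the exons in transcript order.
    return revcomp(cds) if strand == "-" else cds


if __name__ == "__main__":
    # Toy sequence: two reverse-strand exons separated by a 5 bp intron.
    genomic = "TTAG" + "CCCCC" + "GCCAT"
    print(extract_cds(genomic, [(1, 4), (10, 14)], "-"))  # ATGGCCTAA
```

The resulting CDS can then be translated from its first base and compared
with the peptide given in the report.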

I've checked in a little test script, test-genscan.pl, in examples/. Please
let me know if you find a sequence for which the predicted protein sequence
differs from the extracted one (the script shows both).

I'll add a frame attribute for the whole gene structure, derived from the
number of Ns prepended to the Genscan CDS prediction. I'd still like to hear
of examples for which this would be essential to enable correct translation.
Do any of the Ensembl people have some of the sequences that caused
trouble at hand?


Hilmar Lapp                                      email: hlapp@gmx.net
NFI Vienna, IFD/Bioinformatics                   phone: +43 1 86634 631
A-1235 Vienna                                      fax: +43 1 86634 727