[Bioperl-l] Bio::Tools::Genscan

Hilmar Lapp hlapp@gmx.net
Thu, 03 Aug 2000 11:54:25 +0200

Dear all,

there is now a Genscan parser in Bio::Tools::Genscan.

After initialization with a file handle or a file the central method is

	$gene = $genscan->next_prediction();

which returns a Bio::Tools::Prediction::Gene object, which in turn is-a

As a technical detail, the input file is not parsed completely first, but only
up to the first predicted sequence entry. Assuming that these make up for a
major part of the file, even parsing prediction results for very large input
sequences should allow to keep the memory consumption reasonably low if you
can manage to dispose each returned object before requesting the next one.

The predicted protein, and if available (i.e., if -cds was given to Genscan)
the predicted CDS are both stored, too, although as far as I have been able to
test it calling $gene->cds() would return mostly the same result (and
translating it essentially yields the predicted peptide, but see below).

The exception is that if the initial exon is missing from a prediction there
is obviously no way to know the phase of it, and consequently you don't know
the frame of the first base of the first exon (the frame of the first base of
the initial exon is always 0). 

To automatically select the right frame I have added a method correct_phase()
to GeneStructure, which tries to adjust the frame (and phase) of a given DNA
sequence by prepending 1 or 2 Ns (in lower-case) to it until the DNA sequence
does no longer yield intervening termination codons in the frame starting with
the first base.

The automatic correction can be enabled by $gene->cds() with an argument of
TRUE, and indeed comparing the prediction to $gene->cds(1) so far always
showed no difference in the sequences. So far I haven't encountered a 'nasty'
phase problem as Ewan mentioned it for Ensembl, but if anyone has a 'hard'
example I'd be happy to test it.

Note that prepending Ns is exactly what Genscan does in its predicted CDS, so
you could even easily determine the frame by examining the number of Ns
prepended to the predicted CDS (provided you have it).

Bio::SeqFeature::GeneStructure is probably not the best place for
correct_phase() to stay, because it's rather generic to DNA sequences. At
present I'm not sure where to move it to. Has anyone any idea?


BTW the Ensembl Genscan parser is bound to bump at least on the version of
Genscan we have (1.0), because Genscan prints a lot of explanations at the end
of the file, thereby rendering the final predicted sequence invalid in FASTA
format (the sequences are read with the SeqIO::fasta.pm module).

Hilmar Lapp                                      email: hlapp@gmx.net
NFI Vienna, IFD/Bioinformatics                   phone: +43 1 86634 631
A-1235 Vienna                                      fax: +43 1 86634 727