[Bioperl-l] SeqIO (stress) testing

Hilmar Lapp hlapp@gmx.net
Wed, 20 Dec 2000 02:13:33 -0800

Kris Boulez wrote:
> - starting from t/test.genbank, writing a swiss-prot file gives (we die,
>   no error thrown)

test.genbank is DNA. Do you translate it?

Genbank (DNA) and Swissprot feature tables are basically incompatible.
The post I quoted recently contains an example, I think. (E.g., you can't
have 'source' in a Swissprot feature table; the latter is supposed to
contain only protein sites.)

> - starting from t/test.genbank, writing a gcg file, reading this gcg
>   file gives
> -------------------- EXCEPTION --------------------
> MSG: Looks like start of another sequence. See documentation.
> SCRIPT: seqtest.pl
> Bio::SeqIO::gcg::next_seq(123)
> main::seqtest.pl(14)
> ---------------------------------------------------
> - starting from t/test.embl, there is a problem for SeqIO to read a gcg
>   file it wrote itself (it just loops forever). I will investigate this
> one further as it's not clear when/what happens.

The GCG module seems to be broken. I wanted to use it some time ago, but
it would not even read simple sequence files. At that time we had
GCG 10; maybe something in the format had changed. GCG format is
problematic, because there really isn't a genuine GCG format. A Genbank
sequence in GCG format is in fact the sequence in Genbank format with 1
header line prepended and the sequence formatted specially (with a line
containing the checksum etc., and the notorious two dots). Likewise for an
EMBL sequence.
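
To make that layout concrete, here is a rough sketch (Python, stdlib only,
for illustration; it is not what Bio::SeqIO::gcg does). It assumes the first
line ending in ".." is the divider, that this line carries "Length:" and
"Check:" fields, and that the checksum is the usual position-weighted sum
with weights cycling 1..57, modulo 10000. The sample record and the names
split_gcg, declared_values and gcg_checksum are made up for the example.

```python
import re

# Made-up miniature record: annotation text, then the divider line ending
# in "..", then the numbered sequence block.
SAMPLE = """\
LOCUS       EXAMPLE   4 bp    DNA
example.seq  Length: 4  Type: N  Check: 748  ..

       1  ACGT
"""


def gcg_checksum(seq):
    """Position-weighted checksum; weights cycle through 1..57 (assumed)."""
    total = 0
    for i, ch in enumerate(seq.upper()):
        total += (i % 57 + 1) * ord(ch)
    return total % 10000


def split_gcg(text):
    """Split a record into (annotation, divider line, bare sequence).

    Assumption: the first line ending in '..' is the divider. Annotation
    that itself contains such a line would fool this sketch.
    """
    lines = text.splitlines()
    for idx, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            annotation = "\n".join(lines[:idx])
            body = "".join(lines[idx + 1:])
            # Drop the position numbers and all whitespace.
            seq = re.sub(r"[\s\d]", "", body)
            return annotation, line, seq
    raise ValueError("no '..' divider line found")


def declared_values(divider):
    """Return (length, checksum) as declared on the divider line."""
    m_len = re.search(r"Length:\s*(\d+)", divider)
    m_chk = re.search(r"Check:\s*(\d+)", divider)
    if not (m_len and m_chk):
        raise ValueError("divider line lacks Length:/Check: fields")
    return int(m_len.group(1)), int(m_chk.group(1))
```

A reader built this way can compare declared_values(divider) against
(len(seq), gcg_checksum(seq)) to notice a truncated or misparsed record
before handing the annotation part to a Genbank or EMBL parser.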

How many people have a serious interest in this module? If there are
some, could you also provide some example files from a recent GCG version
(e.g., 10.1)? I personally don't have access to GCG at present.

> By looking at the test (and test sequences) we have now I saw that we
> only try to read the first sequence from our test sequence files (apart
> from GCG, which reads more than one file). The test.embl even contains
> only one sequence. I think that we should test for reading/writing
> multiple sequences from one file.

Genbank format and FASTA are tested for reads of multiple entries.
(Check further down the script.)

Hilmar Lapp                                email: hlapp@gmx.net
GNF, San Diego, Ca. 92122                  phone: +1 858 812 1757