[Bioperl-l] SeqIO (stress) testing

Kris Boulez krbou@pgsgent.be
Wed, 20 Dec 2000 12:13:57 +0100

Quoting Hilmar Lapp (hlapp@gmx.net):
> Kris Boulez wrote:
> > 
> > - starting from t/test.genbank, writing a swiss-prot file gives (we die,
> >   no error thrown)
> test.genbank is DNA. Do you translate it?
Nope. I checked that test.fasta is protein, but forgot this one.
Should this matter (i.e. does Swissprot check that it is writing a protein
sequence)?

> Genbank (DNA) and Swissprot feature tables are basically incompatible.
> The post I quoted recently contains an example, I think. (E.g., you can't
> have 'source' in a Swissprot feature table; the latter is supposed to
> contain only protein sites.)
> > 
> > - starting from t/test.genbank, writing a gcg file, reading this gcg
> >   file gives
> > -------------------- EXCEPTION --------------------
> > MSG: Looks like start of another sequence. See documentation.
> > SCRIPT: seqtest.pl
> > STACK:
> > Bio::SeqIO::gcg::next_seq(123)
> > main::seqtest.pl(14)
> > ---------------------------------------------------
> > 
> > - starting from t/test.embl, there is a problem for SeqIO to read a gcg
> >   file it wrote itself (it just loops forever). I will investigate this
> > one further as it's not clear when/what happens.
> > 
> The GCG module seems to be broken. I wanted to use it some time ago, but
> it didn't even want to read simple sequence files. At that time we had
> GCG 10, maybe something in the format has changed. GCG format is
> problematic, because there really isn't a genuine GCG format. A Genbank
> sequence in GCG format is in fact the sequence in Genbank format with 1
> header line prepended and the sequence formatted specially (with a line
> containing checksum etc, and the notorious two dots). Likewise for an
> EMBL sequence.
> How many people have a serious interest in this module? If there are
> some, could you also provide some example files of a recent GCG version
> (e.g., 10.1); I personally don't have access to GCG presently.

Given the widespread use of GCG there is (I guess) an interest. We ran
into this lack of a well-defined GCG format in another project as well.
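
To make the layout described above concrete, here is a minimal sketch in
plain Python (standard library only; this is not the Bio::SeqIO::gcg code).
It assumes that the first line ending in the two dots closes the header and
that everything after it is sequence mixed with position numbers. The
function name split_gcg and the example record are invented for
illustration, and the Check value is a placeholder, not a real checksum.

```python
import re


def split_gcg(text):
    """Split GCG-style text into (header_lines, sequence).

    Assumption: the first line ending in '..' closes the header, and every
    line after it holds residues mixed with position numbers and blanks.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            header = lines[: i + 1]
            # Drop position numbers and whitespace, keep the residue letters.
            sequence = re.sub(r"[\d\s]", "", "".join(lines[i + 1:]))
            return header, sequence
    raise ValueError("no line ending in '..' found; not GCG-style text")


# Hypothetical record, invented for illustration only.
example = """\
LOCUS       HYPOTHETICAL  20 bp  DNA
hypo.seq  Length: 20  Type: N  Check: 0000  ..

       1  ACGTACGTAC GTACGTACGT
"""

header, sequence = split_gcg(example)
print(len(header), sequence)
```

A header that itself happens to contain a line ending in two dots would fool
this sketch, which is one way the lack of a strict format definition can
bite a parser.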

> > By looking at the test (and test sequences) we have now I saw that we
> > only try to read the first sequence from our test sequence files (apart
> > from GCG, which reads more than one file). The test.embl even contains
> > only one sequence. I think that we should test for reading/writing
> > multiple sequences from one file.
> > 
> Genbank format and FASTA are tested for reads of multiple entries.
> (Check further down the script.)
I missed the Genbank test. As far as I can see the test for Fasta is
using Bio::SeqIO::MultiFile (test 17) or works on one sequence (tests