[Bioperl-l] Bio/SeqIO/swiss.pm parsing error

Erik er at xs4all.nl
Fri Nov 3 14:59:47 EST 2006


Hi all,

I noticed the parsing is borked with the newest Swiss-Prot files:
  UniProt Knowledgebase Release 9 consists of:
  UniProtKB/Swiss-Prot Release 51.0 of 31-Oct-2006
  UniProtKB/TrEMBL Release 34.0 of 31-Oct-2006


I edited my local copy of Bio/SeqIO/swiss.pm to parse the ID lines
in swissprot/trembl according to the new specification (see
http://expasy.org/sprot/relnotes/sp_news.html).

Basically, the change is as follows:
  ID   EntryName DataClass; MoleculeType; SequenceLength.
is changed to:
  ID   EntryName DataClass; SequenceLength.



The only change I made is to the regex that matches the ID line, in
method next_seq (Bio/SeqIO/swiss.pm):

===============

  unless(  m/
               ^
                  ID              \s+     #
                  (\S+)           \s+     #  $1  entryname
                  ([^\s;]+);      \s+     #  $2  DataClass
                  [0-9]+[ ]AA     \.      #      Sequencelength (capture?)
                $
            /ox )
  {
    $self->throw("swissprot stream with no ID. Not swissprot in my book");
  }

===============


I tested this (i.e. each entry parses and the SeqIO object is created)
against several hundred Swiss-Prot and TrEMBL entries.

Of course, files in the older format no longer parse with this change.
It may be better to support both the old and new formats and try each
in turn (newest first).
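
A minimal sketch of the "try both, newest first" idea, assuming a
standalone helper named parse_id_line (hypothetical, not part of
Bio/SeqIO/swiss.pm) and illustrative ID lines. It also captures the
sequence length and, for old-format lines, the molecule type, which
the regex above does not:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::SeqIO::swiss: parse one ID line,
# trying the new two-field layout first and the old three-field layout second.
sub parse_id_line {
    my ($line) = @_;

    # New layout: ID   EntryName DataClass; SequenceLength.
    if ( $line =~ m/
            ^
            ID             \s+
            (\S+)          \s+    # capture 1: entry name
            ([^\s;]+);     \s+    # capture 2: data class
            ([0-9]+)[ ]AA  \.     # capture 3: sequence length
            $
        /x ) {
        return { name => $1, class => $2, length => $3, format => 'new' };
    }

    # Old layout: ID   EntryName DataClass; MoleculeType; SequenceLength.
    if ( $line =~ m/
            ^
            ID             \s+
            (\S+)          \s+    # capture 1: entry name
            ([^\s;]+);     \s+    # capture 2: data class
            ([^\s;]+);     \s+    # capture 3: molecule type
            ([0-9]+)[ ]AA  \.     # capture 4: sequence length
            $
        /x ) {
        return { name => $1, class => $2, molecule => $3,
                 length => $4, format => 'old' };
    }

    return;    # neither layout matched
}

# Illustrative lines only; real entries may use different spacing.
for my $line ( 'ID   CYC_BOVIN               Reviewed;         104 AA.',
               'ID   CYC_BOVIN      STANDARD;      PRT;   104 AA.' ) {
    my $id = parse_id_line($line) or die "no ID line matched: $line";
    print "$id->{name}: $id->{format} format, $id->{length} AA\n";
}
```

In next_seq the same ordering could replace the single unless() test,
with the throw kept for the case where neither pattern matches.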

hth,

Erik



