[Bioperl-l] Bio/SeqIO/swiss.pm parsing error
James D. White
jdw at ou.edu
Mon Nov 13 18:50:15 EST 2006
"Erik" <er at xs4all.nl> wrote:
>I noticed the parsing is borked with the newest Swiss-Prot files:
> UniProt Knowledgebase Release 9 consists of:
> UniProtKB/Swiss-Prot Release 51.0 of 31-Oct-2006
> UniProtKB/TrEMBL Release 34.0 of 31-Oct-2006
>I edited my local copy of Bio/SeqIO/swiss.pm to parse the ID lines
>in swissprot/trembl according to the new specification (see
>Basically, the change is as follows:
> ID EntryName DataClass; MoleculeType; SequenceLength.
>is changed to:
> ID EntryName DataClass; SequenceLength.
>The change I made was only in the regex capturing the entry name:
>method next_seq (Bio/SeqIO/swiss.pm) :
> unless( m/
>     ID \s+ #
>     (\S+) \s+ # $1 entryname
>     ([^\s;]+); \s+ # $2 DataClass
>     [0-9]+[ ]AA \. # Sequencelength (capture?)
> /ox ) {
>     $self->throw("swissprot stream with no ID. Not swissprot in my book");
> }
How about something like the following to recognize both old and new formats?
unless( m/
    ID \s+ #
    (\S+) \s+ # $1 entryname
    ( (?: [^\s;]+; \s+ )? ) # $2 DataClass (including ";\s+")
    [0-9]+[ ]AA \. # Sequencelength (capture?)
/ox ) {
    $self->throw("swissprot stream with no ID. Not swissprot in my book");
}
# Because $2 now contains a trailing ";\s+" in the new format, it needs to be fixed
$DataClass = $2 || 'default DataClass'; # provide default for old file format
$DataClass =~ s/;\s+$//; # remove trailing ";\s+"
The code trailing the unless block should be modified to use the appropriate
variable names. This is provided only to show what post-match modification is
needed.
>I tested this (=entry parsable and SeqIO created) against several
>hundred Swissprot and Trembl entries.
>Of course, files with the older format are now broken - it may be better
>to leave old and new format, and try both (newest first).
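For reference, here is a minimal sketch of one pattern that accepts both ID
layouts quoted above. It assumes that DataClass is present in both layouts and
that MoleculeType is the only field that may be missing, which is what the two
layout lines show. The helper name parse_id_line and the sample ID lines are
made up for illustration; none of this is the actual code in
Bio/SeqIO/swiss.pm.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::SeqIO::swiss.
# Returns (entry name, data class), or an empty list when the line
# does not look like a Swiss-Prot ID line in either layout.
sub parse_id_line {
    my ($line) = @_;
    if ( $line =~ m/
            ^ID \s+
            (\S+) \s+               # capture 1: entry name
            ([^\s;]+); \s+          # capture 2: DataClass, both layouts
            (?: [^\s;]+; \s+ )?     # MoleculeType, old layout only
            [0-9]+ [ ] AA \.        # sequence length
        /x ) {
        return ( $1, $2 );
    }
    return;
}

# Illustrative ID lines only, not taken from a real release file.
for my $line (
    'ID   EXAMPLE_HUMAN   STANDARD;      PRT;   100 AA.',    # old layout
    'ID   EXAMPLE_HUMAN   Reviewed;         100 AA.',        # new layout
) {
    my ( $name, $class ) = parse_id_line($line);
    print defined $name ? "$name $class\n" : "no ID line\n";
}
```

Because the optional group is non-capturing, the second capture holds the
DataClass without a trailing ";" in both layouts, so no post-match cleanup is
needed with this variant.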