[Bioperl-l] Bio::SeqIO and bad entries in uniprot and interpro

Mikko Arvas Mikko.Arvas at vtt.fi
Fri Nov 26 08:32:46 EST 2004

Thanks a lot! It's fine now.

At 15:58 22.11.2004 -0500, Jason Stajich wrote:

>On Nov 18, 2004, at 6:53 AM, Mikko Arvas wrote:
>>I want to get all available Interpro matches for S. cerevisiae and some 
>>other species. So I need to parse Uniprot files to find a set of IDs for 
>>a given species and then get the Interpro matches from them. But the 
>>Uniprot release uniprot_trembl.dat gives an error towards the end of the 
>>file in the next_seq call:
>>use Bio::SeqIO;
>>my $inseq = Bio::SeqIO->new('-file'   => '<uniprot_trembl.dat',
>>                            '-format' => 'swiss');
>>while (my $seq = $inseq->next_seq) {
>>    # check species etc. in here
>>}
>>After happily processing a lot of sequences it gives:
>>Invalid [] range "6-1" in regex; marked by <-- HERE in m/^Tomato severe 
>>leaf curl virus-[Guatemala 96-1 <-- HERE ]$/
>>Same goes for interpro:
>>my $infeat = Bio::SeqIO->new('-file'   => '<match.xml',
>>                             '-format' => 'interpro');
>>while (my $feat = $infeat->next_seq) {
>>    # store features etc. in here
>>}
>>After happily processing a lot of features it gives:
>>not well-formed (invalid token) at line 2, column 53, byte 131 at 
>>/usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm line 187
>>I guess it's no wonder that such big DBs have errors or are out of sync 
>>with Perl modules etc., and I don't mind losing one seq or feature here or 
>>there. The files are rather big, so fixing them manually is a bit painful. 
>>But I need to somehow get most things processed. Is there a way to skip 
>>these bad entries, or would you have some other smart ideas?
>I think this has to do with some unsafe code in the swiss.pm module, which
>compares the species name against a list of Unknown species name values 
>and ends up interpreting the 96-1 as a range in a regexp.  Putting a \Q 
>in front of the variable where this is being compared should be enough to 
>fix it.  This is the grep on line 986.
>- return if grep { /^$binomial$/ } @Unknown_names;
>+ return if grep { /^\Q$binomial$/ } @Unknown_names;
>There was one more place in the code that did this as well, which I think I 
>have fixed.
>I'm checking this in to CVS, so do a cvs update and see if your problem 
>persists.  I've tested it against the uniprot_trembl.dat.
>Not sure what the problem is with the interpro parser, someone else will 
>need to look into that.
>>I have bioperl 1.4 and the latest Bio::SeqIO (for swiss.pm to work 
>>correctly) from CVS on SuSE 8.1.
>>Thanks a million for any help!
>>Mikko Arvas
>>VTT Biotechnology
>>e-mail:            mikko.arvas at vtt.fi
>>tel:                 +358-(0)9-456 5827
>>mobile:           +358-(0)44-381 0502
>>fax:                +358-(0)9-455 2103
>>mail:               Tietotie 2, Espoo
>>                       P.O. Box 1500
>>                       FIN-02044 VTT, Finland
>Jason Stajich
>jason.stajich at duke.edu
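
For anyone hitting the same error, here is a minimal sketch of the \Q idea from the reply above, using only core Perl. The @unknown_names list and the is_unknown_name helper are hypothetical stand-ins for illustration, not the actual contents of swiss.pm. The sketch also closes the quoted span with \E before the end anchor, so that the anchor itself stays outside the quoting.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's list of placeholder species names.
my @unknown_names = ('Unknown', 'unidentified', 'Unknown sp.');

# Returns true if $name is exactly equal to one of the placeholder names.
# \Q...\E makes every character of $name literal inside the pattern.
sub is_unknown_name {
    my ($name) = @_;
    return scalar(grep { /^\Q$name\E$/ } @unknown_names);
}

my $binomial = 'Tomato severe leaf curl virus-[Guatemala 96-1]';

# Without \Q, "[Guatemala 96-1]" is compiled as a character class and
# "6-1" is rejected as an invalid range; eval traps the fatal error.
my $unquoted_ok = eval { my @hits = grep { /^$binomial$/ } @unknown_names; 1 };
print "unquoted pattern failed: $@" unless $unquoted_ok;

# With \Q...\E the same name is compared literally and no longer dies.
print "virus name:  ", (is_unknown_name($binomial)      ? "unknown" : "known"), "\n";
print "placeholder: ", (is_unknown_name('Unknown sp.') ? "unknown" : "known"), "\n";
```

Since the check is really an exact string comparison, a plain `eq` inside the grep would avoid the regexp altogether; the sketch keeps the regexp form only to stay close to the patch.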

Mikko Arvas
VTT Biotechnology

e-mail:            mikko.arvas at vtt.fi
tel:                 +358-(0)9-456 5827
mobile:           +358-(0)44-381 0502
fax:                +358-(0)9-455 2103
mail:               Tietotie 2, Espoo
                       P.O. Box 1500
                       FIN-02044 VTT, Finland
