[Bioperl-l] Bio::SeqIO and bad entries in uniprot and interpro

Mikko Arvas Mikko.Arvas at vtt.fi
Fri Nov 26 08:44:26 EST 2004


Here is the first entry in match.xml that gives an error:
<protein id="O00408" name="CN2A_HUMAN" length="941" crc64="9797609B487FD64E">
<interpro id="IPR002073" name="3&apos;5&apos;-cyclic nucleotide phosphodiesterase" type="Domain" parent_id="IPR003607">
<match id="PF00233" name="PDEase_I" dbname="PFAM">
<location start="655" end="892" status="T" evidence="HMMPfam" score="0.0" />
<match id="PR00387" name="PDIESTERASE1" dbname="PRINTS">
<location start="651" end="664" status="T" evidence="FPrintScan" score="7.399999999999999E-30" />
<location start="682" end="695" status="T" evidence="FPrintScan" score="7.399999999999999E-30" />
<location start="696" end="711" status="T" evidence="FPrintScan" score="7.399999999999999E-30" />
<location start="724" end="740" status="T" evidence="FPrintScan" score="7.399999999999999E-30" />
<location start="804" end="817" status="T" evidence="FPrintScan" score="7.399999999999999E-30" />
<location start="821" end="837" status="T" evidence="FPrintScan" score="7.399999999999999E-30" />
<match id="PS00126" name="PDEASE_I" dbname="PROSITE">
<location start="696" end="707" status="T" evidence="AddProsite" score="8.0E-5" />
<match id="SSF48547" name="PDEase" dbname="SSF">
<location start="573" end="919" status="T" evidence="HMMPfam" score="9.099999999999999E-110" />
<interpro id="IPR003018" name="GAF" type="Domain">
<match id="PF01590" name="GAF" dbname="PFAM">
<location start="241" end="377" status="T" evidence="HMMPfam" score="4.7E-10" />
<location start="409" end="548" status="T" evidence="HMMPfam" score="9.999999999999999E-26" />
<match id="PS50813" name="GAF" dbname="PREFILE">
<location start="396" end="550" status="T" evidence="PrfScan" score="11.073" />
<match id="SM00065" name="GAF" dbname="SMART">
<location start="241" end="387" status="T" evidence="Smart" score="7.3E-18" />
<location start="409" end="558" status="T" evidence="Smart" score="6.1E-38" />
<interpro id="IPR003607" name="Metal-dependent phosphohydrolase, HD region" 
<match id="SM00471" name="HDc" dbname="SMART">
<location start="653" end="822" status="T" evidence="Smart" score="1.0E-6" />
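
One possible way to narrow down failing entries like the one above is to check each record on its own, so that a single bad record does not abort the whole run. The following is only a sketch under stated assumptions: it assumes every record in the file ends with a literal </protein> tag, and the names find_bad_records and $check are hypothetical, not part of BioPerl.

```perl
use strict;
use warnings;

# Read a file handle one <protein>...</protein> record at a time and
# return the records that the supplied checker rejects.
# ASSUMPTION: every record ends with a literal "</protein>" tag.
# $check is a hypothetical callback that dies when a record is bad.
sub find_bad_records {
    my ($fh, $check) = @_;
    my @bad;
    local $/ = '</protein>';    # read up to and including each closing tag
    while (my $chunk = <$fh>) {
        # Drop anything before the record start (XML header, root tag).
        next unless $chunk =~ s/\A.*?(?=<protein\b)//s;
        # Skip a trailing remainder that holds no complete record.
        next unless $chunk =~ m{</protein>\z};
        my ($id) = $chunk =~ /<protein\s+id="([^"]*)"/;
        # Run the checker; remember the record id and message on failure.
        eval { $check->($chunk); 1 }
            or push @bad, +{ id => $id, error => "$@" };
    }
    return @bad;
}
```

If XML::Parser is installed, the checker could be something like sub { XML::Parser->new->parse($_[0]) }, since parse dies on input that is not well-formed. The records that pass could then be written to a cleaned file and fed to Bio::SeqIO as before.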

At 22:29 22.11.2004 -0800, Hilmar Lapp wrote:

>On Monday, November 22, 2004, at 12:58  PM, Jason Stajich wrote:
>>>Same goes for interpro:
>>>my $infeat = Bio::SeqIO->new('-file'   => '<match.xml',
>>>                             '-format' => 'interpro');
>>>while (my $feat = $infeat->next_seq) {
>>>    # store features etc. in here
>>>}
>>>After happily processing a lot of features it gives:
>>>not well-formed (invalid token) at line 2, column 53, byte 131 at 
>>>/usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm line 187
>Can you locate the position that raises the error? I have seen errors like 
>this thrown on non-ASCII characters.
>>>I guess it's no wonder that such big DBs have errors or are out of sync 
>>>with perl modules etc. and I don't mind losing one seq or feature here 
>>>or there. The files are rather big so fixing them manually is a bit 
>>>painful. But I need to somehow get most things processed, is there a way 
>>>to skip these bad entries or would you have some other smart ideas?
>XML::Parser being built on top of expat, there is really no way of 
>recovering from an XML violation that would let you resume parsing of the 
>         -hilmar
>Hilmar Lapp                            email: lapp at gnf.org
>GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
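
To follow up on the question about locating the offending position, a small scan for suspicious bytes may help. This is a minimal sketch, not a definitive diagnosis: it flags non-ASCII bytes and the control characters that XML 1.0 forbids, and reports a line and a 1-based column for each. The name find_suspicious_bytes and the script name in the usage comment are hypothetical. Note that non-ASCII bytes are perfectly legal in a correctly encoded file, so the hits are only candidates to inspect.

```perl
use strict;
use warnings;

# Return a list of [line, column, byte value] for each suspicious byte.
# "Suspicious" here means a non-ASCII byte (0x80-0xFF) or a control
# character below 0x20 other than tab, LF and CR.
sub find_suspicious_bytes {
    my ($fh) = @_;
    my @hits;
    while (my $line = <$fh>) {
        while ($line =~ /([^\x09\x0A\x0D\x20-\x7F])/g) {
            # pos() is the offset just past the match, i.e. its 1-based column.
            push @hits, [ $., pos($line), ord($1) ];
        }
    }
    return @hits;
}

# Hypothetical usage: perl find_bad_bytes.pl match.xml
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    printf "line %d, column %d: byte 0x%02X\n", @$_
        for find_suspicious_bytes($fh);
    close $fh;
}
```

The file is opened with the :raw layer so that each character seen by the regex is exactly one byte of the file. The line and column reported here count from the start of the file and may not match the numbers in the parser's error message.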

Mikko Arvas
VTT Biotechnology

e-mail:            mikko.arvas at vtt.fi
tel:                 +358-(0)9-456 5827
mobile:           +358-(0)44-381 0502
fax:                +358-(0)9-455 2103
mail:               Tietotie 2, Espoo
                       P.O. Box 1500
                       FIN-02044 VTT, Finland
