[Bioperl-l] retrieval of PRELIMINARY uniprot sequences using Bio::Registry fails

Chris Fields cjfields at uiuc.edu
Wed Sep 6 10:59:01 EDT 2006


Brian,

I have found the issue with Bio::SeqIO::swiss; apparently UniProt has
switched to using the following ID line format:

ID   ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.

For Swiss-Prot IDs:

ID   CYC_BOVIN      STANDARD;      PRT;   104 AA.
ID   GIA2_GIALA     STANDARD;      PRT;   296 AA.

For TrEMBL (preliminary protein):

ID   Q5XPV6      PRELIMINARY;      PRT;   231 AA.

SeqIO 'swiss' sequence output currently always writes the first (Swiss-Prot)
form; STANDARD is hardcoded in a sprintf() statement.  I guess TrEMBL didn't
have a designation before, so this complicates things a little.
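
As a rough sketch of the idea (written in Python purely for illustration; it
is not the swiss.pm code, the function names are made up, and the column
widths are my guess from the sample lines above), the ID line could be parsed
and written back with the data class carried through rather than hardcoded:

```python
import re

# Pattern for the ID line layout quoted above:
# ID   ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
ID_LINE = re.compile(
    r"^ID\s+(?P<name>\S+)\s+(?P<data_class>[A-Z]+);\s+"
    r"(?P<mol_type>[^;\s]+);\s+(?P<length>\d+)\s+AA\.\s*$"
)


def parse_id_line(line):
    """Return the ID line fields as a dict, or None if the line does not match."""
    m = ID_LINE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["length"] = int(fields["length"])
    return fields


def format_id_line(name, data_class, mol_type, length):
    """Write an ID line, taking the data class as a parameter
    instead of always emitting STANDARD (widths are an assumption)."""
    return "ID   %-14s %s;      %s; %5d AA." % (name, data_class, mol_type, length)
```

With something like this, a TrEMBL entry read as PRELIMINARY would also be
written out as PRELIMINARY.
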

I have also found a few other (small) formatting differences, which we could
update fairly easily.

In the section of the release notes describing differences between the
Swiss-Prot and EMBL formats, this is listed:

* EMBL entry ID lines have an additional three-letter taxonomic division
'token' inserted between the data class and the molecule type;

I suppose we could use division() to store 'STANDARD' and 'PRELIMINARY' (or
'Swiss-Prot' and 'TrEMBL' if that's nicer).
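
A minimal sketch of that choice (again Python for illustration only;
DATA_CLASS_LABELS and division_for are hypothetical names, not BioPerl API):

```python
# Hypothetical mapping from the ID-line data class to a friendlier label.
DATA_CLASS_LABELS = {
    "STANDARD": "Swiss-Prot",
    "PRELIMINARY": "TrEMBL",
}


def division_for(data_class, use_labels=False):
    """Return the value to store as the division for a given data class.

    With use_labels=False the raw token is stored unchanged; with
    use_labels=True known tokens map to their database name and unknown
    tokens pass through untouched.
    """
    if use_labels:
        return DATA_CLASS_LABELS.get(data_class, data_class)
    return data_class
```

Storing the raw token keeps the round trip to the ID line trivial; the
friendlier labels would need the reverse mapping on output.
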

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Daniel Lang
> Sent: Wednesday, September 06, 2006 4:12 AM
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] retrieval of PRELIMINARY uniprot sequences using
> Bio::Registry fails
> 
> Hi Brian,
> 
> I'm now iterating over all uniprot_trembl sequences and recording those
> for which retrieval fails. Let's see if STANDARDs also fail...
> 
> How is the second field of the swissprot ID line handled anyway? Because
> PRELIMINARYs end up as STANDARD when being parsed by Bio::SeqIO::swiss.
> 
> On the other hand, I'm still confused why there's no error or warning
> when the retrieval fails. Can you give me a hint which modules (besides
> swiss.pm) to look at?
> 
> Cheers,
> Daniel
> 
> Brian Osborne wrote:
> > Daniel,
> >
> > Well, if you can isolate the bug please add it to bugzilla.
> >
> > Brian O.
> >
> >
> > On 9/5/06 5:57 AM, "Daniel Lang" <daniel.lang at biologie.uni-freiburg.de>
> > wrote:
> >
> >> Hi Brian,
> >>
> >> sorry for the belated response!
> >> I've compiled a set of 100 PRELIMINARY entries for you from the latest
> >> uniprot_trembl release. I've tried to reproduce the bug using only
> >> these as input to build an index, but (sadly) all of them can be
> >> retrieved using the latest checkout :-(
> >> Maybe it's not connected to these entries after all, but to the size
> >> or some other feature of the uniprot distribution?
> >> I could now make it work using the 1.5.1 release.
> >>
> >> Originally, I built the index using the flat protocol; when I try bdb
> >> and bioperl-live, even more problems occur:
> >>
> >> bp_bioflat_index.pl --dbname sw -i bdb -f swiss -l . -c uniprot_sprot.dat
> >>
> >> ------------- EXCEPTION  -------------
> >> MSG: The lineage 'Eukaryota, Metazoa, Chordata, Craniata, Vertebrata,
> >> Euteleostomi, Amphibia, Batrachia, Anura, Mesobatrachia, Pipoidea,
> >> Pipidae, Xenopodinae, Xenopus, Silurana, Xenopus, tropicalis' had two
> >> non-consecutive nodes with the same name. Can't cope!
> >> STACK Bio::DB::Taxonomy::list::add_lineage
> >> /home/lang/bioperl/bioperl-live/Bio/DB/Taxonomy/list.pm:163
> >> STACK Bio::DB::Taxonomy::list::new
> >> /home/lang/bioperl/bioperl-live/Bio/DB/Taxonomy/list.pm:100
> >> STACK Bio::DB::Taxonomy::new
> >> /home/lang/bioperl/bioperl-live/Bio/DB/Taxonomy.pm:106
> >> STACK Bio::Species::classification
> >> /home/lang/bioperl/bioperl-live/Bio/Species.pm:171
> >> STACK Bio::SeqIO::swiss::_read_swissprot_Species
> >> /home/lang/bioperl/bioperl-live/Bio/SeqIO/swiss.pm:1049
> >> STACK Bio::SeqIO::swiss::next_seq
> >> /home/lang/bioperl/bioperl-live/Bio/SeqIO/swiss.pm:240
> >> STACK Bio::DB::Flat::parse_one_record
> >> /home/lang/bioperl/bioperl-live/Bio/DB/Flat.pm:333
> >> STACK Bio::DB::Flat::BDB::_index_file
> >> /home/lang/bioperl/bioperl-live/Bio/DB/Flat/BDB.pm:235
> >> STACK Bio::DB::Flat::BDB::build_index
> >> /home/lang/bioperl/bioperl-live/Bio/DB/Flat/BDB.pm:218
> >> STACK toplevel
> >> /share/apps/bioperl/bioperl-live/scripts_temp/bp_bioflat_index.pl:113
> >>
> >> But I think this is connected to the new changes to taxonomy handling
> >> in Bio::Taxon...
> >> I'm unsure whether to submit this separately, but I could also provide
> >> an example of such a swissprot entry that causes this error.
> >>
> >> Thanks, again.
> >>
> >> Daniel
> >>
> >> Brian Osborne wrote:
> >>> Daniel,
> >>>
> >>> Bug, presumably in SeqIO/swiss.pm. Can you send me a small file with
> >>> such a PRELIMINARY entry?
> >>>
> >>> Brian O.
> >>>
> >>>
> >>> On 9/1/06 6:11 AM, "Daniel Lang"
> >>> <daniel.lang at biologie.uni-freiburg.de> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> when using Bio::Registry (bioperl-live) to fetch uniprot entries from
> >>>> local indexed uniprot *.dats, I had to realize that several entries
> >>>> could not be retrieved despite the fact that they are present in the
> >>>> files! A closer look reveals that they are of status PRELIMINARY:
> >>>>
> >>>> uniprot_trembl.dat:ID   Q16EZ1_AEDAE   PRELIMINARY;   PRT;   222 AA.
> >>>>
> >>>> A grep for PRELIMINARY finds nothing anywhere in my CVS checkout.
> >>>> I also can't retrieve the sequences from the online database defined
> >>>> as follows:
> >>>> [swissprot_ebi]
> >>>> protocol=biofetch
> >>>> location=http://www.ebi.ac.uk/cgi-bin/dbfetch
> >>>> dbname=swall
> >>>>
> >>>> Is this a bug or a feature? If it's a feature, how can I bypass it?
> >>>>
> >>>> Thanks in advance,
> >>>> Daniel
> >>>
> >>
> >>