[Bioperl-l] load_seqdatabase fails when loading refseq plant files

Mike Muratet muratem at eng.uah.edu
Fri Aug 11 12:10:30 EDT 2006


Hello all

I am using biosql-schema/bioperl-db to load RefSeq entries into a BioSQL 
database. I don't see any version info in the files, but I downloaded 
everything within the last month or so, and everything passed all the 
tests when installed. I am using perl 5.8.5, mysql 5.0.22, DBI-1.5.1, 
DBD-mysql-3.006. I was loading the plant files from RefSeq release 18:

load_seqdatabase.pl --dbname biosql --lookup --u --namespace plant --format genbank --safe plant*.rna.gbff.gz

and it crashed after about 30K of 60K records:

at /usr/lib/perl5/site_perl/5.8.5/Bio/biosql-schema/sql/bioperl-db/scripts/biosql/load_seqdatabase.pl 
line 633

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values 
were ("","Direct Submission","Submitted (01-JUL-2004) National Center for 
Biotechnology Information, National Institutes of Health, Bethesda 20894, 
United States of America","CRC-6F1453182E2BAC3F","1","786","") FKs 
(<NULL>)
Duplicate entry 'CRC-6F1453182E2BAC3F' for key 3
---------------------------------------------------
Could not store XM_472403:
------------- EXCEPTION  -------------
MSG: create: object (Bio::Annotation::Reference) failed to insert or to be 
found by unique key
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create 
/usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:208
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store 
/usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254
STACK Bio::DB::Persistent::PersistentObject::store 
/usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:272
STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children 
/usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:219
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create 
/usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:216

I traced the error back through the source and database and found that 
XM_472403 has the same CRC value as XM_473880. I actually got many errors of this type, 
but only the last one crashed the script (in spite of --safe).
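
For illustration only, the failure mode can be sketched in Python with an 
in-memory SQLite table. Everything here is an assumption rather than 
bioperl-db code: the table layout is simplified, the key is taken to be a 
checksum over authors, title and location, the function names are made up, 
and zlib's CRC-32 stands in for whatever checksum produced the value in the 
log. Two entries that carry the same "Direct Submission" reference get the 
same key, so the second insert hits the unique constraint and has to fall 
back to a lookup.

```python
import sqlite3
import zlib


def reference_key(authors, title, location):
    # Hypothetical key: a checksum over the three text fields only.
    # CRC-32 is a stand-in here, not the checksum seen in the log.
    payload = "|".join(f if f else "<undef>" for f in (authors, title, location))
    return "CRC-%08X" % zlib.crc32(payload.encode("utf-8"))


def store_reference(conn, authors, title, location):
    """Insert a reference; on a duplicate key, return the existing row id."""
    crc = reference_key(authors, title, location)
    try:
        cur = conn.execute(
            "INSERT INTO reference (authors, title, location, crc) "
            "VALUES (?, ?, ?, ?)",
            (authors, title, location, crc),
        )
        return cur.lastrowid
    except sqlite3.IntegrityError:
        # Same situation as "Duplicate entry ... for key" in the warning:
        # the unique index on crc rejected the row, so look it up instead.
        row = conn.execute(
            "SELECT reference_id FROM reference WHERE crc = ?", (crc,)
        ).fetchone()
        return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reference ("
    "reference_id INTEGER PRIMARY KEY, authors TEXT, title TEXT, "
    "location TEXT, crc TEXT UNIQUE)"
)

location = (
    "Submitted (01-JUL-2004) National Center for Biotechnology Information, "
    "National Institutes of Health, Bethesda 20894, United States of America"
)
# The same reference text arrives with two different sequence entries.
first_id = store_reference(conn, "", "Direct Submission", location)
second_id = store_reference(conn, "", "Direct Submission", location)
row_count = conn.execute("SELECT COUNT(*) FROM reference").fetchone()[0]
```

In this sketch the fallback lookup succeeds, so both entries end up sharing 
one row. The exception above says that neither the insert nor the lookup by 
unique key worked, and the sketch does not try to reproduce that part.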

Should there be more info included in the CRC field? I am weak when 
it comes to RDBMSs, but looking at the schema, I would guess that the CRC 
field was added to make an otherwise degenerate key unique. Would it help 
to add more fields to the CRC, or another key? The former might be done 
without having to change a lot of code.
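
To make the question concrete, here is a small Python sketch of what 
"more fields in the CRC" would mean. It rests on assumptions: the current 
checksum is taken to cover only authors, title and location, the accession 
is just one example of an extra field, the checksum() helper is made up, 
and CRC-32 from zlib stands in for the real checksum.

```python
import zlib


def checksum(*fields):
    # Hypothetical helper: join the fields and checksum the result.
    # CRC-32 is a stand-in, not the checksum bioperl-db computes.
    payload = "|".join(f if f else "<undef>" for f in fields)
    return "CRC-%08X" % zlib.crc32(payload.encode("utf-8"))


title = "Direct Submission"
location = (
    "Submitted (01-JUL-2004) National Center for Biotechnology Information, "
    "National Institutes of Health, Bethesda 20894, United States of America"
)
accessions = ("XM_472403", "XM_473880")

# Key over the reference text alone: identical for both entries.
narrow = {acc: checksum("", title, location) for acc in accessions}

# Key that also includes the accession: one distinct value per entry.
wide = {acc: checksum("", title, location, acc) for acc in accessions}

print(narrow["XM_472403"] == narrow["XM_473880"])  # True
print(wide["XM_472403"] == wide["XM_473880"])      # False
```

The wider key avoids the duplicate, but it also means that entries with an 
identical reference would no longer share a single row, so whether this is 
desirable depends on what the key is meant to express.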

Thanks

Mike

