[Bioperl-l] Error while running load_seqdatabase.pl

Hilmar Lapp hlapp at gmx.net
Tue Jan 30 21:59:21 EST 2007


George,

as I suggested, you don't have a value for the primary_id(). I use  
it, for example, to store the NCBI GI#, if the record has it, but  
yours don't. IMHO the bioperl parser is mistaken in using the FASTA  
ID token for the primary_id(), but that's what it does, and you can  
undo it by setting the value to undef, which is what it should be so  
long as you don't exactly know a better value.

You set the value to undef by doing

	$seq->primary_id(undef);

If you print the value afterwards, don't be confused by seeing some  
strange value there; this is only because the Bioperl model requires  
primary_id() to return something that is unique, and if you don't set  
a value (or reset it to undef), it will use the memory location of  
the object. Bioperl-db recognizes this and will not store this dummy  
value, since in BioSQL the identifier column is optional (nullable).

If you are unconvinced of this, use the same value as you use for the  
accession, just don't use something the uniqueness of which is  
questionable, and don't use a string longer than 40 chars.

You still haven't explained on how you got the FASTA file in the  
first place. The reason I have been asking is that I suspect that  
even your first component is not unique in the file as it looks like  
you are pulling database search hits from a sequence database (BLAST,  
fastacmd?), and your sequences may get hit multiple times. I have no  
way of verifying how unique your identifiers are, so you need to  
check that yourself. You have done so for the 4th element already,  
what about the first?

It is key that you know how to uniquely identify a sequence in your  
file, if you want every single sequence in the file as a separate  
record in the database. If no part of your ID line is unique but you  
want it to be, append a random (or incremented) number.

Hth,

	-hilmar

On Jan 30, 2007, at 12:32 PM, George Heller wrote:

> Hi Hilmar,
>
> I have been trying to get around this problem for some days now,  
> but havent had much luck. My file has about 12000 odd records, and  
> when I try to load it with a pipeline to the package, I have about  
> 483 records that get loaded.
>
> sub process_seq
> {
>   my ($self, $seq) = @_;
> #  $seq->accession_number($seq->display_id);
>   my @ids = split(/\|/,$seq->display_id);
>   $seq->accession_number($ids[0]);
>   $seq->primary_id($ids[2]);
>   return ($seq);
> }
> I am assuming this is because I use the ids[2] for the primary_id,  
> and possibly the file has 483 unique records for that field. When I  
> use some other reference like ids[3] etc(coz it has more unique  
> values), I get the same error about the length being more than 40.
>
> I printed out the values for the display_id, primary_id and  
> accession_number. The display_id has the entire first line of the  
> file, so does the primary_id, if I dont set it in my new script.  
> The accession number is split correctly and printed.
>
> This is how the first few lines in my file look like,
>
> >FGENESHT0000001||AC155633|570|4400|1
> MTERKRKEIEDRKRKISGPQPGSSNRPRFSGNQPQQFRQNQRPPQQHQQFQRQYPQHQYQNRQSNQSGGQ 
> FQRQNQQAPR
> LPAPAAQQNSQATPAQVGNRACFHCGEQGHWVMQCPKKAAQQQSGPNAPAKQNVPQPRAGNRSQPRYNHG 
> RLNHLEAEAV
> QETPSMIVGMFPVDSHIAEVLFDTGATHSFITASWVEAHNLPITTMSTPIQIDSAGGRIRADSICLNICV 
> EIRGIAFPAN
> LIVMGTQGIDVILGMNWLDKYQAVISCDKRTIKLMSPLGEEVVTELVPPEPKRGSCYQLAVDSSEVDPIE 
> SIRVVSEFPD
> VFPKDLPGMPPERKVEFAIELLPGTAPIFKRAYRISGPELVELKEQIDELSEKGYIRPSTSPWAAPVLFV 
> EKKDGTKRMC
> IDYRALNEVTIKNKYPLPRIEDLFDQLRGASVFSKIDLRSAFFMNLMNSVFMDYLDKFVVVFIDDILVYS 
> QSEEEHADHL
> KMVLQRLREHQLYAKLSKCEFWINEVLFLGHIINKEGLAVDPKKVANILNWKAPTDARGIKSFIGMVGYY 
> RRFIEGFSKI
> AKPMTALLGNKVEFKWTQKCQEAFEALKEKLTIAPVLVLPDVHKPFSVYCDACYTGLGCVLMQEGRVVAY 
> SSRQLKVHEK
> NYPIHDLELAAVVHALKTWRHYLYGQKCDVYTDHKSLKYIFTQSELNMRQRRWLELIKDYELEIHYHPGK 
> ANVVADALSR
> KSQVNLMVARPMPYELAKEFDRLSLGFLNNSRGVTVELEPTLEREIKEAQKNDEKISEIRRLILDGRGKD 
> FREDAEGVIW
> FKDRLCVPNVQSIRELILKEAHETAYSIHPGSEKMYQDLKKKFWWYGMKREIAEHVAMCDSCRRIKAEHQ 
> RPAGLLQPLQ
> IPQWKWDEIGMDFI  
> VGLPRTRAGYDSIWVVVDRLTKSAHFIPVKTNYSSAVLAELYMSRIVCLHGVPKKIVSDRGTQFTS
> HFWRQLHEALGTHLNFSSAYHPQTDGQTERTNQILEDMLRACALQDQSGWDKRLPYAEFSYNNSYQASLK 
> MSPFQALYGR
> SCRTPLQWDQPGEKQVFGPDILLEAEENIKMVRENLKIAQSRQRSYADTRRRELSFEVGDFVYLKVSPIR 
> GVKRFGVKGK
> LAPRYIGSYQILARRGEVAYQLSLPENLSAVHDVFHVSQLKKCLRVPEEQLPVEGLEVQEDLTYVEKPVQ 
> ILEVADRVTR
> RKTIRMCKVRWNHHSEEEATSEREDDLMAKYPELFASQP*
> Any suggestions?
>
> Thanks!
> George.
>
>
>
>
>
> Hilmar Lapp <hlapp at gmx.net> wrote:
> That's odd indeed. Did you try and put a print statement before the
> return statement that proves that 1) the codes gets executed, and 2)
> display_id(), primary_id() and accession_number() have the expected
> values?
>
> BTW you might also want to set primary_id() to undef (as the ID found
> in the FASTA files doesn't really count as a primary database-
> specific ID anyway). The identifier column in bioentry (which
> primary_id() maps to) is constrained to 40 chars as well.
>
> -hilmar
>
> On Jan 27, 2007, at 9:31 PM, George Heller wrote:
>
> > Hi Hilmar,
> >
> > I tried the lookup and noupdate options, and also made changes to
> > the SeqProcessor.pm package for the accession,
> > package SeqProcessor::Accession;
> > use strict;
> > use vars qw(@ISA);
> > use Bio::Seq::BaseSeqProcessor;
> > use Bio::SeqFeature::Generic;
> > @ISA = qw(Bio::Seq::BaseSeqProcessor);
> > sub process_seq
> > {
> > my ($self, $seq) = @_;
> > my @ids = split(/|/,$seq->display_id);
> > $seq->accession_number($ids[0]);
> > return ($seq);
> > }
> > 1;
> > I invoke the load_seqdatabase.pl as,
> > perl load_seqdatabase.pl -host localhost -dbname usda-06 -format
> > fasta -dbuser postgres -driver Pg --lookup --noupdate --
> > pipeline="SeqProcessor::Accession" maize_pep.fasta
> > Loading maize_pep.fasta ...
> > I get the error,
> > -------------------- WARNING ---------------------
> > MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values
> > were ("FGENESHT0000021||AC155633|113788|
> > 114708|-1","FGENESHT0000021||AC155633|113788|
> > 114708|-1","FGENESHT0000021||AC155633|113788|114708|-1","","0","")
> > FKs (1,)
> > ERROR: value too long for type character varying(40)
> > ---------------------------------------------------
> > Could not store FGENESHT0000021||AC155633|113788|114708|-1:
> > ------------- EXCEPTION -------------
> > MSG: error while executing statement in
> > Bio::DB::BioSQL::SeqAdaptor::find_by_unique_key: ERROR: current
> > transaction is aborted, commands ignored until end of transaction
> > block
> > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /
> > home/akar/local/perl//Bio/DB/BioSQL/BasePersistenceAdaptor.pm:951
> > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /
> > home/akar/local/perl//Bio/DB/BioSQL/BasePersistenceAdaptor.pm:855
> > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /home/akar/
> > local/perl//Bio/DB/BioSQL/BasePersistenceAdaptor.pm:205
> > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /home/akar/
> > local/perl//Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254
> > STACK Bio::DB::Persistent::PersistentObject::store /home/akar/local/
> > perl//Bio/DB/Persistent/PersistentObject.pm:272
> > STACK (eval) load_seqdatabase.pl:620
> > STACK toplevel load_seqdatabase.pl:602
> > --------------------------------------
> > at load_seqdatabase.pl line 633
> > As far as I gather, this error shouldnt appear as we are
> > filtering out the accession as only the first code that appears.
> > Ideas?
> >
> > George.
> >
> >
> > Hilmar Lapp wrote:
> > George,
> >
> > I don't know you create the FASTA file, but that's probably where  
> the
> > root cause is. Based on the message:
> >
> >> -------------------- WARNING ---------------------
> >> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values
> >> were ("FGENESHT0000001||AC155633|570|4400|1","FGENESHT0000001||
> >> AC155633|570|4400|1","FGENESHT0000001||AC155633|570|4400|
> >> 1","","0","") FKs (1,)
> >> ERROR: duplicate key violates unique constraint
> >> "bioentry_accession_key"
> >> ---------------------------------------------------
> >
> > the identifier and accession number are set, so your SeqProcessor
> > scriptlet was executed (otherwise you'd also have seen a dynamic
> > loading error it e.g. you perl class could be not be found or loaded
> > by perl). If you still receive the duplicate key violation, then it
> > can only mean that indeed a sequence with the exact same accession
> > number was in the database already.
> >
> > There are different possibilities for why: you may have loaded the
> > same file before (use --lookup and related switches if you want to
> > update existing sequences), or your FASTA file contains multiple
> > sequences with the same ID, or you have a sequence with the same ID
> > in different FASTA files, if you are loading from more than one  
> file.
> > In either of the two latter cases, you will need to find a way to
> > disambiguate the IDs.
> >
> > BTW you also want to consider to parse the concatenated ID
> > 'FGENESHT0000001||AC155633|570|4400|1' apart into its component IDs,
> > and then use only one component. For example:
> >
> > my @ids = split(/|/,$seq->display_id);
> > $seq->accession_number($ids[0]);
> >
> > Obviously, this will only make for a nicer accession number, and not
> > solve your duplicate ID problem, as the latter is in the file(s) you
> > load.
> >
> > -hilmar
> >
> > On Jan 25, 2007, at 8:51 PM, George Heller wrote:
> >
> >> Hi Hilmar,
> >>
> >> I still seem to be having problems loading my fasta file. I wrote a
> >> new package, SeqProcessor.pm as below,
> >>
> >> package SeqProcessor::Accession;
> >> use strict;
> >> use vars qw(@ISA);
> >> use Bio::Seq::BaseSeqProcessor;
> >> use Bio::SeqFeature::Generic;
> >> @ISA = qw(Bio::Seq::BaseSeqProcessor);
> >> sub process_seq
> >> {
> >> my ($self, $seq) = @_;
> >> $seq->accession_number($seq->display_id);
> >> return ($seq);
> >> }
> >> 1;
> >> I have this file SeqProcessor.pm in my home directory, and I have
> >> set the PERL5LIB variable accordingly. When I run
> >> load_seqdatabase.pl,
> >>
> >> perl load_seqdatabase.pl -host localhost -dbname biodb -format
> >> fasta -dbuser postgres -driver Pg --
> >> pipeline="SeqProcessor::Accession" maize_pep.fasta
> >>
> >> I still get the error,
> >>
> >> Loading maize_pep.fasta ...
> >> -------------------- WARNING ---------------------
> >> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values
> >> were ("FGENESHT0000001||AC155633|570|4400|1","FGENESHT0000001||
> >> AC155633|570|4400|1","FGENESHT0000001||AC155633|570|4400|
> >> 1","","0","") FKs (1,)
> >> ERROR: duplicate key violates unique constraint
> >> "bioentry_accession_key"
> >> ---------------------------------------------------
> >> Could not store FGENESHT0000001||AC155633|570|4400|1:
> >> ------------- EXCEPTION -------------
> >> MSG: error while executing statement in
> >> Bio::DB::BioSQL::SeqAdaptor::find_by_unique_key: ERROR: current
> >> transaction is aborted, commands ignored until end of transaction
> >> block
> >> STACK  
> Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /
> >> home/akar/local/perl//Bio/DB/BioSQL/BasePersistenceAdaptor.pm:951
> >> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /
> >> home/akar/local/perl//Bio/DB/BioSQL/BasePersistenceAdaptor.pm:855
> >> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /home/akar/
> >> local/perl//Bio/DB/BioSQL/BasePersistenceAdaptor.pm:205
> >> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /home/akar/
> >> local/perl//Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254
> >> STACK Bio::DB::Persistent::PersistentObject::store /home/akar/ 
> local/
> >> perl//Bio/DB/Persistent/PersistentObject.pm:272
> >> STACK (eval) load_seqdatabase.pl:620
> >> STACK toplevel load_seqdatabase.pl:602
> >> --------------------------------------
> >> at load_seqdatabase.pl line 633
> >> Is there something I am missing?
> >>
> >> Thanks!
> >> George.
> >>
> >>
> >> Hilmar Lapp wrote:
> >> Hi George, sorry for the sluggish response, I was tied up during  
> the
> >> week. This is also why you always want to keep the thread on the
> >> list.
> >>
> >> Perl is an interpreted language, so no compilation is necessary.  
> The
> >> only thing you need to do is have the package in a place where perl
> >> can find it. The simplest way to achieve this is by setting the
> >> PERL5LIB environment variable:
> >>
> >> $ export PERL5LIB=/where/you/put/your/perl/package
> >>
> >> or if PERL5LIB was set already, you'd append it:
> >>
> >> $ export PERL5LIB=${PERL5LIB}:/where/you/put/your/perl/package
> >>
> >> I do assume that you didn't really add your code to the  
> SeqAdaptor.pm
> >> package - there is no necessity for nor benefit from that, and at
> >> worst (and quite likely) perl won't be able to find the package.  
> Note
> >> that there is plenty of documentation for how to write packages for
> >> perl and how to make them accessible to perl.
> >>
> >> Hth,
> >>
> >> -hilmar
> >>
> >> On Jan 8, 2007, at 11:52 PM, George Heller wrote:
> >>
> >>> Hi Hilmer.
> >>>
> >>> Thanks so much for the response. As I am new to Bioperl, I have
> >>> another question.
> >>>
> >>> I have made the changes as suggested by you, and have added the
> >>> code below to the SeqAdaptor.pm script.
> >>>
> >>> package SeqProcessor::Accession;
> >>> use strict;
> >>> use vars qw(@ISA);
> >>> use Bio::Seq::BaseSeqProcessor;
> >>> use Bio::SeqFeature::Generic;
> >>>
> >>> @ISA = qw(Bio::Seq::BaseSeqProcessor);
> >>>
> >>> sub process_seq
> >>> {
> >>> my ($self, $seq) = @_;
> >>> $seq->accession_number($seq->display_id);
> >>> return ($seq);
> >>> }
> >>>
> >>> Now that I have done my changes, do I need to compile or something
> >>> for the changes to reflect? If so, can you please let me know the
> >>> command for the same, or direct me to any lin that has
> >>> documentation for the same?
> >>>
> >>> Thanks so much for the help.
> >>> George.
> >>>
> >>> Hilmar Lapp wrote:
> >>> George,
> >>>
> >>> this is almost certainly caused by using FASTA format and  
> bioperl's
> >>> treatment of it. I am guilty of not having written a FAQ yet for
> >>> Bioperl-db, as this would certainly be there.
> >>>
> >>> Specifically, the Bioperl fasta SeqIO parser (load_seqdatabase.pl
> >>> uses Bioperl to parse sequence files) does not extract the  
> accession
> >>> number from the description line of the fasta sequence, and  
> instead
> >>> sets the accession_number property if sequence objects it  
> creates to
> >>> "unknown". Since there is a unique key constraint on
> >>> (accession,version,namespace) the second sequence loaded will  
> raise
> >>> an exception as it will violate the constraint.
> >>>
> >>> The simplest way to deal with this is to write a SeqProcessor that
> >>> massages the accession_number appropriately and then supply the
> >>> module to load_seqdatabase.pl using the --pipeline command line
> >>> switch.
> >>>
> >>> There are several examples for how to do this in the email  
> archives.
> >>> See for example this thread on the Biosql list:
> >>>
> >>> http://lists.open-bio.org/pipermail/biosql-l/2005-August/ 
> 000901.html
> >>>
> >>> with two links to examples, and Marc Logghe gives another one  
> in the
> >>> thread itself.
> >>>
> >>> Hth,
> >>>
> >>> -hilmar
> >>>
> >>> On Jan 8, 2007, at 3:17 PM, George Heller wrote:
> >>>
> >>>> Hi all.
> >>>>
> >>>> I am new to Bioperl and am trying to run the load_seqdatabase.pl
> >>>> script to load sequence data from a file into Postgres  
> database. I
> >>>> am invoking the script through the following command:
> >>>>
> >>>> perl load_seqdatabase.pl -host localhost -dbname biodb06 -format
> >>>> fasta
> >>>> -dbuser postgres -driver Pg
> >>>>
> >>>> I am getting the following error:
> >>>>
> >>>> -------------------- WARNING ---------------------
> >>>> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed,  
> values
> >>>> were ("FGENES
> >>>> HT0000001||AC155633|570|4400|1","FGENESHT0000001||AC155633|570|
> >> 4400|
> >>>> 1","unknown"
> >>>> ,"","0","") FKs (1,)
> >>>> ERROR: duplicate key violates unique constraint
> >>>> "bioentry_accession_key"
> >>>> ---------------------------------------------------
> >>>> Could not store unknown:
> >>>> ------------- EXCEPTION -------------
> >>>> MSG: error while executing statement in
> >>>> Bio::DB::BioSQL::SeqAdaptor::find_by_uni
> >>>> que_key: ERROR: current transaction is aborted, commands ignored
> >>>> until end of t
> >>>> ransaction block
> >>>> STACK
> >>>> Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key / 
> usr/
> >>>> lib/perl
> >>>> 5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:948
> >>>> STACK
> >> Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /
> >>>> usr/lib/perl5
> >>>> /site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:852
> >>>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/
> >>>> perl5/site_perl/5
> >>>> .8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:203
> >>>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/
> >> perl5/
> >>>> site_perl/5.
> >>>> 8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
> >>>> STACK Bio::DB::Persistent::PersistentObject::store /usr/lib/ 
> perl5/
> >>>> site_perl/5.8.
> >>>> 5/Bio/DB/Persistent/PersistentObject.pm:271
> >>>> STACK (eval) load_seqdatabase.pl:620
> >>>> STACK toplevel load_seqdatabase.pl:602
> >>>> --------------------------------------
> >>>> at load_seqdatabase.pl line 633
> >>>>
> >>>> Can anyone tell me how I can correct this error and get my script
> >>>> running? Thanks!!!
> >>>>
> >>>> George.
> >>>>
> >>>>
> >>>> __________________________________________________
> >>>> Do You Yahoo!?
> >>>> Tired of spam? Yahoo! Mail has the best spam protection around
> >>>> http://mail.yahoo.com
> >>>> _______________________________________________
> >>>> Bioperl-l mailing list
> >>>> Bioperl-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>>
> >>> --
> >>> ===========================================================
> >>> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> >>> ===========================================================
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> __________________________________________________
> >>> Do You Yahoo!?
> >>> Tired of spam? Yahoo! Mail has the best spam protection around
> >>> http://mail.yahoo.com
> >>
> >> --
> >> ===========================================================
> >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> >> ===========================================================
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> Looking for earth-friendly autos?
> >> Browse Top Cars by "Green Rating" at Yahoo! Autos' Green Center.
> >
> > --
> > ===========================================================
> > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> > ===========================================================
> >
> >
> >
> >
> >
> >
> >
> >
> > ---------------------------------
> > Check out the all-new Yahoo! Mail beta - Fire up a more powerful
> > email and get things done faster.
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> -- 
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================
>
>
>
>
>
>
>
> Don't pick lemons.
> See all the new 2007 cars at Yahoo! Autos.

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================







More information about the Bioperl-l mailing list