[Bioperl-l] load_seqdatabase error with a specific locus from genbank

Hilmar Lapp hlapp at gmx.net
Mon Apr 6 11:39:50 EDT 2009

(Removing biosql-l from the cc list as this seems to be a problem with
BioPerl.)

Hi Johann,

I don't know whether anyone has responded to you yet - if not I'm  
sorry, I've been inundated for the past couple of weeks.

On Apr 1, 2009, at 6:14 AM, Johann PELLET wrote:

> With the latest version of BioPerl and BioSQL, I have tried to  
> insert entry from a GenBank file, which I have downloaded from the  
> NCBI website (648 937 records)

Could you be more specific? When you say the latest version of  
BioPerl, do you mean 1.6.1 or the current svn snapshot of the main trunk?

And which GenBank file is it? Is it one with only viruses, i.e., are
you specifically interested in the virus sequences that the parser is  
giving you trouble with?

> After successfully loading ncbi_taxonomy i am getting following  
> error message while loading sequences into database.
> perl load_seqdatabase.pl gb_03-2009 -format genbank -driver Pg -dbname biosql
> --------------------- WARNING ---------------------
> MSG: The supplied lineage does not start near 'Human papillomavirus  
> type 2c' (I was supplied 'Human papillomavirus - 2 |  
> Alphapapillomavirus | Papillomaviridae')

This is a problem in the BioPerl genbank parser, or more specifically,  
in the species parser.

I thought though this was fixed in 1.6.1; are you sure you don't have  
an older version of BioPerl lying around that could accidentally have  
been used?
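
One way to check for a stray copy, as a rough sketch: ask Perl which
Bio::Root::Version it actually loads, and list every directory in @INC that
holds one. This assumes the `perl` on your PATH (and your PERL5LIB) is the
same one that runs load_seqdatabase.pl; the function names are made up for
illustration.

```shell
#!/bin/sh
# Report the version and file location of the BioPerl that perl loads first.
bioperl_info() {
    perl -MBio::Root::Version -e '
        print "version: ", $Bio::Root::Version::VERSION, "\n";
        print "loaded from: ", $INC{"Bio/Root/Version.pm"}, "\n";
    ' 2>/dev/null || echo "BioPerl not found in the default @INC"
}

# List every @INC directory that contains a BioPerl copy; more than one
# line of output means an older install could be shadowing the newer one.
bioperl_copies() {
    perl -e '
        for my $dir (@INC) {
            print "$dir\n" if -e "$dir/Bio/Root/Version.pm";
        }
    ' 2>/dev/null || echo "could not run perl"
}

bioperl_info
bioperl_copies
```

If the loader script is run with extra -I options or a different PERL5LIB,
the same options would need to be passed to these one-liners for the answer
to be meaningful.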

That said, it only seems to be a warning; did you check how the record
ended up in the database and find it to be incomplete or messed up?

> the script is not stopped until this entry: S67864

This is a later entry, not the same entry that causes the problem above.

> --------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::LocationAdaptor (driver) failed,  
> values were ("1","19)","1","3") FKs (41914,<NULL>)
> ERROR:  invalid input syntax for integer: "19)"

Oops - that's a problem that must originate from the BioPerl feature  
location parser.

The full record is here: http://www.ncbi.nlm.nih.gov/nuccore/544772

Does anyone see why the location parser should have a problem with the  
first gene feature? It's nested, and has remote location components,  
but at first sight nothing jumps out at me as extraordinary. Has  
someone recently changed the location parsing code? If no-one has an  
immediate idea what could be at work here, this needs investigating.
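
To take BioSQL out of the picture while investigating, something along these
lines might help: parse only the offending record with Bio::SeqIO and print
what the location parser produced for each feature. This is a sketch rather
than a tested recipe; `dump_locations` is a made-up name, and the file name
and accession are simply the ones from the report above.

```shell
#!/bin/sh
# Print primary tag, start, end and the re-serialized location string for
# every (sub-)location of every feature of one record in a GenBank file.
# Usage: dump_locations gb_03-2009 S67864
dump_locations() {
    # $1: GenBank flat file, $2: accession number of the record to inspect
    if ! perl -MBio::SeqIO -e 1 2>/dev/null; then
        echo "Bio::SeqIO could not be loaded" >&2
        return 1
    fi
    perl -MBio::SeqIO -e '
        my ($file, $acc) = @ARGV;
        my $in = Bio::SeqIO->new(-file => $file, -format => "genbank");
        while (my $seq = $in->next_seq) {
            next unless $seq->accession_number eq $acc;
            for my $feat ($seq->get_SeqFeatures) {
                # each_Location yields the components of split locations
                for my $loc ($feat->location->each_Location) {
                    printf "%s\t%s\t%s\t%s\n", $feat->primary_tag,
                        $loc->start, $loc->end, $loc->to_FTstring;
                }
            }
            last;
        }
    ' "$1" "$2"
}
```

If a start or end column shows a value such as `19)` here, the problem is in
the parser itself and not in the BioSQL persistence layer. Running it against
a file holding just the one record saved from the NCBI page would avoid
scanning the whole download.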

: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
