[Bioperl-l] Bio::SeqIO::swiss species parsing bug?

David Gonzalez gonzaled at tcd.ie
Fri Aug 17 13:03:35 EDT 2007


	Hi,

	I had a problem with a swissprot file in which the genus and species
were being left undefined, and I believe it could be a bug in the
swiss.pm module.


	When I tried to parse the file with Bio::SeqIO, I got the following
error messages:

Use of uninitialized value in pattern match (m//) at
/sw/lib/perl5/5.8.6/Bio/SeqIO/swiss.pm line 965, <GEN0> line 12.
Use of uninitialized value in string eq at
/sw/lib/perl5/5.8.6/Bio/SeqIO/swiss.pm line 967, <GEN0> line 12.

	The fields I wanted from the file (gene_id, etc.) were fine, however,
so the file was being parsed.

	I checked the output with Data::Dumper and I found the following in the
species entry; the species is left undefined, and the common name is absent.

	'species' => bless( {
	                      '_ncbi_taxid' => 'Not',
	                      '_classification' => [
	                                             undef,
	                                             undef,
	                                             'Aedes',
	                                             'Culicini',
	                                             'Culicinae',
	                                             'Culicidae',
	                                             'Culicoidea',
	                                             'Nematocera',
	                                             'Diptera',
	                                             'Endopterygota',
	                                             'Neoptera',
	                                             'Pterygota',
	                                             'Insecta',
	                                             'Hexapoda',
	                                             'Arthropoda',
	                                             'Metazoa',
	                                             'Eukaryota'
	                                           ]
	                    }, 'Bio::Species' ),

	The species line in the file is formatted according to the Swiss-Prot
specifications and includes a common name:

OS   Aedes aegypti (yellow fever mosquito)
OC   Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota; Neoptera;
OC   Endopterygota; Diptera; Nematocera; Culicoidea; Culicidae; Culicinae;
OC   Culicini; Aedes.
OX   NCBI_TaxID=Not defined;

	I think the problem is at line 905 of the swiss.pm file:

902	if(/^OS\s+(\S.+)/ && (! defined($binomial))) {
903	    $osline .= " " if $osline;
904	    $osline .= $1;
905	    if($osline =~ s/(,|, and|\.)$//) {
906	        ($binomial, $descr) = $osline =~ /(\S[^\(]+)(.*)/;
907	        ($ns_name) = $binomial;
908	        $ns_name =~ s/\s+$//; #####


	The problem seems to be that this OS line has no trailing punctuation,
so the substitution at line 905 returns false and the block that sets
$binomial is never entered. I think the Swiss-Prot format does not require
the line to end in '.', although it normally does. After simply removing the
requirement that the substitution succeed, the Data::Dumper output looked
normal:

	....
	'_common_name' => 'yellow fever mosquito',
	'_ncbi_taxid' => 'Not',
	'_classification' => [
	                       'aegypti',
	                       'Aedes',
	                       'Culicini',
	....
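
	The behaviour described above can be reproduced without BioPerl. The
following is only a minimal sketch, not the actual swiss.pm code: the helper
name parse_os_line and the hash keys it returns are hypothetical, and it
assumes the OS text has already been joined into a single string. It runs the
same substitution as line 905, but treats the trailing punctuation as optional
instead of using it as the condition for parsing.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: split the text of an OS line
# into genus, species and common name.
sub parse_os_line {
    my ($osline) = @_;

    # Strip trailing punctuation when present, but do not require it.
    $osline =~ s/(,|, and|\.)$//;

    # Same capture as line 906 of the excerpt: everything before the
    # first '(' is the binomial, the remainder is the description.
    my ($binomial, $descr) = $osline =~ /(\S[^\(]+)(.*)/;
    return unless defined $binomial;
    $binomial =~ s/\s+$//;

    my ($genus, $species) = split /\s+/, $binomial, 2;
    my ($common) = $descr =~ /\(([^)]+)\)/;

    return { genus => $genus, species => $species, common_name => $common };
}

my $os = 'Aedes aegypti (yellow fever mosquito)';

# The original guard: the substitution finds nothing to remove, so it
# evaluates to false and the parsing block would be skipped.
my $copy  = $os;
my $guard = ($copy =~ s/(,|, and|\.)$//) ? 1 : 0;
print "original guard succeeded: $guard\n";

# The tolerant version parses the line either way.
my $parsed = parse_os_line($os);
print "$parsed->{genus} $parsed->{species} ($parsed->{common_name})\n";
```

	The same helper gives the same result for an OS line that does end in
'.', so making the punctuation optional should not change how conforming
entries are handled in this sketch.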

	I am using the fink installed bioperl:
	bioperl-pm586   1.4-5   Perl module for biology

	I don't know whether this has been reported or fixed in newer versions
of bioperl.

	David

-- 
David Gonzalez Knowles
Smurfit Institute of Genetics
Trinity College
Dublin

