I am trying to reconcile gene trees with species trees, and to do this I 
need the species names to be the same in both cases. The gene trees come 
from a clustering of GenBank coding sequences, and the species trees come 
from the NCBI taxonomy. When I use BioPerl to extract the species
info from GenBank entries, though, it only seems possible to get the
first three words of the ORGANISM line, which Bio::Species treats as
genus, species, and subspecies. In several cases, such as the example
below, the ORGANISM line contains more than three words. I suspect this
means either that the subspecies name uses more than one word or that
these entries break the GenBank format. However, the names appear in the
same form in the NCBI taxonomy names.dmp file.

The problem seems to be in Bio::SeqIO::genbank->_read_GenBank_Species().
There is a special condition there for viruses (the whole of the ORGANISM
info is put onto the classification array), but the examples I have are
for chordates (other groups may be affected too).
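
One workaround I can imagine is to bypass Bio::Species and read the
ORGANISM line straight from the flat file, so that the full name can be
compared with names.dmp. Below is a minimal sketch in Python (not
BioPerl), assuming the organism name fits on the single ORGANISM line and
does not wrap onto the next line. The function name organism_name and the
record fragment are made up for illustration; the fragment is not one of
my real entries.

```python
import re


def organism_name(genbank_text):
    """Return the full text of the first ORGANISM line, or None if absent."""
    for line in genbank_text.splitlines():
        # ORGANISM is an indented sub-keyword of SOURCE in the flat file.
        match = re.match(r"\s+ORGANISM\s+(.*\S)\s*$", line)
        if match:
            # Keep every word, not just the first three.
            return match.group(1)
    return None


# Made-up record fragment, for illustration only (not a real GenBank entry).
record = """\
SOURCE      Examplus fictus subsp. longus variant
  ORGANISM  Examplus fictus subsp. longus variant
            Eukaryota; Metazoa; Chordata.
"""

print(organism_name(record))
```

This only sidesteps the parsing; it does not tell me whether the extra
words belong to the subspecies or to something else.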

I'd be really grateful for any comments on the best thing for me to do.



