[Bioperl-l] Problems with "tblastx"-like output and GFF

Neil Saunders neil.saunders at unsw.edu.au
Tue Nov 2 20:49:32 EST 2004

Dear all,

I am working on ways to visualise conserved gene clusters (or gene 
synteny) in microbial genomes.  Basically, what I want to do is (1) 
compare a set of contigs to 1 or more finished genomes using one or all 
of blast, blat and MUMmer, (2) convert the output to GFF including 
'match' lines and (3) visualise in Generic Genome Browser.  Along the 
way I've come across a few problems to share with you - I'm 
cross-posting to bioperl and gbrowse lists.

For comparing genomes in this way a "tblastx-like" approach is best, as 
proteins are more conserved than DNA.  This requires use of:

(1) tblastx
(2) blat with "t=dnax -q=dnax" (which I'll call "blatx")
(3) promer from the MUMmer package

So far so good.  Problems arise when parsing these outputs to generate 
GFF.  My problems are:

(1) Bio::Tools::Blat.
When I use this module to parse "blatx" output (just following the 
example in the module docs), I get errors like so:

Argument "++" isn't numeric in numeric ne (!=) at 
/usr/local/share/perl/5.8.4/Bio/Location/Atomic.pm line 170, <GEN0> line 

"Blatx" has a strand column containing both query and target strand 
(e.g. "++") and it seems Blat.pm is not happy with this?  This looks 
like an easy fix.

(2) Bio::SearchIO::psl
This module gives me errors such as:

Use of uninitialized value in pattern match (m//) at 
/usr/local/share/perl/5.8.4/Bio/SearchIO/psl.pm line 176, <GEN1> line 

I'm not entirely sure what to make of this, as the psl file looks OK.

(3) search2gff and blast2gff
I've used the search2gff script from bioperl and the blast2gff script 
from gbrowse.  What I found was:

- they both work fine on blastn output
- search2gff with the "--match" switch works fine on blastn output, but 
gives errors on tblastx output
- blast2gff gives similar errors to search2gff when trying to construct 
a match line from tblastx output

The errors when trying to create "match" lines look like this:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Undefined sub-sequence (2400764,2400765). Valid range = 2400764 - 
STACK: Error::throw
STACK: Bio::Root::Root::throw 
STACK: Bio::Search::HSP::HSPI::matches 
STACK: Bio::Search::SearchUtils::_adjust_contigs 
STACK: Bio::Search::SearchUtils::tile_hsps 
STACK: Bio::Search::Hit::GenericHit::strand 
STACK: /usr/bin/bp_search2gff.pl:177

and the values in parentheses after "Undefined sub-sequence" always 
differ by one (as above, 2400764,2400765).

I had some discussion with Jason S. about this issue some months ago and 
we concluded there was some complicated problem in the internals of 

What all this boils down to is that "tblastx-like" output is quite hard 
to deal with.  I have taken to writing my own scripts where I:

1) store GFF lines for single HSPs in an array
2) loop through and create a hash of hashes of arrays with keys query_id 
+ target_id and an array of starts/ends
3) sort the array and use the first+last values to generate the "match" 

This works OK (I've done this for promer too), but is a bit basic (e.g. 
"match" lines are not given scores).  I have an example of how this 
looks for promer at:


If anyone is interested, I can provide output files and sample scripts 
and perhaps we can work this issue.  I use:

Debian Linux (unstable)
Latest CVS bioperl and gbrowse
BLAST 2.2.8
MUMmer 3.15
BLAT v. 17

thanks for your ideas,
 School of Biotechnology and Biomolecular Sciences,
 The University of New South Wales,
 Sydney 2052,


More information about the Bioperl-l mailing list