[Bioperl-l] Question about the definition of 'gaps' in blast -m8 output...

Phillip San Miguel pmiguel at purdue.edu
Tue Mar 24 07:53:10 EDT 2009

Dan Bolser wrote:
> 2009/3/19 Phillip San Miguel <pmiguel at purdue.edu>:
>> Dan Bolser wrote:
>>> 2009/3/18 Phillip San Miguel <pmiguel at purdue.edu>
>>>> Dan Bolser wrote:
>>>>> Can someone clarify the definition of the 'gaps' column in the blast -m8
>>>>> output format for me?
>>>>> I thought that the column 'gaps' was basically the number of columns
>>>>> in the HSP that contain a gap character.
>>>> Hi Dan,
>>>> "gaps", to me, denotes the number of gaps, not the total length of
>>>> all the gaps.
>>>> Just my interpretation, but given your results my guess is that whoever
>>>> wrote blastall was thinking the way I do.
>>> Yeah, I'll have to go look at the HSPs to confirm this... I'm just
>>> surprised that there are not more gaps of length >1, i.e. my data
>>> (given your interpretation) suggests that 90% of the HSPs have no
>>> gaps of length >1.
>> Sounds about right. Depends on how you have gap opening vs gap lengthening
>> parameters set.
> I see. I thought that by default extension was less than opening, so I
> had expected there to be more gaps of length >1 ... anyway... where
> can I read more about selecting parameters for certain tasks?
> Currently I'm blasting tomato against potato sequence, and the two
> organisms are known to be 'highly syntenic' - I'm just not sure how
> that translates into how I should set the parameters. I'm after large
> alignments of large regions of the chromosome. My thinking is to just
> run through the list of HSPs and merge based on gap / window size
> (dynamic programming style) - that way I can play with the set of HSPs
> that I have, and look at the effect of different settings, then I can
> just globally align the matching regions using SW (if I need to). Does
> that sound reasonable, or is using the default settings just dumb?
> Cheers,
> Dan.
Hi Dan,
    Sorry, I didn't mean to suggest that your settings were the only
reason you were seeing a preponderance of single-base indels. I do
expect single-base indels to outnumber longer indels.
    Nevertheless, I have always thought that standard alignment tools
should not use a linear gap extension penalty: past some point,
extending a gap further should be "discounted". Maybe the gap extension
penalty should be the log of the number of bases extended?
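
A minimal Python sketch of that idea, comparing a linear extension
charge with a logarithmic one. The function names and the penalty
values (5 to open, 2 to extend) are illustrative assumptions, not
anything taken from blastall, and both functions are normalised so that
a gap of length 1 costs exactly the opening penalty:

```python
import math


def linear_gap_cost(length, gap_open=5.0, gap_extend=2.0):
    """Opening charge plus a fixed charge for every base after the first."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def log_gap_cost(length, gap_open=5.0, gap_extend=2.0):
    """Opening charge plus a charge that grows with log(length).

    log(1) == 0, so a single-base gap costs the same as in the linear
    scheme; longer gaps are progressively "discounted".
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * math.log(length)


# Print the two costs side by side for a few gap lengths.
for gap_length in (1, 2, 5, 10, 100):
    print(gap_length,
          linear_gap_cost(gap_length),
          round(log_gap_cost(gap_length), 2))
```

Under these assumptions the two schemes agree for a single-base gap and
diverge quickly after that, so long indels would be penalised far less
by the logarithmic form.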
    BTW, I just noticed that blastall's '-m 9' output includes the
column headers. They are:

# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score

So, the column in question is "gap openings".
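
To make the distinction concrete, here is a rough Python sketch that
reads one tab-separated -m8 line using the twelve field names listed
above, and that counts gap openings and gapped columns separately for
a pair of aligned strings. The names parse_m8_line and gap_stats are
hypothetical helpers, and the sample line and sequences are made up for
illustration:

```python
import re

# Column names follow the '-m 9' header, rewritten as identifiers.
M8_FIELDS = [
    "query_id", "subject_id", "pct_identity", "alignment_length",
    "mismatches", "gap_openings", "q_start", "q_end",
    "s_start", "s_end", "evalue", "bit_score",
]
INT_FIELDS = ("alignment_length", "mismatches", "gap_openings",
              "q_start", "q_end", "s_start", "s_end")
FLOAT_FIELDS = ("pct_identity", "evalue", "bit_score")


def parse_m8_line(line):
    """Turn one tab-separated -m8 line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(M8_FIELDS):
        raise ValueError("expected %d columns, got %d"
                         % (len(M8_FIELDS), len(values)))
    record = dict(zip(M8_FIELDS, values))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    for key in FLOAT_FIELDS:
        record[key] = float(record[key])
    return record


def gap_stats(aligned_query, aligned_subject):
    """Return (gap_openings, gap_columns) for two aligned strings.

    A gap opening is one uninterrupted run of '-' in either sequence;
    gap_columns is the total number of '-' characters.
    """
    openings = 0
    columns = 0
    for seq in (aligned_query, aligned_subject):
        runs = re.findall(r"-+", seq)
        openings += len(runs)
        columns += sum(len(run) for run in runs)
    return openings, columns


# Made-up example data, not real BLAST output.
sample = "query1\tsubject1\t98.50\t200\t2\t1\t1\t200\t301\t499\t1e-100\t380"
print(parse_m8_line(sample)["gap_openings"])
print(gap_stats("ACGT--ACGTA-CGT", "ACGTTTACG-AACGT"))
```

In the toy alignment there are three gap runs covering four columns, so
a "gap openings" column would report 3 where a "gapped columns" count
would report 4.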
