[Bioperl-l] Question about the definition of 'gaps' in blast -m8 output...
dan.bolser at gmail.com
Wed Mar 18 15:31:01 EDT 2009
I'm sure this question comes up again and again, but searching the BioPerl
mailing list didn't turn up any answers (to the second question). Basically
I want to manually merge HSP's into 'contigious hits', and I want to look at
the effect of various parameters on an algorithm to do this. This task
prompted me to run a 'sanity check' on the blast data that I had, and I
found that this check fails to fulfil my expectation of the data. This means
that either I don't understand the data or the results are buggy.
Can someone clarify the definition of the 'gaps' column in the blast -m8
output format for me?
I thought that the column 'gaps' was basically the number of columns in the
HSP that contains a gap character. To test this on my data, I checked the
GAPS + 2 =
LENGTH - abs(QUERY_END - QUERY_START) + LENGTH - abs(HIT_END - HIT_START)
This says that the number of GAPS should be equal to the difference between
the LENGTH of the alignment minus the distance between the START and END
point on either the QUERY or the HIT (+2 for the 'off by one' error
introduced by the two END-START calculations).
10-> MMMMMMMM**MMMM*M <-22
|||| || | | |
20-> MMMM**MMMMM*M*MM <-31
where MISMATCHES = 7, LENGTH = 16, QUERY_END - QUERY_START = 12, and HIT_END
- HIT_START = 11. The formula gives:
7+2(9) = 16-12(4) + 16-11(5)
The formula is correct for 11,282 out of 12,745 HSPs in my dataset (89%),
however it fails for 1,463 cases (11%). Each of these cases has a value of
MISMATCHES smaller than calculated by the formula. The difference is usually
1 or 2, but is seen to go as high as 96, and scales roughly linearly with
the size of GAPS.
Did I misunderstand what the value of GAPS is supposed to mean? How come it
does apparently mean what I thought for so much of the data?
Thanks very much for any help on the above.
More information about the Bioperl-l