[Bioperl-l] SiteMatrix changes
skirov at utk.edu
Thu Aug 31 15:35:42 EDT 2006
>===== Original Message From Sendu Bala <bix at sendu.me.uk> =====
>>>> IUPAC has no concept of frequencies or have a cutoff. When there is
>>>> a chance of all four bases (complete ambiguity), the IUPAC code is
>>>> N. If you want it to return 'R' in this case, the IUPAC method
>>>> would need to be extended to allow input of a user-defined
>>>> threshold defining what frequencies to ignore.
>>> So are you saying that if A is 0.9999, C is 0.00002, G is 0.00004 and
>>> T is 0.00004 you would have N???
>>> Yes, because that is correct.
>> No it is not- IUPAC is an approximation of a PFM, which is very different
>> what you are doing.
>The IUPAC-IUB nomenclature for incompletely specified bases in nucleic
>acid sequences is just that and no more. They were not trying to
>'approximate a PFM'. It is a nomenclature for describing uncertainty.
>If you tell me that you are uncertain what the nucleic acid is by giving
>me a non-zero probability of having all nucleic acids, the appropriate
>IUPAC symbol is 'N'.
Hmm, so would you apply this rule for standard consensus as well?
>>> There's no way to judge automatically what
>>> frequencies should be ignored, unless we know exactly how and if psuedo
>>> count correction was done. Not guessing and being correct is better than
>>> guessing to give 'nicer' answers but sometimes being completely wrong
>>> (and always being strictly wrong). Really, if you want a nice IUPAC
>>> string you must simply chose not to do pseudo count correction.
>> Sure, when you create the object add -correction=>0.
>Right, so you're happy with getting IUPAC strings of all 'N' unless
>-correction => 0 has been used? When a -correction is requested, the
>only way to get non Ns should be to supply a threshold. There could be a
>default threshold, but what would it be?
Good question. I have come up with some intuitive values that have worked for
me (my specific problem at that time anyway) and could be changed... After all
IUPAC is used mainly for simplified representation, not analysis...
>>> Allowing customer supplied thresholds is not a bad idea, you could
>>> implement it if you wish. But please do not fix something that is not
>>> It was giving the wrong answers, so needed fixing. I'll add the
>>> threshold option.
>> No it was not giving wrong answers, the error was not in this function, so
>> did not need fixing.
>If you revert just _calculate_consensus() and _to_IUPAC() to the
>previous implementation, test 34 on line 144 of t/psm.t gives the answer:
>having parsed t/data/transfac.dat
>That file has the matrix:
>P0 A C G T
>01 1 2 2 0
>02 2 1 2 0
>03 3 0 1 1
>04 0 5 0 0
>05 5 0 0 0
>06 0 0 4 1
>07 0 1 4 0
>08 0 0 0 5
>09 0 0 5 0
>10 0 1 2 2
>11 0 2 0 3
>12 1 0 3 1
>The correct answer here is:
>To call positions 6 and 7 Gs is wrong. This is the problem with having a
>default threshold. It is dangerous in sometimes giving the wrong answer
>and the user would never realize.
I would not call the answer wrong (or right). The behavior here could be
easily fixed by inversing the logic by searching for the most relaxed
combination that surpasses the given threshold first.
>If we revert all my changes the answer given is:
I would say that this is the right answer in this case.
>If you think you can fix this bug another way (that isn't specific to
>Bio::Matrix::PSM::IO::transfac), I'll revert the changes and create a
>bug report for you.
Sounds OK with me. To summarize:
1. Correction is disabled by default.
2. Correction should be applied to all positions.
3. Thresholds for IUPAC consensus can be user defined.
4. A fix for IUPAC consensus calculation: change the defaukt behavior.
5. Document the options
Does this sounds right?
More information about the Bioperl-l