[Bioperl-l] SiteMatrix changes

skirov skirov at utk.edu
Thu Aug 31 11:57:51 EDT 2006


>===== Original Message From Sendu Bala <bix at sendu.me.uk> =====
>Stefan Kirov wrote:
>> Perhaps I do not understand your idea, but it seems to me the changes
>> you made to SiteMatrix are wrong. Why did you have to remove the
>> pseudo-counts? The correction can be set to 0 which will disable it
>> in case this is necessary. Pseudo counts are intended to account for
>> the probabilistic uncertainty.

>
>What has adding the number 1 to some but not all input numbers got to do
>with pseudo counts? Can you explain your thinking?
The code was:
    if ($self->{_corrected}) {
        ${$self->{probA}}[$i] += $self->{_correction};
        ${$self->{probC}}[$i] += $self->{_correction};
        ${$self->{probG}}[$i] += $self->{_correction};
        ${$self->{probT}}[$i] += $self->{_correction};
    }
Add 1 (or the user-supplied correction value) to any position that has 0.
Perhaps you are right (if I understood correctly) and 1 should be added to
everything if any position contains 0. I am not really sure about this.
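
To make the two options concrete, here is a rough sketch. The helper names are
made up for illustration and are not SiteMatrix methods; the counts are an
invented example column.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Bio::Matrix::PSM::SiteMatrix.

# Option 1: add the correction only to the counts that are zero.
sub correct_zeros_only {
    my ($correction, @counts) = @_;
    return map { $_ == 0 ? $_ + $correction : $_ } @counts;
}

# Option 2: if any count is zero, add the correction to every count.
sub correct_all_if_any_zero {
    my ($correction, @counts) = @_;
    my $has_zero = grep { $_ == 0 } @counts;
    return $has_zero ? (map { $_ + $correction } @counts) : @counts;
}

my @column = (8, 0, 1, 1);    # invented counts for A, C, G, T at one position
print join(' ', correct_zeros_only(1, @column)), "\n";         # 8 1 1 1
print join(' ', correct_all_if_any_zero(1, @column)), "\n";    # 9 1 2 2
```

With option 1 the unobserved C ends up with the same count as G and T, which
were each observed once; option 2 keeps the original ordering of the counts.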

>
>
>> On the other hand the correction should be disabled by default if
>> instead of raw count frequencies are used for the construction of the
>> object (still having 0 is a bad idea).
>
>Why is having 0 a bad idea?
Here is a Wikipedia explanation:
"In any observed data set or sample there is the possibility, especially with 
low-probability events and/or small data sets, of a possible event not 
occurring. Its observed frequency is therefore 0, implying a probability of 0. 
This is an oversimplification and is often unhelpful, particularly in 
probability-based machine learning techniques such as artificial neural 
networks and hidden Markov models."
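
A small sketch of what this means for a matrix, assuming log-odds scoring
against a uniform 0.25 background. The matrix values and the log_odds and
score helpers are invented for illustration and are not SiteMatrix code.

```perl
use strict;
use warnings;

# Hypothetical helper: log2-odds of a frequency against a background frequency.
# Perl's log() dies on 0, so a zero frequency is mapped to -infinity explicitly.
sub log_odds {
    my ($freq, $background) = @_;
    return -9**9**9 if $freq == 0;    # evaluates to negative infinity
    return log($freq / $background) / log(2);
}

# Invented frequencies for two positions; C was never observed at position 2.
my @matrix = (
    { A => 0.7, C => 0.1, G => 0.1,  T => 0.1  },
    { A => 0.5, C => 0.0, G => 0.25, T => 0.25 },
);

# Sum the per-position log-odds for a sequence as long as the matrix.
sub score {
    my ($seq) = @_;
    my @bases = split //, $seq;
    my $total = 0;
    for my $i (0 .. $#bases) {
        $total += log_odds($matrix[$i]{ $bases[$i] }, 0.25);
    }
    return $total;
}

printf "AA: %.3f\n", score('AA');    # finite score
print  "AC: ", score('AC'), "\n";    # negative infinity: the site is ruled out
```

A single zero is enough to reject a sequence outright, no matter how well the
other positions match.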

>It is correct if the user is creating a
>simple count-based matrix. I don't think the module should be trying to
>do any kind of analysis, especially given that it has no idea of the
>source of its input data. It must just accept what it is given. If a
>user or other module wants to do pseudo-count correction, they can do it
>themselves in the most appropriate way for their data.
You are wrong here: this gives the user an option, since the correction can be
disabled (which should be the case with frequencies). In most cases pseudo
counts are necessary, and that is why this should be the default behavior.
>
>I can't imagine that sometimes adding 1 is /ever/ an appropriate way of
>doing it, but please explain if it is.
This is a parameter, so it can be changed. Why 1? Search for Laplace's rule of
succession.
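
For reference, a sketch of the rule generalised to four outcomes: each estimate
is (count + alpha) / (total + 4 * alpha), with alpha = 1 for Laplace's rule and
alpha = 0 for no correction. The function name and the counts are assumptions
for illustration, not SiteMatrix code.

```perl
use strict;
use warnings;

# Hypothetical helper: additive smoothing of the counts in one matrix column.
# $alpha = 1 is the rule of succession; $alpha = 0 disables the correction
# (and then needs at least one nonzero count to avoid dividing by zero).
sub smoothed_frequencies {
    my ($alpha, %count) = @_;
    my $total = 0;
    $total += $_ for values %count;
    my $k = keys %count;    # number of possible outcomes (4 for DNA)
    return map { ( $_ => ($count{$_} + $alpha) / ($total + $alpha * $k) ) }
           keys %count;
}

my %freq = smoothed_frequencies(1, A => 8, C => 0, G => 1, T => 1);
printf "%s: %.3f\n", $_, $freq{$_} for sort keys %freq;
```

Here the unobserved C gets 1/14 rather than 0, and the four estimates still
sum to 1.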
>
>
>> Next, the rules you have enforced for the IUPAC do not make sense to
>> me. For example in case the frequency for A is 0.45, G 0.45, C 0.05
>> and T 0.05, according to your rules the result would be N, which makes
>> no sense.
>
>Why does that make no sense? IUPAC has no concept of frequencies, nor does
>it have a cutoff. When there is a chance of all four bases (complete ambiguity),
>the IUPAC code is N. If you want it to return 'R' in this case, the
>IUPAC method would need to be extended to allow input of a user-defined
>threshold defining what frequencies to ignore.
So are you saying that if A is 0.9999, C is 0.00002, G is 0.00004 and T is
0.00004 you would have N? Allowing user-supplied thresholds is not a bad
idea; you could implement it if you wish. But please do not fix something that
is not broken.
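
A sketch of how such a threshold could work. The function name, its argument
order and the threshold values are assumptions for illustration; this is not
the existing IUPAC method.

```perl
use strict;
use warnings;

# Standard IUPAC nucleotide codes, keyed by the sorted set of bases they cover.
my %IUPAC = (
    A   => 'A', C   => 'C', G   => 'G', T   => 'T',
    AC  => 'M', AG  => 'R', AT  => 'W',
    CG  => 'S', CT  => 'Y', GT  => 'K',
    ACG => 'V', ACT => 'H', AGT => 'D', CGT => 'B',
    ACGT => 'N',
);

# Hypothetical helper: bases with a frequency at or below $threshold are ignored.
sub iupac_with_threshold {
    my ($threshold, %freq) = @_;
    my $set = join '', grep { $freq{$_} > $threshold } sort keys %freq;
    # If nothing exceeds the threshold, fall back to full ambiguity.
    return exists $IUPAC{$set} ? $IUPAC{$set} : 'N';
}

my %pos = (A => 0.45, C => 0.05, G => 0.45, T => 0.05);
print iupac_with_threshold(0,    %pos), "\n";    # N: every base has some chance
print iupac_with_threshold(0.05, %pos), "\n";    # R: only A and G remain
```

A threshold of 0 reproduces the strict rule, so the default behavior would not
have to change for callers who want it.
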
Stefan




