[Bioperl-l] SiteMatrix changes
hlapp at gmx.net
Thu Aug 31 13:48:49 EDT 2006
On Aug 31, 2006, at 12:47 PM, Sendu Bala wrote:
> skirov wrote:
>> Add 1 (or the user supplied correction
>> value) to any position that has 0. Perhaps you are right (if I
>> understood correctly) and 1 should be added to everything if any
>> position contains 0. I am not really sure abut this.
> If you're going to do the correction, you always do it, not just when
> one of the positions contains 0.
That's mostly true. As the wikipedia piece indicated, you need pseudo-
counts mostly in applications of probabilistic methods to count data
for which at least one of the events has a low (but non-zero)
likelihood of being observed. That is so because when you compute
joint probabilities and only a single probability of many is zero,
then the joint probability is automatically zero too, which is seldom
if ever correct or helpful.
Adding the same pseudo-count to each event count in a Bayesian sense
is like starting with a uniform prior distribution over all events
and then adjust the likelihood estimates based on observed data. If
there are many observations (i.e., N(obs) >> N(events)), the
contribution of the pseudo-counts (i.e., prior distribution) to the
resulting estimates will be negligible. If there are few
observations, the prior (non-informative) distribution will still be
visible in the frequency estimates (reflecting that the amount of
data then doesn't support extreme estimates).
You may add non-uniform pseudo-counts to reflect a non-uniform prior
distribution. However, having zero in a pseudo-count slot means the
prior probability for that event is exactly zero, which would be a
gap in the distribution and I'm not sure you would ever desire that.
BTW just to state the obvious, when computing the frequency estimate,
your sum of pseudo-counts must be added in the denominator.
As for applying pseudo-counts by default, I would certainly not do
this. For people who don't read the documentation (which would
document the pseudo-count option) and feed the module with
insufficient data, I think it is always better to return obviously
non-sensical results, than return results that at least on the
surface look sensible but require interpretation, since with using
pseudo-counts you need to know how much data you have seen in order
to disambiguate whether the estimates you are seeing truly reflect
your observations or rather mirror the influence of the pseudo-counts.
>>> IUPAC has no concept of frequencies or have a cutoff. When there is
>>> a chance of all four bases (complete ambiguity), the IUPAC code is
>>> N. If you want it to return 'R' in this case, the IUPAC method
>>> would need to be extended to allow input of a user-defined
>>> threshold defining what frequencies to ignore.
>> So are you saying that if A is 0.9999, C is 0.00002, G is 0.00004 and
>> T is 0.00004 you would have N???
> Yes, because that is correct. There's no way to judge automatically
> frequencies should be ignored, unless we know exactly how and if
> count correction was done. Not guessing and being correct is better
> guessing to give 'nicer' answers but sometimes being completely wrong
> (and always being strictly wrong). Really, if you want a nice IUPAC
> string you must simply chose not to do pseudo count correction.
>> Allowing customer supplied thresholds is not a bad idea, you could
>> implement it if you wish. But please do not fix something that is not
> It was giving the wrong answers, so needed fixing. I'll add the
> threshold option.
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
More information about the Bioperl-l