Copyright © 2005 jsd
For any µ and ν, the joint probability that µ takes on the value x and ν takes on the value y will be denoted P(µ:x, ν:y).
The conditional probability that µ takes on the value x, given that ν takes on the value y, will be denoted P(µ:x | ν:y).
Similarly, the conditional probability that ν takes on the value y, given that µ takes on the value x, will be denoted P(ν:y | µ:x).
Note that P(µ:3 | ν:7) is a completely different concept from P(ν:3 | µ:7) ... so we can see why the conventional notation of just writing P(µ | ν) is abominable: it leads you to write things like P(3 | 7) which is hopelessly ambiguous.
The unconditional probability that µ takes on the value x is denoted P(µ:x).
Some basic identities include:

	P(µ:x, ν:y) = P(µ:x | ν:y) · P(ν:y)
	            = P(ν:y | µ:x) · P(µ:x)                          (1)

	P(ν:y | µ:x) = P(µ:x | ν:y) · P(ν:y) / P(µ:x)                (2)

where equation 2 is called the Bayes inversion formula.
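As a sanity check, the inversion formula can be verified numerically on a small made-up joint distribution. The numbers below are arbitrary and the helper names are mine; this is just a sketch:

```python
# A tiny two-valued example; the joint probabilities are made up.
# joint[(x, y)] stands for P(mu:x, nu:y).
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def P_mu(x):   # marginal probability P(mu:x)
    return sum(p for (xx, _), p in joint.items() if xx == x)

def P_nu(y):   # marginal probability P(nu:y)
    return sum(p for (_, yy), p in joint.items() if yy == y)

for (x, y), p in joint.items():
    direct   = p / P_mu(x)                        # P(nu:y | mu:x), by definition
    inverted = (p / P_nu(y)) * P_nu(y) / P_mu(x)  # via the Bayes inversion formula
    assert abs(direct - inverted) < 1e-12
print("Bayes inversion verified on all four cells")
```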
In the case of a Poisson distribution, suppose we know (based on an ensemble of measurements) that the expected count in a certain interval is x, i.e. the variable µ takes on the value x. Then the probability that in a given trial we observe y counts in the interval is:
	P(ν:y | µ:x) = e^(−x) x^y / y!                               (3)
It is an easy and amusing exercise to verify that this formula is normalized, so that

	∑y P(ν:y | µ:x) = 1   for all x                              (4)

as it should be.
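The normalization can also be checked numerically; here is a minimal sketch (the helper name poisson_pmf is mine):

```python
import math

def poisson_pmf(y, x):
    """P(nu:y | mu:x): Poisson probability of observing y counts
    when the expected count is x."""
    return math.exp(-x) * x**y / math.factorial(y)

# Sum enough terms that the neglected tail is negligible.
for x in (0.5, 3.0, 10.0):
    total = sum(poisson_pmf(y, x) for y in range(100))
    print(f"x = {x}: sum over y = {total:.12f}")   # 1.000000000000 in each case
```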
Also by way of terminology, note that the term likelihood is defined to mean the a priori probability, namely P(data | model). In the data-analysis business, maximum-likelihood methods are exceedingly common ... even though this is almost never what you want. We don’t need a formula that tells the probability of the data; we already have the data! What we want is a formula that allows us to infer the model, given the data. That is, we want a maximum-a-posteriori (MAP) estimator, based on P(model | data).
This is not a solution to the general problem, but only an example ... sort of a warm-up exercise.
Suppose we have a molecule that can be in one of three states, called Bright, Fringe, and Dark. The probability of finding the molecule in each of these states is b, f, and d respectively. Similarly, in each of these states, the variable µ takes on the values B, F, and D respectively. That is,

	P(µ:B) = b
	P(µ:F) = f                                                   (5)
	P(µ:D) = d

where those are the prior probabilities, i.e. the probabilities we assign before we have looked at the data. Looking at it another way, these prior probabilities tell us where the molecule is “on average”, and we will use the data to tell us where the molecule is at a particular time.
So now let’s look at the data, and see what it can tell us.
Using the Bayes inversion formula (equation 2), we can write

	P(µ:B | ν:y) = P(ν:y | µ:B) · P(µ:B) / P(ν:y)   for all y    (6)

Next, we plug equation 3 into the RHS, expanding P(ν:y) as a sum over the three states (the factors of 1/y! cancel), to obtain:

	P(µ:B | ν:y) = e^(−B) B^y b / [e^(−B) B^y b + e^(−F) F^y f + e^(−D) D^y d]    (7)

where this is the posterior probability, i.e. the probability we infer after having seen that ν takes on the value y. (If there were more than three states, you could easily generalize this expression by adding more terms in the denominator.)
You can check that this expression is well behaved. A simple exercise is to check the limit where B, F, and D are all equal; we see that the data doesn’t tell us anything. Another nice simple exercise is to consider the case where B is large and the other two rates are small. Then if y is small, we have overwhelming evidence against the Bright state ... and conversely when y is large, the probability of the Bright state closely approaches 100%.
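Those two limits are easy to check numerically. Here is a sketch of equation 7; the function name and the particular numbers are mine:

```python
import math

def posterior_bright(y, B, F, D, b, f, d):
    """P(mu:B | nu:y) for the three-state Poisson model (equation 7).
    The common factor of 1/y! cancels between numerator and denominator."""
    num = math.exp(-B) * B**y * b
    den = (math.exp(-B) * B**y * b
           + math.exp(-F) * F**y * f
           + math.exp(-D) * D**y * d)
    return num / den

# Equal rates: the observation is uninformative, so the posterior
# equals the prior b, whatever y is.
print(round(posterior_bright(5, 2.0, 2.0, 2.0, 0.3, 0.3, 0.4), 6))   # 0.3

# B large, F and D small: a large count y is overwhelming evidence
# for the Bright state.
print(round(posterior_bright(20, 20.0, 0.5, 0.5, 1/3, 1/3, 1/3), 6))  # 1.0
```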
From the keen-grasp-of-the-obvious department: The functional form of P(ν:y | µ:B) (equation 3) is noticeably different from the functional form of P(µ:B | ν:y) (equation 7). The latter is what you want to optimize by adjusting the fitting parameters, given the data.
If there are multiple independent observations, the overall posterior probability will have multiple factors of the form given in equation 7.
If we want something that can be summed over all observations and minimized, take the negative of the logarithm of equation 7.
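As a sketch of that recipe (the function name is mine, and the model is the three-state example above), the quantity to minimize over the fitting parameters is the sum, over observations, of the negative log of the posterior factor:

```python
import math

def neg_log_posterior_bright(ys, B, F, D, b, f, d):
    """Sum over independent observations of -log P(mu:B | nu:y),
    i.e. the negative log of the product of posterior factors.
    Summing logs avoids underflow when many factors are multiplied."""
    total = 0.0
    for y in ys:
        num = math.exp(-B) * B**y * b
        den = (math.exp(-B) * B**y * b
               + math.exp(-F) * F**y * f
               + math.exp(-D) * D**y * d)
        total += -math.log(num / den)
    return total

# With equal rates the data is uninformative, so each factor is just
# the prior b, and the sum is -N log(b).
print(neg_log_posterior_bright([1, 2, 3], 2.0, 2.0, 2.0, 0.5, 0.25, 0.25))  # 3*ln 2 ≈ 2.079
```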
If (big if!) the data is normally distributed, then you can compute the negative log likelihood by simply summing the squares of the residuals. This is the central idea behind the widely used (wildly abused) “least squares” method. Least squares is unsuitable for our purposes for at least two reasons: first, we want MAP, not maximum likelihood; second, our data is nowhere near normally distributed.
We want least log improbability, not least squares.