
Copyright © 2015 jsd

Sampling Error / Bernoulli Process / Random Walk
John Denker

Contents

1  Example: NPR versus Bernoulli
1.1  Original NPR Story
1.2  A First Step in the Right Direction
1.3  Mahalanobis Distance
2  A Model System
3  Derivation of some Key Formulas
3.1  Multi-Dimensional Random Walk
3.2  Sampling and Polling
3.2.1  Example: 60:40 Coin Toss
3.2.2  Example: 49-49-2 Polling
3.3  Probability of Seeing Zero
3.4  Lopsided Error Bars
4  Philosophical and Pedagogical Remarks
5  Mutations
5.1  Systematic Error
5.2  Possible Origin of the Bogus «Sampling Error» Numbers
6  References

1  Example: NPR versus Bernoulli

1.1  Original NPR Story

Recently, on the NPR web site, I saw an article that said in part:

We’ve put together a chart showing how the candidates stack up against each other [...] and how much their margins of error overlap. [...] there’s a +/- 3.8-percentage-point margin of error, so we created bars around each estimate showing that margin of error.

The article (reference 1) supported this point using the graphic in figure 1.

Figure 1: Polling Results, with Alleged Error Bars

I don’t know whether to laugh or cry when I see so-called “error bars” like that. The 3.8% is calculated based on the number of people surveyed ... using a formula that has no basis in mathematics, no basis in reality of any kind. Oddly enough, the article links to a document (reference 2) that contains better formulas, but the article nevertheless draws the wrong conclusion, saying “those margins might be a little smaller at the low end of the spectrum” when in fact the margins are drastically smaller.

1.2  A First Step in the Right Direction

This discussion uses basic probability ideas, as discussed in reference 3.

The whole concept of “error bars” is problematic whenever we are dealing with more than one variable, especially when we know the variables are correlated, extra-especially when the underlying distribution is non-Gaussian. However, let’s temporarily pretend we didn’t notice that, and try to make some modest improvements within the error-bar framework.

The black error bars in figure 2 show a somewhat more reasonable analysis of the situation. There is some overlap of the error bars involving 10th place and 11th place, but there’s nothing special about that; there have been close contests throughout history, for as long as there have been contests. However, my point is, in this case the error bars do not overlap between 9th place and 11th place. Also the error bars do not overlap between 10th place and 12th place. Perhaps more significantly, the error bars for 8th place do not come anywhere near overlapping the error bars for 5th place.

Figure 2: Polling Results, with Better Error Bars

The 17th row in figure 2 shows the ever-popular “none of the above” option. It is worth paying some attention to this, if we want our probabilities to add up to 100%.

The error bars are not symmetrical, because they take into account the fact that we don’t know the underlying probability in the global population, and must estimate it from the data. Figure 2 uses the formulas calculated in section 3.4.

Disclaimer: This is not a 100% industrial-strength professional analysis. In particular, it does not take into account correlations. Also it does not properly take into account the prior (aka base rate) on the model parameters. Also it does not account for systematic error (as discussed in section 5.1.) On the other hand, it does use error bars that are in the right ballpark, and it shows how lopsided error bars can arise when estimating model properties from the data.

1.3  Mahalanobis Distance

Figure 3 shows how expensive it would be to make a pairwise switch in the rankings, starting from the data in figure 1. Specifically, for each pair of candidates, we start by finding how far apart they are in the standings. We then imagine erasing that split, by moving one of them up by half the split, while moving the other down the same amount.

We leave all other candidates unchanged, because that is the cheapest way to erase the difference.

This maneuver moves us to a less-likely place on the cumulative probability distribution. We express this as a distance, namely the Mahalanobis distance, as defined in equation 2. In other words, we are measuring how far the two candidates are apart, measured in units of the standard deviation, i.e. measured relative to the width of the distribution. We use the full covariance matrix to measure the width in the relevant direction, in the relevant subspace, i.e. the subspace that involves taking from one and giving to the other directly.

In the figure, the area of each bubble is proportional to the Mahalanobis distance. Red coloration indicates that the distance is greater than 3 units. In accordance with the usual formula for the Gaussian cumulative probability distribution, three units covers 99.87% of the normal distribution.

Figure 3: Pairwise Mahalanobis Distances

This permits a nice solution to the original political problem. Suppose you think it makes sense a priori to have the six leading candidates on stage ... based on their true standings in the overall population. You cannot directly observe the true standings, so you have to rely on the poll results instead. If you include only the six top-polling candidates, there is some chance that you will exclude one of the guys who should be included. You can see this by reading across the sixth row of the table (or, equivalently, reading down the sixth column). After you accept the top six candidates, there are three more within a not-too-large Mahalanobis distance from the sixth-place guy. That’s a problem, but it can easily be fixed by including those three extra guys, namely the ones that are circled in the figure. Consider them a buffer. There is a very high probability that the six top-true candidates are among the nine top-polling candidates.

Let’s be clear: You can’t pick six guys and be sure they are the true top six. You can however pick nine guys and be very very confident that they include the true top six. Similarly you can pick ten guys and be very confident that they include the true top eight, and super-confident that they include the true top seven. Putting more people on the debate stage would benefit the low-ranking candidates at the expense of the high-ranking candidates. It is reasonable – indeed necessary – to require people to earn a place on the stage. There are plenty of ways of setting a threshold in such a way that one can say with high confidence that anyone below the threshold did not earn a place.

The data in figure 3 is shown in numerical form in table 1.

   0     0     0     0     0   1.9   2.7   3.6   4.6     6   8.1    12    12    12    26    26   4.8
   0     0     0     0     0   1.9   2.7   3.6   4.6     6   8.1    12    12    12    26    26   4.8
   0     0     0     0     0   1.9   2.7   3.6   4.6     6   8.1    12    12    12    26    26   4.8
   0     0     0     0     0   1.9   2.7   3.6   4.6     6   8.1    12    12    12    26    26   4.8
   0     0     0     0     0   1.9   2.7   3.6   4.6     6   8.1    12    12    12    26    26   4.8
 1.9   1.9   1.9   1.9   1.9     0  0.72   1.5   2.4   3.6   5.2   8.4   8.4   8.4    18    18   7.2
 2.7   2.7   2.7   2.7   2.7  0.72     0  0.79   1.7   2.8   4.3     7     7     7    15    15   8.2
 3.6   3.6   3.6   3.6   3.6   1.5  0.79     0  0.87   1.9   3.3   5.7   5.7   5.7    13    13   9.5
 4.6   4.6   4.6   4.6   4.6   2.4   1.7  0.87     0     1   2.3   4.4   4.4   4.4    10    10    11
   6     6     6     6     6   3.6   2.8   1.9     1     0   1.2     3     3     3   7.5   7.5    13
 8.1   8.1   8.1   8.1   8.1   5.2   4.3   3.3   2.3   1.2     0   1.6   1.6   1.6   4.8   4.8    17
  12    12    12    12    12   8.4     7   5.7   4.4     3   1.6     0     0     0   2.2   2.2    25
  12    12    12    12    12   8.4     7   5.7   4.4     3   1.6     0     0     0   2.2   2.2    25
  12    12    12    12    12   8.4     7   5.7   4.4     3   1.6     0     0     0   2.2   2.2    25
  26    26    26    26    26    18    15    13    10   7.5   4.8   2.2   2.2   2.2     0     0    50
  26    26    26    26    26    18    15    13    10   7.5   4.8   2.2   2.2   2.2     0     0    50
 4.8   4.8   4.8   4.8   4.8   7.2   8.2   9.5    11    13    17    25    25    25    50    50     0
Table 1: Pairwise Mahalanobis Distances

Disclaimer: This is not a 100% industrial-strength professional analysis. In particular, it does not take into account the fact that the distribution is not Gaussian. Also it does not account for systematic error (as discussed in section 5.1.) On the other hand, it does use error bars that are in the right ballpark, and it shows how correlations can give rise to a distribution that is elongated in some directions, and rotated relative to the “natural” variables.

2  A Model System

Suppose there are two candidates, X and Y. Consider a scenario where some oracle has told us a priori that exactly 49% of the population favors X, and 49% favors Y. The remaining 2% we call Z, no matter whether that represents a third candidate or undecided or "none of the above" or whatever.

We now sample this distribution. Each sample consists of asking N=50 people at random. Since this is a rather small sample, there will be substantial "sampling error".

We have three observable variables, namely the sampled X, sampled Y, and sampled Z. However, they span a space with only two dimensions, because there is a constraint: No matter what, X+Y+Z has to add up to 100%.

There are powerful techniques for visualizing three variables in two dimensions. A good place to start is a ternary plot, as explained in reference 4.

I did a Monte Carlo simulation of the XYZ scenario. Figure 4 shows a scatter plot of the results.

Figure 4: Scatter Plot: Candidates X, Y and Z

We can learn a tremendous amount from this plot.

As expected, the data points are most densely clustered in the vicinity of [X, Y, Z] = [.49, .49, .02].

Pollsters have a nasty habit of saying that their numbers have a "sampling error" equal to 1/√N (where N is the sample size). In our example, 1/√50 = 14%.

Looking at figure 4, we know that 14% uncertainty is just ridiculous. Each little triangle in the plot is 10 percentage points on a side.

It may be that you can afford to run the poll in the field only one time. That corresponds to just one dot in the scatter plot in figure 4. Constructive suggestion: Use Monte Carlo to simulate the experiment. That gives you lots of dots, at a very low cost. Without that, it’s often next to impossible to figure out the statistics.
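To make that concrete, here is a minimal sketch of such a simulation, in C++ (not the production code of reference 6). It simulates the 49-49-2 scenario with N=50 respondents per poll, and accumulates the mean and covariance of the observed fractions; the printed matrix should land close to equation 5 below.

    // Minimal Monte Carlo sketch of the 49-49-2 scenario.
    // Each simulated poll asks N=50 respondents; each respondent picks
    // X, Y, or Z with probability 0.49, 0.49, 0.02.  We accumulate the
    // mean and covariance of the observed fractions.
    #include <cstdio>
    #include <random>

    int main() {
        const int N = 50;        // respondents per poll
        const int M = 100000;    // number of simulated polls
        std::mt19937 rng(42);
        std::discrete_distribution<int> pick({0.49, 0.49, 0.02});

        double sum[3] = {}, sum2[3][3] = {};
        for (int m = 0; m < M; m++) {
            int tally[3] = {};
            for (int n = 0; n < N; n++) tally[pick(rng)]++;
            double a[3];
            for (int i = 0; i < 3; i++) a[i] = double(tally[i]) / N;
            for (int i = 0; i < 3; i++) {
                sum[i] += a[i];
                for (int j = 0; j < 3; j++) sum2[i][j] += a[i] * a[j];
            }
        }
        printf("covariance matrix of the observed fractions:\n");
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++)
                printf("  %+.6f", sum2[i][j]/M - (sum[i]/M)*(sum[j]/M));
            printf("\n");
        }
        return 0;
    }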

Not-very-constructive warning: If you think in terms of “error bars on X” and “error bars on Y” you are guaranteed to get the wrong answer ... and you won’t even know how grossly wrong it is.

Marginally-constructive suggestion: Look at the covariance matrix. This is one of the necessary (but not sufficient) things you must do if you want to talk about uncertainty with any pretense of professionalism. The formula for a multi-dimensional Gaussian can be written

dP(R) = exp(− dM² / 2)
             (1)

where dP is the probability density distribution, and dM is the Mahalanobis distance, which is defined in terms of the covariance matrix Σ:

dM² := ΔRᵀ Σ⁻¹ ΔR
             (2)

The R vector is the independent variable in the problem at hand. For the example discussed in this section and in section 3.2.2, the components of R are the X, Y, Z variables we used to define the problem. That is:

R = [X, Y, Z]ᵀ
             (3)

As always, Rᵀ denotes the transpose of R. If R is a column vector, Rᵀ is a row vector. As a corollary, this means that Aᵀ B is the dot product A·B (for any two vectors A and B). Also, Σ⁻¹ denotes the matrix inverse of Σ ... or if necessary, the pseudoinverse.
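In code, equation 2 is nearly a one-liner. Here is a sketch using the armadillo package (reference 5); the function name mahalanobis is just a label, and the pseudoinverse stands in for Σ⁻¹, since (as discussed below) the covariance matrix of interest is singular.

    // Sketch of equation 2, assuming the armadillo package (reference 5).
    // We use the pseudoinverse in place of the ordinary inverse, since
    // the covariance matrix of interest here turns out to be singular.
    #include <armadillo>
    #include <cmath>

    double mahalanobis(const arma::vec& dR, const arma::mat& Sigma) {
        arma::mat Sinv = arma::pinv(Sigma);                     // inverse (or pseudoinverse) of Sigma
        return std::sqrt(arma::as_scalar(dR.t() * Sinv * dR)); // sqrt(dR' Sinv dR), per equation 2
    }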

Equation 1 is the natural generalization of the familiar expression for a one-dimensional Gaussian:

dP(x) = exp(− Δx² / σ² / 2)
             (4)

In simple cases, the covariance matrix will be more-or-less diagonal using whatever variables you have chosen. Then each diagonal element is the variance i.e. σ² for the corresponding variable. In such a case, congratulations; you can continue using those variables without too much hassle. However, alas, the problems that land on my desk are almost never simple. In the present case, the covariance matrix is a mess. There are tremendous off-diagonal elements. These indicate correlations. Indeed, the correlations are so bad that the matrix is singular, and Σ⁻¹ does not even exist, strictly speaking.

I said that looking at the covariance matrix is "marginally" constructive because (by itself) it provides little more than a warning. It’s like a sign that says you are in the middle of a minefield. You know you’ve got a big problem, but can’t solve it without additional skill and effort.

Singular Value Decomposition sometimes offers you a way out of the minefield. It is especially useful if the data is Gaussian distributed ... and it might provide some qualitative hints even in non-Gaussian situations.

SVD is one of those things (rather like a Fourier transform) that you cannot easily do by hand ... but you can compute it, and then verify that it is correct, and then understand a lot of things by looking at it. In particular, given the SVD, it is trivial to invert the matrix: keep the same eigenvectors, and take the arithmetical reciprocal of each nonzero eigenvalue.
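Here is a sketch of that recipe, again assuming armadillo. For a real symmetric matrix such as Σ, the SVD agrees (up to signs) with the eigen-decomposition, so eig_sym does the job; the tolerance for declaring an eigenvalue “zero” is an arbitrary choice, and dropping those directions yields the pseudoinverse.

    // Sketch: inverting a symmetric covariance matrix via its
    // decomposition.  Keep the eigenvectors; take the reciprocal of
    // each nonzero eigenvalue; zero out the null directions, which
    // yields the pseudoinverse.
    #include <armadillo>

    arma::mat invert_via_eig(const arma::mat& Sigma, double tol = 1e-12) {
        arma::vec eigval;
        arma::mat eigvec;
        arma::eig_sym(eigval, eigvec, Sigma);   // Sigma = V diag(lambda) V'
        arma::vec recip = eigval;
        for (arma::uword i = 0; i < eigval.n_elem; i++)
            recip(i) = (eigval(i) > tol) ? 1.0/eigval(i) : 0.0;  // flip lambda, keep V
        return eigvec * arma::diagmat(recip) * eigvec.t();
    }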

Specifically, SVD will give you the eigenvectors and the corresponding eigenvalues of the covariance matrix Σ. In our example, the raw covariance matrix is:




Σ = [  0.004991    −0.004797    −0.0001944 ]
    [ −0.004797     0.004995    −0.0001985 ]
    [ −0.0001944   −0.0001985    0.0003929 ]
             (5)

If you unwisely ignore the off-diagonal elements and take the diagonal elements (i.e. the variances) to be the squares of the error bars, you get

0.07065,    0.07068,    0.01982  
             (6)

For two of the variables we (allegedly) have 7% uncertainty. This is half of what pollsters conventionally quote. Beware that 7% is wrong, as discussed below. Meanwhile 14% is conceptually wrong, for a different reason ... unless you are willing to make three mistakes that cancel each other out. As a separate matter, note that the alleged uncertainty on Z is only 2%. That’s not as crazy as 14%, but it’s not exactly right, either.

The eigenvectors of Σ are:




[ −0.7069      −0.4085    0.5774 ]
[  0.7073      −0.408     0.5774 ]
[ −0.0003135    0.8165    0.5774 ]
             (7)

and the corresponding eigenvalues are

0.00979,  0.0005894,  1.275e−14,  
             (8)

The third eigenvalue is actually zero. Because of roundoff errors it is represented here by a super-tiny floating-point number. The zero eigenvalue corresponds to a zero-length error bar in a certain direction. It tells us that if we change X by itself, keeping Y and Z constant, we violate the constraint that X+Y+Z must add up to 1. Ditto for changing Y or Z by itself. This corresponds to moving in the third dimension in the ternary plot, i.e. leaving the plane of the paper. It is not allowed. In accordance with equation 4, any movement at all in the direction of a zero-length error bar will cause the probability to vanish.

The first eigenvector is the "cheap" one. You know this because it corresponds to the large eigenvalue of Σ (and hence the small eigenvalue of Σ⁻¹). To a good approximation, it represents increasing X at the expense of Y (and/or vice versa) along a contour of constant Z. You know this by looking at the eigenvector components in the first column. SVD is giving us a lot of useful information.

Taking the square root of the eigenvalue, we find that the error bar is 9.9% in the cheap direction ... not 7% and not 14% but smack in between (the geometric average). Meanwhile the calculation suggests an error bar of 2.4% in the other direction, i.e. increasing Z at the expense of X and Y equally. Alas, this is not as useful as one might have hoped, because the distribution is grossly non-Gaussian in this direction. Starting from the peak of the distribution, we can move several times 2.4% toward the north, but we cannot move even one times 2.4% toward the south.

Overall, we can get a fairly decent handle on what’s going on using a combination of Monte Carlo, ternary plots, and Singular Value Decomposition.

3  Derivation of some Key Formulas

The goal is to understand the statistics of public-opinion polling. In particular, we want to know how much uncertainty attaches to the results, purely as a result of sampling error. (We ignore the all-too-real possibility of methodological problems and other systematic errors.)

Rather than attacking that problem directly, let’s start by doing a closely-related problem, namely the random walk (section 3.1). Once we have figured that out, the solution to the original problem is an easy corollary (section 3.2).

3.1  Multi-Dimensional Random Walk

A random walk is reasonably easy to visualize. The mathematics is not very complicated.

The walk involves motion in D dimensions. At each step, the walker moves a unit distance in one of the D possible directions; there are no diagonal moves. The probability of a step in direction i is pᵢ, for all i from 0 through D−1 inclusive.

The situation is easier to visualize if we imagine some definite not-very-large value of D, but mathematically the value of D isn’t very important. We can always imagine an extra-large value for D, and just set pᵢ=0 for each of the extra dimensions.

This pᵢ represents the underlying probability in the global population. In any particular walk, we do not get to observe pᵢ directly.

Each walk consists of N steps. To keep track of the walker’s position along the way, we use the vector variable x which has components xᵢ for all i from 0 to D−1. More specifically, we keep track of x(L) as a function of L, where L is the number of steps in the walk, from L=0 to L=N.

To define what we mean by sampling error, we must imagine an ensemble of such walks. There are M elements in the ensemble.
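Here is a sketch that generates data of the same flavor as table 2 below, watching a single dimension of M=8 independent walks of N=25 steps each. The step probability p=0.44 is an assumed value, chosen for illustration; the probability underlying the original table is not stated.

    // Sketch: M=8 independent biased random walks of N=25 steps,
    // watching one dimension.  p=0.44 is an assumed step probability.
    #include <cstdio>
    #include <random>

    int main() {
        const int N = 25, M = 8;
        const double p = 0.44;                  // assumed step probability
        std::mt19937 rng(1);
        std::bernoulli_distribution step(p);

        int x[M] = {};                          // every walker starts at x=0
        for (int L = 1; L <= N; L++) {
            for (int m = 0; m < M; m++) x[m] += step(rng) ? 1 : 0;
            printf("%3d:", L);
            for (int m = 0; m < M; m++) printf(" %3d", x[m]);
            printf("\n");
        }
        return 0;
    }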

Table 2 shows the position data x(L) for eight random walks. This shows what happens for only one of the dimensions; the table is silent as to what is happening in the other D−1 dimensions.

walk        #1   #2   #3   #4   #5   #6   #7   #8
length
   0         0    0    0    0    0    0    0    0
   1         0    0    0    0    0    0    1    1
   2         0    1    1    0    0    0    2    1
   3         1    2    1    1    0    0    3    1
   4         1    2    2    2    1    1    4    2
   5         1    3    3    2    1    1    4    2
   6         2    3    4    3    2    1    5    3
   7         2    3    4    4    2    1    6    3
   8         2    4    4    5    2    2    6    3
   9         2    4    4    6    2    2    7    4
  10         3    4    5    7    2    3    7    4
  11         4    5    6    8    3    4    8    5
  12         4    5    6    9    4    4    8    6
  13         4    5    7    9    5    5    9    7
  14         5    6    8    9    6    5   10    7
  15         5    6    9    9    6    5   10    7
  16         5    6    9    9    6    6   10    7
  17         6    7    9    9    6    7   10    7
  18         6    8   10   10    7    7   10    7
  19         7    8   11   10    7    8   11    8
  20         7    8   11   10    7    9   12    9
  21         7    8   11   10    7    9   12   10
  22         7    9   11   11    7    9   13   11
  23         8    9   11   12    7   10   14   12
  24         9    9   11   13    7   10   14   13
  25        10    9   11   13    7   11   14   13
Table 2: Eight Random Walks

Figure 5 is a plot of the same eight random walks.

Figure 5: Eight Random Walks

We now focus attention on the x-values on the bottom row of table 2. When we need to be specific, we will call these the x(N)-values, where in this case N=25. It turns out to be remarkably easy to calculate certain statistical properties of the distribution from which these x-values are drawn.

The mean of the x(N) values would be nonzero, if we just averaged over the eight random walks in table 2. On the other hand, it should be obvious by symmetry that the mean of the x(N)-values is zero, if we average over the infinitely-large ensemble of possible random walks with the given population-base probabilities pi.

We denote the average over the large ensemble by angle brackets ⟨⋯⟩. It must be emphasized that each column in table 2 is one element of the ensemble. The ensemble average is an average over these columns, and many more columns besides. It is emphatically not an average over rows.

Consider the Lth step in the random walk, i.e. the step that takes us from x(L−1) to x(L). Denote this step by Δx. For each i, there is a probability pᵢ that Δxᵢ is +1 and a probability (1−pᵢ) that Δxᵢ is 0.

The calculation goes like this:

xᵢ(L) = xᵢ(L−1) + Δxᵢ     (9a)
⟨xᵢ(L)⟩ = ⟨xᵢ(L−1)⟩ + pᵢ     (9b)
        = L pᵢ     (9c)

To obtain the last line, we used the fact that xᵢ(0) is zero, and used induction on L. We see that on average, each step of the walk moves a distance pᵢ in the ith direction.

To understand the spread in the distribution, let’s look at some second-order statistics. We define a new variable

y(L) := x(L) − L p
             (10)

that measures how far our random walker is away from the ensemble-average location.

⟨yᵢ(L) yⱼ(L)⟩ = ⟨(yᵢ(L−1) + Δxᵢ − pᵢ) (yⱼ(L−1) + Δxⱼ − pⱼ)⟩
             = ⟨yᵢ(L−1) yⱼ(L−1)⟩
               + ⟨yᵢ(L−1) (Δxⱼ − pⱼ)⟩ + ⟨yⱼ(L−1) (Δxᵢ − pᵢ)⟩
               + ⟨(Δxᵢ − pᵢ) (Δxⱼ − pⱼ)⟩
⟨yᵢ(L) yⱼ(L)⟩ = ⟨yᵢ(L−1) yⱼ(L−1)⟩ + ⟨(Δxᵢ − pᵢ) (Δxⱼ − pⱼ)⟩
             (11)

To derive the last line, we used the fact that ⟨Δxᵢ − pᵢ⟩ is zero ... and that Δx is independent of y(L−1), so the cross terms vanish.

Let’s start by considering the case where i=j. Then the result of equation 11 simplifies to

⟨yᵢ(L)²⟩ = ⟨yᵢ(L−1)²⟩ + ⟨(Δxᵢ − pᵢ)²⟩     (12a)
         = ⟨yᵢ(L−1)²⟩ + (1−pᵢ)² pᵢ + (0−pᵢ)² (1−pᵢ)     (12b)
         = ⟨yᵢ(L−1)²⟩ + pᵢ (1−pᵢ)     (12c)
         = L pᵢ (1−pᵢ)     (12d)

To get to equation 12b we used the fact that Δxᵢ is 1 with probability pᵢ, and is 0 with probability (1−pᵢ). To get to equation 12d, we used the fact that ⟨yᵢ(0)²⟩ is zero, and used induction on L. This gives us the mean-square distance of the ith coordinate from its average value. (The “mean” in this mean-square distance is the ensemble average.)

Equation 12d is related to a famous result. If we divide by L², it gives us the variance of a generalized Bernoulli process. Note that a simple Bernoulli process is a model of a coin toss, with two possible outcomes (not necessarily equiprobable). A generalized Bernoulli process is a model of a multi-sided die, with D possible outcomes (not necessarily equiprobable).

A generalized Bernoulli distribution is sometimes called a categorical distribution or a multinoulli distribution.

For the off-diagonal case, i.e. i≠j, there are three possibilities: the step is in direction i (Δxᵢ=1, Δxⱼ=0), with probability pᵢ; the step is in direction j (Δxᵢ=0, Δxⱼ=1), with probability pⱼ; or the step is in some other direction (Δxᵢ=0, Δxⱼ=0), with probability (1−pᵢ−pⱼ).

We can say the same thing in mathematical terms, as follows:

⟨yᵢ(L) yⱼ(L)⟩ = ⟨yᵢ(L−1) yⱼ(L−1)⟩ + ⟨(Δxᵢ − pᵢ) (Δxⱼ − pⱼ)⟩     (13a)
             = ⟨yᵢ(L−1) yⱼ(L−1)⟩ + (1 − pᵢ) (0 − pⱼ) pᵢ
               + (0 − pᵢ) (1 − pⱼ) pⱼ
               + (0 − pᵢ) (0 − pⱼ) (1 − pᵢ − pⱼ)     (13b)
             = ⟨yᵢ(L−1) yⱼ(L−1)⟩ − pᵢ pⱼ     (13c)
             = − L pᵢ pⱼ     (13d)

To obtain equation 13d, we once again used induction on L.
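As a sanity check, here is a short Monte Carlo sketch that compares the ensemble statistics of simulated walks against equation 12d and equation 13d. The probabilities {0.5, 0.3, 0.2} are arbitrary, chosen just for illustration.

    // Monte Carlo sanity check of equation 12d and equation 13d:
    // simulate many N-step walks with D=3, then compare the ensemble
    // variance and covariance of y(N) against L p_i (1-p_i) and
    // -L p_i p_j, with L=N.
    #include <cstdio>
    #include <random>

    int main() {
        const int N = 25, M = 200000;
        const double p[3] = {0.5, 0.3, 0.2};    // assumed, for illustration
        std::mt19937 rng(7);
        std::discrete_distribution<int> step({p[0], p[1], p[2]});

        double s00 = 0, s01 = 0;
        for (int m = 0; m < M; m++) {
            int x[3] = {};
            for (int L = 0; L < N; L++) x[step(rng)]++;
            double y0 = x[0] - N*p[0], y1 = x[1] - N*p[1];
            s00 += y0 * y0;
            s01 += y0 * y1;
        }
        printf("<y0^2>  = %8.4f   expect %8.4f\n", s00/M,  N*p[0]*(1-p[0]));
        printf("<y0*y1> = %8.4f   expect %8.4f\n", s01/M, -N*p[0]*p[1]);
        return 0;
    }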

3.2  Sampling and Polling

It is conventional and reasonable to report the results of a poll in terms of the average over all responses. This differs from a random walk, which we formulated in terms of a total tally (rather than an average).

To obtain the relevant average, all we need to do is set L=N in equation 9c and then divide both sides by N. That is, we define the vector

a := x(N) / N
             (14)

and therefore in accordance with equation 9c

⟨a⟩ = p
             (15)

To measure the spread relative to the mean, we apply a similar factor of 1/N to equation 10, which gives us:

b := y(N) / N          
  = a − p
             (16)

The variance (i.e. the diagonal part of the covariance) is

⟨bᵢ²⟩ = pᵢ (1−pᵢ) / N
             (17)

and the rest of the covariance is

⟨bᵢ bⱼ⟩ = − pᵢ pⱼ / N
             (18)

When the probability is near 50%, equation 17 yields a familiar formula for the standard deviation:

σ = 0.5/√N    (for p=0.5)
             (19)

This can be compared with equation 40.

3.2.1  Example: 60:40 Coin Toss

Consider the case where there are only two dimensions, i.e. D=2. This corresponds to an ordinary coin toss. Now suppose the coin is bent, so that at each step the two possible outcomes are unevenly distributed, with a 60:40 ratio. Then the covariance matrix is exactly:

Σ = [ p₀ (1−p₀)    −p₀ p₁     ]
    [ −p₁ p₀       p₁ (1−p₁)  ]  ÷ N

  = [  0.24   −0.24 ]
    [ −0.24    0.24 ]  ÷ N
             (20)

This Σ is called the covariance matrix. The elements on the diagonal are called the variances. The square root of the variance is called the standard deviation. You can see that the standard deviation is σᵢ = √(pᵢ (1−pᵢ)/N). The usual moderately-naïve practice is to set the so-called error bars equal to the standard deviations determined in this way. This is better than assuming 0.5/√N and vastly better than assuming 1.0/√N ... but still not ideal, because the whole idea of “error bars” is flawed.

In two dimensions, the numbers are particularly simple because there are only two probabilities, and p₁ = (1−p₀). Therefore the diagonal elements pᵢ (1−pᵢ) and the off-diagonal elements −pᵢ pⱼ all have the same magnitude.

This is one of those rare cases where you can do the SVD in your head. The normalized eigenvectors and the corresponding eigenvalues are

eigenvectors:   (1/√2) [  1 ]    and    (1/√2) [ 1 ]
                       [ −1 ]                  [ 1 ]

eigenvalues:    2 pᵢ (1−pᵢ) / N    and    0
             (21)

You can see that for a fluctuation in the “cheap” direction, the error bar is longer by a factor of √2 compared to what you would expect by naïvely ignoring correlations. In the “expensive” direction, the error bar is shorter by a factor of infinity.

3.2.2  Example: 49-49-2 Polling

Consider a public opinion poll where there are three candidates. X and Y each have 49% support, while Z has the remaining 2%. This is the scenario depicted in figure 4. The covariance matrix is:

Σ = [  0.2499   −0.2401   −0.0098 ]
    [ −0.2401    0.2499   −0.0098 ]  ÷ N
    [ −0.0098   −0.0098    0.0196 ]
             (22)

The unit eigenvectors are:

[ −0.7071 ]    [ −0.4082 ]    [ 0.5774 ]
[  0.7071 ] ,  [ −0.4082 ] ,  [ 0.5774 ]
[  0      ]    [  0.8165 ]    [ 0.5774 ]
             (23)

And the corresponding eigenvalues are

    0.0098,     0.000588,    1.873e−19
             (24)

So the oriented error bars are:

[ −0.07 ]    [ −0.009899 ]    [ 2.499e−10 ]
[  0.07 ] ,  [ −0.009899 ] ,  [ 2.499e−10 ]
[  0    ]    [  0.0198   ]    [ 2.499e−10 ]
             (25)

The first error bar corresponds to a purely horizontal motion in figure 4, while the second corresponds to a purely vertical motion.
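None of this needs to be done by hand. The following sketch (again assuming the armadillo package, reference 5) reproduces equation 23, equation 24, and equation 25, starting from the covariance matrix of equation 22 with N=50.

    // Sketch: eigen-decomposition of the 49-49-2 covariance matrix.
    // eig_sym returns eigenvalues in ascending order, so the columns
    // come out in the reverse of the order shown in equation 23.
    #include <armadillo>
    #include <algorithm>
    #include <cmath>

    int main() {
        const double N = 50;
        arma::mat Sigma = { { 0.2499, -0.2401, -0.0098},
                            {-0.2401,  0.2499, -0.0098},
                            {-0.0098, -0.0098,  0.0196} };
        Sigma /= N;

        arma::vec eigval;
        arma::mat eigvec;
        arma::eig_sym(eigval, eigvec, Sigma);
        eigval.print("eigenvalues:");              // ~0, 0.000588, 0.0098
        eigvec.print("eigenvectors (columns):");

        // Oriented error bars: each eigenvector scaled by sqrt(eigenvalue).
        for (arma::uword k = 0; k < 3; k++) {
            arma::vec bar = std::sqrt(std::max(eigval(k), 0.0)) * eigvec.col(k);
            bar.print("error bar:");
        }
        return 0;
    }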

3.3  Probability of Seeing Zero

Suppose we survey N respondents, and none of them report being in favor of candidate X. We assume X actually has some nonzero favorability p in the global population, and we missed it due to sampling error. Let’s calculate the probability of that happening.

When a respondent does not favor X, we call it a miss. The chance of the first respondent missing X is (1−p). There is some chance of getting N misses in a row, which we denote θ. We can calculate:

θ = (1−p)^N
             (26)

As a numerical example, suppose that X has a true population-based probability p equal to half a percent, and suppose the sample-size is N=600. Then in 5% of the samples, there will be zero observations of X. For larger sample sizes, the probability of completely missing X goes exponentially to zero, but when p is small, N has to get quite large before the exponential really makes itself felt.

By turning the algebraic crank, we can solve equation 26 for N. That gives us:

N = ln(1/θ) / p
             (27)

assuming p is small compared to unity. Equation 27 is a moderately well-known result. Turning around the previous numerical example, if you want to run at most a 5% risk of missing X, then the numerator on the RHS of equation 27 is 3. (Remember that e cubed is 20, to a good approximation.) Therefore if the true population-based probability is half a percent, the sample size had better be at least 600.
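Here is a two-line arithmetic check of equation 26 and equation 27, using the numbers from the example above.

    // Arithmetic check of equation 26 and equation 27,
    // with p = 0.005 and N = 600 as in the example above.
    #include <cstdio>
    #include <cmath>

    int main() {
        const double p = 0.005, theta = 0.05;
        printf("theta for N=600: %.4f\n", std::pow(1.0 - p, 600));  // ~0.049, i.e. ~5%
        printf("N for theta=5%%:  %.0f\n", std::log(1.0/theta)/p);  // ~599
        return 0;
    }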

Turning the crank in the other direction, one could perhaps obtain:

p = ln(1/θ) / N
             (28)

Alas this equation is open to misinterpretation, to say the least. The p in equation 28 probably doesn’t mean the same thing as the p in equation 27. In particular, it would be more-or-less conventional to interpret equation 26 as some sort of conditional probability of getting N misses, conditioned on θ and p:

Pr[N | θ, p] = ????
             (29)

In contrast, it would be more-or-less conventional to interpret equation 28 as some sort of conditional probability of p, conditioned on θ and N:

Pr[p | θ, N] = ????
             (30)

See section 4 for further discussion of the notational and conceptual problems with equation 28.

Before we can find a reasonable equation for p, we need to understand the prior distribution of p-values, and take that into account, presumably by use of the Bayes inversion formula (equation 33).

Here is an example of the sort of max-a-posteriori question we would like to ask: Given some small threshold θ, we would like to find some ε that will allow us to say with high confidence (1−θ) that the true probability p is less than ε.

It is possible to answer such a question, but it’s more work than I feel like doing at the moment.

3.4  Lopsided Error Bars

Results such as equation 17 and equation 18 apply in the rather unlikely scenario where some oracle has told us the true value of the probability p in the general population. More generally, we can consider p to be the parameters of the model, and we are calculating the a priori probability:

P[data | model](⋯)  
             (31)

Note that this particular a priori probability is technically called the likelihood. Note that “likelihood” is definitely not an all-purpose synonym for probability. This is a trap for the unwary.

In the usual practical situations, we don’t know p, and we are trying to estimate it from the data. This corresponds to calculating the a posteriori probability:

P[model | data](⋯)  
             (32)

One way to calculate this is via the Bayes inversion formula:

P[model | data](x|y) := P[data | model](y|x) P[model](x) / P[data](y)
             (33)

The marginal P[data](⋯) in the denominator on the RHS is the least of our worries. It serves as a normalization factor for the RHS as a whole. If it were the only thing we didn’t know, we could calculate it instantly, just by doing the normalization.

In contrast, the marginal P[model](⋯) in the numerator on the RHS is very often not known, and not easy to calculate. In general, this is an utterly nontrivial problem. Sometimes it is possible to hypothesize a noncommittal “flat prior” on the parameters of the model, but sometimes not.

As a first step in the right direction, let’s take another look at equation 15 and equation 17. Rather than thinking of the observed data as a±b in terms of a known p, let us ask what values of p are consistent with the observed a.

We can use a itself as an estimator for the middle-of-the-road nominal value of p, by inverting equation 15.

p_nom := a
             (34)

Note the subtle distinction; if we knew p a priori we would use equation 15 to calculate a. In contrast, here we start with an observed value for a, and then use equation 34 to estimate the nominal value of p.

Using the same logic, if we knew p a priori we might use equation 17 to calculate an error bar, disregarding correlations. In contrast, here we start with an observed value for a, and obtain an estimate for the top of the high-side error bar on a by solving the following equation:

a = p_high − √(p_high (1−p_high) / N)
             (35)

where the length of the error bar is given by the second term on the RHS, i.e. the square-root term. The whole idea of «error bars» is conceptually flawed, but let’s ignore that problem for the moment. In equation 35, we are seeking the high-side error bar on a, which corresponds to the low-side error bar on p_high. Conceptually, we are searching through all possible p_high values to find one whose low-side error bar extends down far enough to reach a. The length of the error bar is calculated from equation 17 by plugging in the value of p_high ... emphatically not by plugging in the value of p_nom.

Similarly the low-side estimate of p is found by searching through all possible values for p_low to find one whose high-side error bar extends up far enough to reach a.

a = p_low + √(p_low (1−p_low) / N)
             (36)

Let’s turn the crank on the algebra:

p_high = a + √(p_high (1−p_high) / N)
p_high − a = √(p_high (1−p_high) / N)
p_high² − 2 p_high a + a² = p_high (1−p_high) / N
                          = p_high / N − p_high² / N
(1+1/N) p_high² + (−2a − 1/N) p_high + a² = 0
             (37)

We finish the solution using the quadratic formula:

p_high = [ (2a + 1/N) + √((2a + 1/N)² − 4 (1+1/N) a²) ] / [ 2 (1+1/N) ]
             (38)

If you want to increase the confidence level, you can find the kσ error bar (instead of the conventional 1σ error bar) by replacing N by N/k² in all of these formulas.
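Here is a sketch of equation 38 and its mirror image in code. The helper name wilson is just a label, anticipating the connection to the Wilson interval discussed below.

    // Sketch of equation 38 and its mirror image: the k-sigma band on
    // an observed fraction a out of N respondents.
    #include <cmath>
    #include <cstdio>

    // sign = +1 gives p_high; sign = -1 gives p_low (the other root).
    double wilson(double a, double N, double k, int sign) {
        double n = N / (k*k);       // k-sigma: replace N by N/k^2
        double A = 1.0 + 1.0/n;
        double B = 2.0*a + 1.0/n;   // equation 38: p = (B +/- sqrt(B^2 - 4 A a^2)) / 2A
        return (B + sign*std::sqrt(B*B - 4.0*A*a*a)) / (2.0*A);
    }

    int main() {
        // Example: nobody out of N=50 favors X; the 1-sigma band on p:
        printf("p_high = %.4f\n", wilson(0.0, 50, 1, +1));   // 1/(N+1) ~ 0.0196
        printf("p_low  = %.4f\n", wilson(0.0, 50, 1, -1));   // 0
        return 0;
    }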

The results are plotted in figure 6 and figure 7. The figures show the conventional 1σ error band in green, and show the 2σ error band in yellow. Also, for comparison, the dotted lines in the figures show the grossly naïve error bars used in the NPR article, namely the 3.8% error bars based on the 1/√N formula.

Figure 6: Bernoulli Process; Wilson Interval
Figure 7: Zoomed-In Wilson Interval

Some observations:

In the special case where a is very small, equation 37 simplifies to

p_high = 1 / (N (1+1/N))    (for a=0)
       ≈ 1/N    (for a=0 and large N)
             (39)

which is of the same form as equation 28, where the confidence level is θ = exp(−k²).

Note that when p is known, the uncertainty on a is zero when p=0, in accordance with equation 17 ... whereas when a is known, the uncertainty on p is definitely not zero, even when a is zero, in accordance with equation 39, assuming N is finite. This is an important and utterly nontrivial conceptual point.

For a=0, equation 38 reduces to equation 39. For a=0.5, it reduces to

p_high = 0.5 + 0.5 √(1 − 1/(1+1/N))    (for a=0.5)
       ≈ 0.5 + 0.5/√N    (for a=0.5 and large N)
             (40)

which gives the same result as equation 19 in the large-N limit ... whereas for small N it is different.

When a=1, this reduces to

p_high = [ (2 + 1/N) + √((2 + 1/N)² − 4 (1+1/N)) ] / [ 2 (1+1/N) ]
       = [ (2 + 1/N) + √((4 + 4/N + 1/N²) − 4 − 4/N) ] / [ 2 (1+1/N) ]
       = [ (2 + 1/N) + √(1/N²) ] / [ 2 (1+1/N) ]
       = 1    (for a=1)
             (41)

which makes sense. The top of the error bar cannot ever be greater than p=1 ... and also it cannot be less than a, so for a=1 the error bar is quite well pinned down.

The calculations for plow are very similar, except we are interested in the other root of the quadratic. The results exhibit a mirror-image symmetry. You can easily verify that for a=0, the lower error bar is zero; for a=0.5 the lower error bar has length 0.5/√N, and for a=1 the lower error bar has length 1/N.

The formulas we have just calculated correspond to the Wilson interval (as discussed in e.g. reference 2). This is certainly not the most sophisticated way of defining the confidence interval ... but it is better than the 3.8% error bars that were blindly assumed in figure 1, better by at least two large steps in the right direction.

Note that the naïve notion that the length of the error bar “should” be inversely proportional to √N only works when the probability is near 50% ... and fails miserably otherwise, as we see in figure 2.

4  Philosophical and Pedagogical Remarks

  1. As the facetious proverb says:

    When all else fails,
    look at the data.
         

    Professional statisticians spend enormous amounts of time just staring at data. They cover the walls of their office with plots. If the first representation doesn’t tell the tale, they construct another, and another, and another.

    As a corollary: If you have three variables with a constraint, a ternary plot such as figure 4 might be just the ticket. It takes a few hours of staring at such things before you get good at interpreting them ... so don’t panic if it’s not 100% clear at first sight. See reference 4.

  2. You can do a lot with Monte Carlo. It’s not a substitute for thinking ... but Monte Carlo plus thinking is quite a bit better than either one separately.
  3. You can do a lot with SVD. A matrix in its undecomposed form is next to impossible to understand (unless it happens to be already diagonal) ... especially in more than two dimensions. SVD will diagonalize it for you. It’s astonishing how many postdocs there are running around, with physics degrees from Big Name institutions, who have no clue about SVD.
  4. Possibly constructive suggestion: I have not done a systematic survey, but I’ve had good luck with the armadillo package. It provides a thin layer of nice C++ bindings on top of the LAPACK linear algebra package. See reference 5.

    The code I used for this little project is given in reference 6 and reference 7.

  5. Some advice: It’s bad luck to cite a reference that you haven’t read. For example, the NPR article (reference 1) cites reference 2, which is better than not citing it ... but alas the article characterizes the more-sophisticated results as being “a little smaller” when in fact they are dramatically smaller, so much smaller as to contradict the main point of the article.
  6. If you want to complain that writing equation 28 violates several of the most sacred principles of algebra, I’d say you have a valid complaint. One of the rules of algebra is that a given symbol should have the same meaning every time it appears. Also there are rules about what the equals-sign means. The conventional notation for conditional probabilities violates these rules. It is terrible, and has been terrible for a very long time. It is practically begging to be misunderstood.

    It’s no wonder that people find probability and statistics to be confusing. The ideas aren’t particularly complicated, but the terminology stinks and the notation stinks.

5  Mutations

5.1  Systematic Error

For a finite-sized sample, there will always be sample-to-sample fluctuations ... the so-called sampling error. However, very commonly this is not the only source of uncertainty. There can be methodological mistakes, or other systematic errors.

For the sample presented in figure 1, it is conceivable that the total uncertainty could be on the order of 3.8% ... or it could be much smaller than that, or much larger than that.

In fact, there are eleventeen reasons to think that the data is horribly unreliable. When the none-of-the-above response has a higher frequency than the top two candidates combined, there is good reason to think that the respondents have not made up their minds. This source of uncertainty will not go away, no matter how large you make the sample-size N. If you pose a question to people who don’t know and don’t care, it doesn’t matter how many people you ask.

In any case, there is not the slightest reason to think that the systematic error will be proportional to the statistical uncertainty. It is just ridiculous to use a formula such as 1/√N. If it refers to the statistical uncertainty, it has no meaning, because it’s the wrong formula. If it refers to the total uncertainty, it has even less meaning.

5.2  Possible Origin of the Bogus «Sampling Error» Numbers

Pollsters are notorious for saying that their results have a «sampling error» equal to 1/√N, where N is the number of respondents surveyed. For example, reference 1 and figure 1 use the value 1/√679 ≈ 3.8%.

It must be emphasized that 1/√N is not a real mathematical result. To obtain such a formula, you would have to make four or five conceptual errors. If you arrange these errors just right and sprinkle fairy dust on them, you might kinda sorta obtain a numerically-correct answer under some conditions ... but never under the conditions that apply to figure 1 or figure 2.

  1. You have to assume that the basic concept of “error bars” make sense. (In reality, the concept is highly problematic when there are multiple variables involved, especially when the variables are correlated and/or when there is a non-Gaussian distribution.)
  2. You have to assume all the probabilities are comparable in magnitude. (This is clearly not the case in figure 1; if we take the data in the figure at face value, 5th place is greater than 15th place by a factor of infinity. Even if we discount for roundoff error, there is a factor of at least 19.)
  3. You have to assume that the main variable of interest is near 50%. (This is clearly not the case in figure 1.)

    Note that the previous two assumptions are not equivalent; indeed they are logically independent. The distribution {1/2, 1/2} satisfies both assumptions; the distribution {1/3, 1/3, 1/3} satisfies the first assumption but not the second; the distribution {1/2, 1/4, 1/4} might satisfy the second but not the first, and the distribution {0.7, 0.2, 0.1} satisfies neither.

  4. You have to ignore the fact that the variables are necessarily correlated. Since the variables are probabilities, there is a constraint that says they must add up to 100%. Therefore you cannot increase one variable without borrowing from another. This means any naïve attempt to calculate the so-called error bars is very likely to get the wrong answer.
  5. On the other edge of the same sword, any naïve attempt to use the so-called error bars is very likely to get the wrong answer, if you assume the probabilities are independent when in fact they are not. This does not cancel the previous error, and indeed exacerbates it.
  6. Pollsters commonly throw in a fudge factor, an ill-founded factor of 2 in the so-called “sampling error” formula. That is, they write «1/√N» even though everything we know about coin tossing, random walks, and the central limit theorem tells us that the mathematically correct formula should be 0.5/√N when the probabilities are near 50% ... and √(p(1−p)/N) more generally.

It is rather common for pollsters to take an interest in situations where there are two major candidates, both of whom receive approximately 50% support. In such a situation, assumption #2 and assumption #3 are valid, and do not count as mistakes. Assumption #4 is a mistake, and introduces an erroneous factor of √2. Assumption #5 is another mistake, and introduces another factor of √2 in the same direction, if we imagine that the score for one candidate goes up while the score for the other candidate goes down. Item #6 miraculously cancels the two previous mistakes, possibly leading to a numerically-correct answer ... but only in this special case.

This result is conceptually unsound. In particular, the 1/√N is bogus, and obviously so. Just as obviously, it is a mistake to treat the two scores as statistically independent. You could equally well apply the same formula to a scenario where both candidates’ scores increase, which has zero probability. It is mathematically impossible. The only real possibility is that one candidate increases at the expense of the other. Any formula that assigns the same probability to the real scenario and the impossible scenario is obviously bogus.

In the multi-candidate scenario presented in figure 1, assumptions #1, #2, and #3 are disastrously wrong. A modestly better formula for the error bar in such a scenario is √(pᵢ (1−pᵢ)/N). This result is derived in section 3.2. It reduces to the aforementioned special case 0.5/√N when pᵢ is near 50%, but when pᵢ is small the correction is quite substantial, as you can see by looking at the black error bars in figure 2.
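To see how substantial, compare the bogus 1/√N bar against √(p(1−p)/N) at the sample size from figure 1. The p values below are assumed, purely for illustration.

    // Arithmetic sketch comparing the bogus 1/sqrt(N) error bar against
    // sqrt(p(1-p)/N) at N=679, the sample size from figure 1.
    #include <cstdio>
    #include <cmath>
    #include <initializer_list>

    int main() {
        const double N = 679;
        printf("naive 1/sqrt(N):          %.4f\n", 1.0/std::sqrt(N));  // ~0.038
        for (double p : {0.5, 0.2, 0.05, 0.01})   // assumed p values
            printf("p=%4.2f  sqrt(p(1-p)/N): %.4f\n", p, std::sqrt(p*(1-p)/N));
        return 0;
    }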

6  References

1.
Danielle Kurtzleben,
“The Math Problem Behind Ranking The Top 10 GOP Candidates”
http://www.npr.org/sections/itsallpolitics/2015/05/29/410524780/the-gop-has-an-overcrowded-debate-problem

2.
Wikipedia article, “Binomial proportion confidence interval”
https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval

3.
John Denker,
“Introduction to Probability”
www.av8n.com/physics/probability-intro.htm

4.
John Denker,
“Psychrometric Charts, and the Evil of Axes”
www.av8n.com/physics/axes.htm

5.
Armadillo package, SVD routines
http://arma.sourceforge.net/docs.html#svd

6.
John Denker,
“Code for Modeling a Bernoulli Process”
./bernoulli-nway.c

7.
John Denker,
“Spreadsheet for Making Ternary Plots”
./ternary-plot.gnumeric
./ternary-plot.xls