Copyright © 2008 jsd

Data Analysis

1  Twenty Questions

To illustrate some fundamental ideas that are prerequisites for any understanding of data analysis, let’s analyze the well-known parlor game called “Twenty Questions”.

The rules are simple: One player (the oracle) picks a word, any word in the dictionary, and the other player (the analyst) tries to figure out what word it is. The analyst is allowed to ask 20 yes/no questions.

When children play the game, the analyst usually uses the information from previous questions to design the next question. Even so, the analyst often loses the game.

In contrast, an expert analyst never loses. What’s more, the analyst can write down all 20 questions in advance, which means none of the questions can be designed based on the answers to previous questions. This means we have a parallel algorithm, in the sense that the oracle can answer the questions in any order.

I’ll even tell you a set of questions sufficient to achieve this goal:

1) Is the word in the first half of the dictionary?
2) Is the word in the first or third quarter?
3) Is it in an odd-numbered eighth?
*) et cetera.

You can understand this as follows: There are only about 2^17 (roughly 131,000) words in the largest English dictionaries. Each question cuts the set of remaining candidates in half, so if you use the questions suggested above, after about 17 questions you know exactly where the word sits on a particular page of the dictionary.
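The halving scheme can be sketched in a few lines of Python. This is only an illustration, assuming a dictionary of exactly 2^17 entries identified by their position (0 for the first word); the names `oracle_answer` and `analyst_locate` are made up for this sketch. Question number q asks whether the word lies in an odd-numbered slice when the dictionary is cut into 2^q equal slices, which amounts to asking for one binary digit of the word's position.

```python
N_QUESTIONS = 17              # assumption: the dictionary has exactly 2**17 entries
N_WORDS = 2 ** N_QUESTIONS

def oracle_answer(word_index, question):
    """Answer question number `question` (1-based) about the secret word.

    The dictionary is cut into 2**question equal slices, and the question is
    whether the word lies in the 1st, 3rd, 5th, ... slice.
    """
    slice_size = N_WORDS >> question
    return (word_index // slice_size) % 2 == 0

def analyst_locate(answers):
    """Recover the word's position from a {question: yes/no} mapping.

    Each "no" contributes one binary digit of the position, so the order in
    which the answers arrive does not matter.
    """
    index = 0
    for question, yes in answers.items():
        if not yes:
            index += N_WORDS >> question
    return index

secret = 98765  # arbitrary position chosen for the demonstration
# The oracle answers the pre-written questions in reverse order, to show
# that no question depends on an earlier answer.
answers = {q: oracle_answer(secret, q) for q in reversed(range(1, N_QUESTIONS + 1))}
found = analyst_locate(answers)
print(found == secret)
```

Because every question is fixed in advance, the same questions work for any secret word and any answering order.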

Note that if you ask only 15 questions about a dictionary with 2^17 entries, you will be able to narrow the search down to a smallish area on the page (about four words), but you won’t know which of the words in that area is the right one. You will have to guess, and your guess will be wrong most of the time. The situation is shown in figure 1.
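The arithmetic behind this can be sketched as follows. The sketch assumes that every word is equally likely, that each question halves the remaining candidates, and that the final guess is picked uniformly from whatever candidates remain; the function name is hypothetical.

```python
from fractions import Fraction

def chance_of_correct_guess(n_questions, n_words=2 ** 17):
    """Chance that a uniform guess among the remaining candidates is right.

    Assumes n_words is a power of two and each question halves the candidates.
    """
    remaining = max(1, n_words // 2 ** n_questions)
    return Fraction(1, remaining)

for n in (0, 10, 15, 16, 17, 20):
    print(n, chance_of_correct_guess(n))
```

Under these assumptions the chance of being right stays small until the number of questions gets close to 17, then climbs steeply, and extra questions beyond 17 add nothing.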

Figure 1: Correctness versus Information

You can see that it is important to choose the questions wisely. If you use non-incisive questions such as

1) Is the word “aardvark”?
2) Is the word “abacus”?
3) Is the word “abaft”?
*) et cetera.

it would require, on average, tens of thousands of questions to find the right word.

This can be seen as an exercise in hypothesis testing, of the sort discussed in reference 1. We have more than a hundred thousand hypotheses, and the point is that we cannot consider them one by one. We need a testing strategy that can rule out huge groups of hypotheses at one blow. Another example (on a smaller scale) can be found in reference 2. Additional examples can be found in reference 3 and reference 4.

2  Fitting versus Overfitting

Here’s a parable. Once upon a time, some students were doing a Science Fair project. They decided to measure the effect that a certain chemical had on the growth of plants. They tried six different concentrations. They hypothesized that the effect would be well described by a polynomial. And sure enough, it was. Figure 2 shows the six data points, with a least-squares fit to a fifth-order polynomial:

Figure 2: Maximum Likelihood Fit – Overfitting

As is the case with least-squares fitting in general, this is a maximum likelihood analysis. That is to say, it calculates the a priori conditional probability of getting this data, given the model, i.e.

likelihood = P(data | model)              (1)

and we choose the model to maximize the likelihood. Note that we are using the word likelihood in a strict technical sense here.

We emphasize that likelihood, in the technical sense, is defined by equation 1. The likelihood is also called the a priori probability ... which stands in contrast to the a posteriori probability, as given by equation 2:

apost = P(model | data)              (2)

which is emphatically not the likelihood.

This is important because in the context of data analysis, MAP is what we want, where MAP stands for maximum a posteriori as defined by equation 2. It is a scandal that almost everybody uses maximum likelihood instead of MAP. Maximum likelihood is easier, so this can be considered a case of “looking under the lamp-post” even though the thing you are looking for is surely somewhere else.
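For orientation, the two quantities in equation 1 and equation 2 are connected by Bayes' rule, a standard identity of probability theory. In the notation below, P(model) is the prior probability assigned to each member of the family of models before looking at the data; how to choose that prior is not specified here and is left open in this sketch.

```latex
P(\text{model} \mid \text{data})
  = \frac{P(\text{data} \mid \text{model}) \, P(\text{model})}{P(\text{data})}
```

Maximizing the likelihood alone amounts to ignoring the factor P(model), which is only harmless if every model under consideration is considered equally plausible in advance.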

Returning to our students’ research, we can say that the polynomial fit is in fact a maximum likelihood solution, and we can also say that their data is entirely consistent with their model and consistent with their hypothesis.

On the other hand, common sense tells us that whatever this is, it isn’t what we want. Maximum likelihood isn’t sufficient (or necessary). Being consistent with the model isn’t sufficient. Being consistent with the hypothesis isn’t sufficient.

Science is about making predictions, and the students’ polynomial is a spectacularly bad predictor. Indeed

y = 0              (3)

is a better predictor of future data than

y = polynomial(x)              (4)

would be, for the fitted polynomial or any similar polynomial.
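A small numerical sketch makes the point concrete. The six data values below are invented for illustration (they are not the students' data): they scatter around zero with no real trend. With six points at distinct concentrations, the least-squares fifth-order polynomial passes through every point exactly, so it can be evaluated by Lagrange interpolation without any fitting library.

```python
# Hypothetical data, made up for this sketch: six concentrations, with a
# measured effect that is nothing but scatter around zero.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.3, -0.4, 0.5, -0.5, 0.4, -0.3]

def fitted_polynomial(x):
    """Evaluate the fifth-order polynomial through all six points (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# The fit reproduces the existing data perfectly ...
residuals = [y - fitted_polynomial(x) for x, y in zip(xs, ys)]
print("largest residual:", max(abs(r) for r in residuals))

# ... but its prediction for the next concentration is far outside the range
# of anything that was measured.
prediction_at_6 = fitted_polynomial(6.0)
print("polynomial prediction at x = 6:", prediction_at_6)
print("y = 0 prediction at x = 6:", 0.0)
```

For this made-up data the polynomial predicts about -28 at x = 6, even though no measured value lies outside the range from -0.5 to 0.5. If the next measurement resembles the earlier ones, y = 0 is the far better prediction.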

You may say that this parable is unrealistic, because nobody would be so silly as to fit six data points with a fifth-order polynomial. There is a rule against it.

More generally, there is a rule against overfitting.

Well, yes, such rules exist. But rather than learning those rules by rote, we would be much better off understanding the principles behind the rules.

To a first approximation, the principled analysis starts from the observation that there is almost never a single hypothesis but rather a family of hypotheses.

3  Preprocessing, or Not

There is always a question of how much to preprocess the data before modeling it.

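In particular, given data that is intrinsically nonlinear, there is often a question as to whether one should transform the data so that it can be described by a linear model, or fit the raw data with a nonlinear model.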

A linear representation has one huge advantage: When looking at a graph, the human eye is good at seeing linear relationships, and in detecting deviations from linearity.

However, there are serious disadvantages to preprocessing the data. The disadvantages usually outweigh any advantage. The devil is in the details. There are no easy answers, except in rare special cases.

Here’s a typical example: Suppose you are using a pressure gauge to monitor the amount of gas in a container. Assume constant temperature and constant volume. The gas is being consumed by some reaction. Suppose we have first-order reaction kinetics, so the data will follow the exponential decay law, familiar from radioactive decay.

All pressure gauges are imperfect. There will be some noise in the pressure reading. So the raw data (P) will be of the form:

P = exp(−t) ± noise              (5)

During the part of the experiment where the reading is large compared to the noise, there is an advantage to plotting the data on semi-log axes, which results in a straight line that is easy to interpret.

However, during the part of the experiment where the readings have decayed to the point where the noise dominates, linearizing the data is a disaster. You will find yourself taking the logarithm of negative numbers.

In the borderline region, where the readings are only slightly larger than the noise, you’ve still got problems. The data will be distorted. The attempt to describe it by a straight line will be deceptive. The effect may be subtle or not so subtle. Figure 3 shows an example:

Figure 3: Lopsided Data

The data looks like the proverbial hockey stick. It is straight at first, then it bends over.

Alas, appearances can be deceiving. This data actually comes from a simple exponential decay, plus noise. You might think it “should” be linear on semi-log axes. In figure 4, the red line shows what the data would look like in the absence of noise.

Figure 4: Lopsided Data, With Ideal Model Line

Because of the nonlinearity of the logarithm, points that fall above the line are only slightly above the line, while points that fall below the line are often far (possibly infinitely far) below the line.
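The lopsidedness is easy to check numerically. The sketch below assumes the decay law of equation 5 and a fixed, hypothetical noise amplitude of 0.05; it compares how far a reading moves on a logarithmic axis when the noise pushes it up versus down by the same amount.

```python
import math

NOISE = 0.05  # hypothetical noise amplitude, in the same units as P

def log_deviation(t, noise):
    """Shift of log(P) caused by adding `noise` to the ideal reading exp(-t).

    Returns None when the noisy reading is zero or negative, since the
    logarithm does not exist there (the point is off-scale).
    """
    clean = math.exp(-t)
    noisy = clean + noise
    if noisy <= 0:
        return None
    return math.log(noisy) - math.log(clean)

deviations = {}
for t in (0.5, 2.5, 4.0):
    deviations[t] = (log_deviation(t, +NOISE), log_deviation(t, -NOISE))
    print(t, deviations[t])
```

Early in the decay the upward and downward shifts are nearly equal in size. Later the downward shift is much the larger of the two, and once the ideal reading drops below the noise amplitude the downward point cannot be plotted at all.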

In both of these figures, the red triangles near the bottom are stand-ins for points that are off-scale. In some sense they are infinitely off-scale, since they correspond to the logarithm of something less than or equal to zero.

In general, it pays to be careful about off-scale data. The proverb about “out of sight, out of mind” is a warning. Plotting stand-ins, as done here, is one option. It doesn’t entirely solve the problem, but it at least serves to remind you that the problem exists.

For more about the fundamental notions of uncertainty that underlie any notion of error bars, see reference 5.

Constructive suggestion: In this situation, I recommend

  1. Fit the raw data to an exponential. That is, don’t linearize the data; do a nonlinear fit to the unmodified data.
  2. Calculate the residuals, i.e. subtract the model from the raw data. Plot the residuals. They should be evenly distributed around zero, so there should be no temptation to plot them on funny axes. If the model is imperfect, you will see it here, as some sort of non-random trend in the residuals.
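The two steps above can be sketched with nothing but the Python standard library. Everything specific here is an assumption made for illustration: the data is synthetic (a unit-amplitude decay with Gaussian noise of standard deviation 0.02), the model is A·exp(−k·t), and the fitting method is a simple repeated grid search over k, with the best A for each k found in closed form. A real analysis would more likely use a library routine for nonlinear least squares.

```python
import math
import random

def best_amplitude(ts, ps, k):
    """For a fixed decay rate k, the least-squares amplitude has a closed form."""
    basis = [math.exp(-k * t) for t in ts]
    return sum(b * p for b, p in zip(basis, ps)) / sum(b * b for b in basis)

def sum_squared_error(ts, ps, k):
    a = best_amplitude(ts, ps, k)
    return sum((p - a * math.exp(-k * t)) ** 2 for t, p in zip(ts, ps))

def fit_exponential(ts, ps, k_lo=0.01, k_hi=10.0, rounds=4, grid=100):
    """Fit p = A * exp(-k * t) to the raw data; no logarithms are taken."""
    k_best = k_lo
    for _ in range(rounds):
        step = (k_hi - k_lo) / grid
        candidates = [k_lo + i * step for i in range(grid + 1)]
        k_best = min(candidates, key=lambda k: sum_squared_error(ts, ps, k))
        # Zoom in on the neighborhood of the best candidate and search again.
        k_lo, k_hi = max(k_best - step, 1e-9), k_best + step
    return best_amplitude(ts, ps, k_best), k_best

# Synthetic raw data (an assumption of this sketch, not a measurement).
rng = random.Random(0)
ts = [0.1 * i for i in range(60)]
ps = [math.exp(-t) + rng.gauss(0.0, 0.02) for t in ts]

# Step 1: nonlinear fit to the unmodified data.
amplitude, rate = fit_exponential(ts, ps)

# Step 2: residuals, i.e. raw data minus model. Plot these against t.
residuals = [p - amplitude * math.exp(-rate * t) for t, p in zip(ts, ps)]
mean_residual = sum(residuals) / len(residuals)
print("A =", amplitude, " k =", rate, " mean residual =", mean_residual)
```

Negative or near-zero readings cause no trouble here, because the fit never takes a logarithm. The residuals live on an ordinary linear axis, where any systematic trend would indicate a flaw in the model.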

Bottom line: There are no easy answers. Relying on the human eye to analyze data is a losing proposition.

This document is a stub. There is much more that could be said about this.

4  References

  1. John Denker, “How to Define Hypothesis” www.av8n.com/physics/hypothesis.htm
  2. John Denker, “The Twelve-Coins Puzzle”
  3. John Denker, “Learning, Remembering, and Thinking”
  4. For a general discussion of what entropy is, see:
  5. John Denker, “Measurements and Uncertainties versus Significant Digits or Significant Figures”
