
Copyright © 2008 jsd

1  Data Analysis

There is always a question of how much to preprocess the data before modeling it.

In particular, given data that is intrinsically nonlinear, there is often a question as to whether one should linearize the data before modeling it, or fit a nonlinear model to the raw data directly.

A linear representation has one huge advantage: When looking at a graph, the human eye is good at seeing linear relationships, and at detecting deviations from linearity.

However, there are serious disadvantages to preprocessing the data. The disadvantages usually outweigh any advantage. The devil is in the details. There are no easy answers, except in rare special cases.

Here’s a typical example: Suppose you are using a pressure gauge to monitor the amount of gas in a container. Assume constant temperature and constant volume. The gas is being consumed by some reaction. Suppose we have first-order reaction kinetics, so the data will follow the exponential decay law, familiar from radioactive decay.

All pressure gauges are imperfect. There will be some noise in the pressure reading. So the raw data (P) will be of the form:

P = exp(−t) ± noise              (1)

During the part of the experiment where the reading is large compared to the noise, there is an advantage to plotting the data on semi-log axes, which results in a straight line that is easy to interpret.

However, during the part of the experiment where the readings have decayed to the point where the noise dominates, linearizing the data is a disaster. You will find yourself taking the logarithm of negative numbers.
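To make this concrete, here is a minimal sketch in Python that simulates equation 1 and then tries to take the logarithm of each reading. The noise level (0.01), the sampling times, the random seed, and the helper names simulate_readings and safe_log are assumptions chosen for illustration; they are not part of the experiment described above.

```python
import math
import random

def simulate_readings(times, sigma, seed=0):
    """Return exp(-t) plus Gaussian noise of standard deviation sigma."""
    rng = random.Random(seed)
    return [math.exp(-t) + rng.gauss(0.0, sigma) for t in times]

def safe_log(p):
    """Natural log of a reading, or None if the reading is not positive."""
    return math.log(p) if p > 0 else None

# Hypothetical sampling schedule: t = 0.0, 0.1, ..., 10.0
times = [0.1 * i for i in range(101)]
readings = simulate_readings(times, sigma=0.01)

# Taking the log of a negative reading is an error, not just a bad number.
try:
    math.log(-0.01)
except ValueError as err:
    print("math.log(-0.01) fails:", err)

# Times at which the reading cannot be placed on a log axis at all.
lost = [t for t, p in zip(times, readings) if safe_log(p) is None]
print("readings that cannot be plotted on semi-log axes:", len(lost))
if lost:
    print("earliest such reading is at t =", round(lost[0], 1))
```

Quietly dropping the readings in lost would bias whatever is fitted to the remainder, since only the low readings get dropped.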

In the borderline region, where the readings are only slightly larger than the noise, you’ve still got problems. The data will be distorted. The attempt to describe it by a straight line will be deceptive. The effect may be subtle or not so subtle. Figure 1 shows an example:

Figure 1: Lopsided Data

The data looks like the proverbial hockey stick. It is straight at first, then it bends over.

Alas, appearances can be deceiving. This data actually comes from a simple exponential decay, plus noise. You might think it “should” be linear on semi-log axes. In Figure 2, the red line shows what the data would look like in the absence of noise.

Figure 2: Lopsided Data, With Ideal Model Line

Because of the nonlinearity of the logarithm, points that fall above the line are only slightly above the line, while points that fall below the line are often far (possibly infinitely far) below the line.
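The lopsidedness is easy to quantify. The sketch below takes a noise-free value y and a symmetric error of ±delta, and reports how far the upper and lower ends land from log(y) on a log axis. The function name log_error_bars and the sample numbers are hypothetical, chosen only to illustrate the point.

```python
import math

def log_error_bars(y, delta):
    """Distances (up, down) from log(y) to log(y + delta) and log(y - delta).

    The downward distance is infinite when y - delta is zero or negative,
    because the logarithm does not exist there.
    """
    up = math.log((y + delta) / y)
    down = math.log(y / (y - delta)) if delta < y else math.inf
    return up, down

# Same symmetric error in the raw data, at several signal levels.
delta = 0.04
for y in (1.0, 0.2, 0.05, 0.04):
    up, down = log_error_bars(y, delta)
    print(f"y = {y:5.2f}   up = {up:.3f}   down = {down:.3f}")
```

When y is large compared to delta, the two distances are nearly equal. As y approaches delta, the upward distance stays modest while the downward distance grows without bound.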

In both of these figures, the red triangles near the bottom are stand-ins for points that are off-scale. In some sense they are infinitely off-scale, since they correspond to the logarithm of something less than or equal to zero.

In general, it pays to be careful about off-scale data. The proverb about “out of sight, out of mind” is a warning. Plotting stand-ins, as done here, is one option. It doesn’t entirely solve the problem, but it at least serves to remind you that the problem exists.

For more about the fundamental notions of uncertainty that underlie any notion of error bars, see reference 1.

Constructive suggestion: In this situation, I recommend:

  1. Fit the raw data to an exponential. That is, don’t linearize the data; do a nonlinear fit to the unmodified data.
  2. Calculate the residuals, i.e. subtract the model from the raw data. Plot the residuals. They should be evenly distributed around zero, so there should be no temptation to plot them on funny axes. If the model is imperfect, you will see it here, as some sort of non-random trend in the residuals.
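The two steps above can be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions, not a definitive implementation: the model P = A·exp(−k·t), the simulated data, the noise level, and all function names are hypothetical. For a fixed decay rate k the best amplitude A has a closed form, so the sketch does a one-dimensional golden-section search over k. In practice one would more likely call a library routine such as scipy.optimize.curve_fit.

```python
import math
import random

def simulate(n=81, dt=0.1, sigma=0.02, seed=1):
    """Hypothetical raw data: exp(-t) plus Gaussian noise."""
    rng = random.Random(seed)
    t = [i * dt for i in range(n)]
    p = [math.exp(-ti) + rng.gauss(0.0, sigma) for ti in t]
    return t, p

def best_amplitude(k, t, p):
    """Least-squares amplitude A for the model A*exp(-k*t), with k held fixed."""
    e = [math.exp(-k * ti) for ti in t]
    return sum(pi * ei for pi, ei in zip(p, e)) / sum(ei * ei for ei in e)

def sse(k, t, p):
    """Sum of squared residuals at rate k, using the best amplitude for that k."""
    a = best_amplitude(k, t, p)
    return sum((pi - a * math.exp(-k * ti)) ** 2 for ti, pi in zip(t, p))

def fit_exponential(t, p, k_lo=0.01, k_hi=10.0, tol=1e-6):
    """Golden-section search for the k that minimizes sse; returns (A, k)."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    while k_hi - k_lo > tol:
        k1 = k_hi - g * (k_hi - k_lo)
        k2 = k_lo + g * (k_hi - k_lo)
        if sse(k1, t, p) < sse(k2, t, p):
            k_hi = k2   # minimum lies in [k_lo, k2]
        else:
            k_lo = k1   # minimum lies in [k1, k_hi]
    k = 0.5 * (k_lo + k_hi)
    return best_amplitude(k, t, p), k

# Step 1: fit the unmodified data, including any negative readings.
t, p = simulate()
a_fit, k_fit = fit_exponential(t, p)
print(f"A = {a_fit:.3f}, k = {k_fit:.3f}")

# Step 2: residuals = raw data minus model, to be plotted on ordinary linear axes.
resid = [pi - a_fit * math.exp(-k_fit * ti) for ti, pi in zip(t, p)]
print(f"mean residual = {sum(resid) / len(resid):.4f}")
```

Note that negative readings need no special treatment here. They enter the fit on the same footing as every other point, which is the whole reason for not taking logarithms first.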

Bottom line: There are no easy answers. Relying on the human eye to analyze data is a losing proposition.

2  References

1. John Denker, “Measurements and Uncertainties versus Significant Digits or Significant Figures” ./uncertainty.htm