[Contents]

Numerically Well-Behaved Version

John Denker

## 1  The Smart Way

We wish to find the roots of the equation

 a x2 + b x + c = 0
(1)

That is, we wish to find values of x that satisfy the equation, for given values of the coefficients a, b, and c.

The formula in equation 2 is recommended. It is numerically well-behaved. As we shall see in figure 1, this version performs better than the unsophisticated “textbook” version in equation 7, better by many orders of magnitude in situations where one root is much larger than the other.

xbig =

 −b 2a
(1 +
 1 − 4 a c / b2
(2a)
xsmall =

 c/a xbig
(2b)

The names “small” and “large” describe the absolute magnitude of the roots:

 |xsmall| ≤ |xbig|
(3)

The rationale behind equation 2 is easy to understand:

• In equation 2a when we add 1 to the square root, both the 1 and the square root are always positive. Therefore this step of the calculation cannot possibly suffer from a “small difference between large numbers” problem. This stands in contrast to equation 7, which does suffer from this problem when calculating the small root, especially when it is much smaller than the big root.
• Equation 2b works because we know that c/a is the product of the roots.
• Within the square root there is a subtraction that could possibly produce a small difference between large numbers, in situations where the two roots are very nearly equal. In other words, the parabola is very nearly tangent to the contour of constant y=0 (which is commonly called the x-axis). This is an inherently tricky situation. Small changes in the a, b, and c coefficients can change the roots from real to complex. At least our numerical methods don’t make it worse.

The fundamental issue here is the fact that in a computer, floating point numbers are subject to roundoff. Roughly speaking, the roundoff error is on the order of the “machine epsilon”, which is not zero. (Typically there is considerable roundoff error in the 16th decimal place, although this varies somewhat from machine to machine.) There are lots of seemingly-innocuous real-world situations where this matters.

As a general rule, when the terms have the same sign, the sum of two terms in the denominator (as in equation 2b or equation 14) is numerically better behaved than the difference of two terms the numerator (as in equation 7 or equation 13). Vastly better.

It is a good habit to use equation 2 always, to the exclusion of less-clever formulas such as equation 7. (You can always use more-clever formulas, for instance in special cases where some of the coefficients are zero). You can get away with using equation 7 in situations where you know the two roots are a complex-conjugate pair, or are real and close together, that is, in situations where the discriminant b2−4ac is either negative or small compared to b2.

Note: When b is zero, the original equation is trivial. It can be solved by inspection:

x =
±
 −c/a
(4)

Similarly, when a is zero, the original equation is trivial; it’s not even a quadratic. So in equation 2 we can safely assume that a and b are nonzero.

Figure 1 compares the smart formula (equation 2) with the not-so-smart formula (equation 7) over a range of conditions. The true xbig is between 1 and 8, such that log(xbig) is uniformly distributed. The true xsmall is between 10−14 and 10−19, such that log(xsmall) is uniformly distributed. There is a range of many orders of magnitude where equation 2 produces the correct answers, but equation 7 produces wildly incorrect answers for xsmall. Cases where the incorrect answer is zero cannot be properly plotted on log-log axes, but are qualitatively indicated by downward-pointing triangles.

## 2  A Real-World Example: pH

### 2.1  The Scenario

Let’s see what happens in a real-world equation. Here is an approximate equation that comes up in chemistry, when calculating the pH of an acid solution:

 [H+]2 + Ka [H+] − Ka CHA = 0 approximately
(5)

where

 Ka = acid dissociation constant CHA = concentration
(6)

First, let’s see what happens if we try to solve that using the not-so-smart “textbook” version of the quadratic formula:

x =

b ±
 b2 − 4ac
2a

(7)

where in this case the variables are:

 a = 1 b = Ka c = −Ka CHA x = [H+]
(8)

Let’s do a numerical example, in the case where the acid is strong but moderately dilute:

 Ka = 5.666×104 CHA = 10−5
(9)

Let’s assume we arrived at the Ka value for our hypothetical acid by taking the average of various estimates. On this bases we estimate the uncertainty of Ka to be ±1×104 or even more. The uncertainty in the concentration is negligible by comparison.

### 2.2  Not-So-Smart Calculation

Plugging the Ka and CHA numbers into equation 7, we get

x =
−½(5.666×104 ±
 3.210 3556×109 + 2.27
)
(10)

Now some people might decide on the basis of «common sense» that the number inside the square root could be rounded off to 3.210×109. The uncertainty is so large that the sig-figs rules require us to round this number to a single digit, so carrying three extra digits «should» be plenty, or so the story goes. So let’s try rounding off and see what happens when we continue the calculation:

x =
−½(5.666×104 ±
 3.210×109
)

= −½(5.666×104 ± 5.6657×104)
{x1x2} = {−5.666×104, −1.569}
(11)

which is just completely wrong. Both of the alleged roots of the quadratic are negative. It is physically impossible for the [H+] concentration to be negative.

Analysis: It turns out that the «common sense» roundoff leading to equation 11 was a disaster. In this situation:

• The sig-figs rules require rounding to a single digit.
• Carrying three extra digits is not enough.
• Carrying eight extra digits is not enough.
• You need to carry at least 10 digits (9 decimal places) if you want a result with acceptable accuracy ... unless you change the method of calculation.

 All too often, it is hard to distinguish common sense from common nonsense.

### 2.3  Smart Calculation

So now let’s switch to using the smart method. Equation 2 gives us:

 x1 = −5.666×104 x2 = 1.000×10−5
(12)

where x2 is the desired answer. Rounding of intermediate results does not cause any problems. (This x2 value should not be taken as gospel, because equation 5 is only an approximation. It neglects the properties of the solvent. However, it’s in the right ballpark.)

It should have been obvious from the get-go that a 10−5 molar solution of a chemically-strong acid will have a pH of about 5.

## 3  Analogous Situation: Simple Square Root

### 3.1  Rational Solution

Let’s consider the equation

q =
1 −
 1 − z
(13)

where z is small. This comes up in connection with the quadratic formula, and also in special relativity, as discussed in reference 1. Although equation 13 is just fine if you are doing algebra, it is grossly unsuitable if you want to evaluate it numerically. This is because of the infamous “small difference between large numbers” problem. You are much better off using equation 14 instead; it is algebraically exact and numerically well-behaved for all z≤1.

q =
z
1 +
 1 − z
(14)

The reasoning behind equation 14 is the same as the reasoning behind equation 2, as discussed in section 1.

### 3.2  Taylor Series

Let’s investigate another way of dealing with equation 13. You could expand the square root using a first-order Taylor series, namely:

 1−z
= 1 − ½ z + ⋯
(15)

whenever z is small compared to 1. This would give you a reasonably accurate answer if |z| is small enough. On the other hand, there is no real advantage to equation 15, because equation 14 is just as convenient and is less restricted.

If you want more accuracy than is provided by a first-order Taylor series, you should not assume that the best way forward is to use a higher-order Taylor series. Often there are other numerical methods that are better behaved. That is, they converge more quickly, giving higher accuracy with less work.

## 4  Analogous Situation: Spactime Kinetic Energy

For an interesting application of these ideas, see reference 1.

## 5  References

1.
John Denker,
“Spacetime Kinetic Energy – An Exercise in Numerical Methods”
www.av8n.com/physics/spacetime-kinetic-energy.htm
[Contents]