To solve quadratic equations, the formula in equation 1 is numerically well-behaved. As we shall see in figure 1, this version performs much better than the unsophisticated “textbook” version in equation 8, especially in situations where there might be one big root and one small root.
where the function sgnR(b) is defined according to equation 3. The names “small” and “large” describe the absolute magnitude of the roots:
The rationale behind equation 1 is easy to understand:
The root cause is that floating-point numbers in a computer are subject to roundoff. Roughly speaking, the roundoff error is on the order of the “machine epsilon”, which is not zero. There are lots of seemingly-innocuous real-world situations where this matters.
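To make “machine epsilon” concrete, here is a minimal Python sketch; the numbers assume IEEE-754 double precision, which is what Python floats use:

```python
# Roundoff in IEEE-754 double precision: the machine epsilon is the
# gap between 1.0 and the next representable number above it.
import sys

eps = sys.float_info.epsilon      # about 2.22e-16 for doubles

# Adding anything smaller than eps/2 to 1.0 is rounded away entirely,
# while adding eps itself is representable:
assert 1.0 + eps / 2 == 1.0
assert 1.0 + eps > 1.0
```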
As a general rule, when the terms have the same sign, the sum of two terms in the denominator (as in equation 1b) is numerically better behaved than the difference of two terms in the numerator (as in equation 8). Vastly better.
It is a good habit to use equation 1 always, to the exclusion of less-clever formulas such as equation 8 (except in the trivial case where both b and c are zero). You can get away with using equation 8 in situations where you know the two roots are a complex-conjugate pair, or are real and close together, that is, in situations where the discriminant b²−4ac is either negative or small compared to b².
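Since the equation displays are not reproduced here, the sketch below assumes equation 1 has the usual numerically stable form, xbig = (−b − sgnR(b)·√(b²−4ac))/(2a) and xsmall = 2c/(−b − sgnR(b)·√(b²−4ac)); the function and variable names are illustrative:

```python
import math

def sgnR(b):
    """Right-continuous signum: +1 for b >= 0, -1 for b < 0."""
    return 1.0 if b >= 0.0 else -1.0

def solve_quadratic(a, b, c):
    """Roots of a*x**2 + b*x + c = 0, assuming a != 0 and a
    non-negative discriminant.  Returns (x_big, x_small), where
    "big" and "small" refer to absolute magnitude."""
    q = -b - sgnR(b) * math.sqrt(b * b - 4.0 * a * c)
    # q is a sum of same-sign terms, so there is no cancellation.
    return q / (2.0 * a), 2.0 * c / q

# Example: x**2 - 5x + 6 = 0 has roots 3 and 2.
print(solve_quadratic(1.0, -5.0, 6.0))   # -> (3.0, 2.0)
```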
The function sgnR(b) is the right-continuous signum function, defined according to equation 3. The name “signum” is the Latin word for “sign” (as in positive sign or negative sign), but it is pronounced “sig-num”, so it doesn’t rhyme with “sine”.
You could equally well use the left-continuous signum function:
which would just interchange the roles of xbig and xsmall in situations where it doesn’t matter, i.e. in cases where they have the same magnitude, because b is zero. In such cases it’s hardly worth bothering with equation 1 anyway, since we can solve the quadratic by inspection; the solution is simply x = ±√(−c/a).
The plain old signum function is nice and symmetrical, but it must not be used in equation 1, since it doesn’t do what we want when b is zero:
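Here is a quick check of the b = 0 case, again assuming equation 1 has the form xbig = (−b − sgnR(b)·√(b²−4ac))/(2a), xsmall = 2c/(−b − sgnR(b)·√(b²−4ac)); the right-continuous signum keeps both roots, while the plain signum zeroes the denominator of xsmall:

```python
import math

def sgnR(b):   # right-continuous signum: sgnR(0) = +1
    return 1.0 if b >= 0.0 else -1.0

def sgn(b):    # plain signum: sgn(0) = 0 -- must NOT be used in eq. 1
    return (b > 0) - (b < 0)

a, b, c = 1.0, 0.0, -4.0            # x**2 - 4 = 0, roots +2 and -2
d = math.sqrt(b * b - 4 * a * c)    # = 4.0

q = -b - sgnR(b) * d                # = -4.0: equation 1 still works
print(q / (2 * a), 2 * c / q)       # -> -2.0 2.0

q_bad = -b - sgn(b) * d             # = 0.0: the plain signum puts a
assert q_bad == 0.0                 # zero in the denominator of x_small
```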
Figure 1 compares the smart formula (equation 1) with the not-so-smart formula (equation 8) over a range of conditions. The true xbig is between 1 and 8, such that log(xbig) is uniformly distributed. The true xsmall is between 10⁻¹⁴ and 10⁻¹⁹, such that log(xsmall) is uniformly distributed.
There is a range of many orders of magnitude where equation 1 produces the correct answers, but equation 8 produces wildly incorrect answers for xsmall. Cases where the incorrect answer is zero cannot be properly plotted on log-log axes, but are qualitatively indicated by downward-pointing triangles.
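The qualitative behavior in figure 1 is easy to reproduce. The sketch below again assumes the standard forms of equation 1 (stable) and equation 8 (textbook); the particular root values are illustrative, chosen only to have one big root and one small root:

```python
import math

big, small = 1.0e7, 1.0e-12        # illustrative true roots
a, b, c = 1.0, -(big + small), big * small

d = math.sqrt(b * b - 4 * a * c)

# Equation 8 (textbook): difference of two nearly equal terms.
x_small_textbook = (-b - d) / (2 * a)

# Equation 1 (stable): sum of same-sign terms in the denominator.
q = -b - math.copysign(1.0, b) * d
x_small_stable = 2 * c / q

print(x_small_textbook)   # -> 0.0 (the small root is lost entirely)
print(x_small_stable)     # -> 1e-12, correct to machine precision
```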
By way of contrast, let’s see what happens if we try to solve a real-world equation. Here is an equation that comes up in chemistry, when calculating the pH of an acid solution:
Let’s see what happens if we try to solve that using the “textbook” version of the quadratic formula.
where in this case the variables are:
Let’s do a numerical example, in the case where the acid is strong but moderately dilute:
We are talking about a hypothetical acid. Let’s assume we arrived at the Ka value by taking the average of various estimates. There is a huge amount of uncertainty in the resulting Ka value, easily ±1×10⁴ or even more. The uncertainty in the concentration is negligible by comparison. Plugging the Ka and CHA numbers into equation 8, we get
Now some people might decide on the basis of «common sense» that the number inside the square root could be rounded off to 3.210×10⁹. The uncertainty is so large that the sig-figs rules require us to round this number to a single digit, so carrying three extra digits «should» be plenty, or so the story goes. So let’s try rounding off and see what happens when we continue the calculation:
which is just completely wrong. Both of the alleged roots of the quadratic are negative. It is physically impossible for the [H+] concentration to be negative.
Analysis: It turns out that the «common sense» roundoff leading to equation 12 was a disaster. In this situation b² is enormous compared to |4ac|, so the information about the physically meaningful root is carried entirely by the low-order digits of the quantity inside the square root; rounding those digits away destroys the answer.
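The failure mode is easy to reproduce. The article’s specific Ka and CHA values are not reproduced here, so the sketch below uses hypothetical numbers of the same general flavor, and assumes the equilibrium quadratic has the form [H+]² + Ka·[H+] − Ka·CHA = 0:

```python
import math

# Hypothetical values, chosen only to illustrate the failure mode;
# they are not the article's numbers.
Ka  = 5.7e4      # equilibrium constant of a strong acid
CHA = 1.0e-3     # nominal concentration

a, b, c = 1.0, Ka, -Ka * CHA        # [H+]**2 + Ka*[H+] - Ka*CHA = 0

disc = b * b - 4 * a * c            # = 3249000228.0
H_full = (-b + math.sqrt(disc)) / 2 # about 1.0e-3, as expected

# "Common sense" roundoff: keep four significant figures inside
# the square root.
disc_rounded = 3.249e9
H_rounded = (-b + math.sqrt(disc_rounded)) / 2
print(H_rounded)                    # -> 0.0: the root is destroyed

# Equation 1 survives the same roundoff, because its denominator
# is a sum of same-sign terms (sgnR(b) = +1 here):
q = -b - math.sqrt(disc_rounded)
H_stable = 2 * c / q
print(H_stable)                     # -> about 1.0e-3
```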
Let’s consider the equation
where z is small. This comes up in connection with the quadratic formula, and also in special relativity, as discussed in reference 1. Although equation 13 is just fine if you are doing algebra, it is grossly unsuitable if you want to evaluate it numerically. This is because of the infamous “small difference between large numbers” problem. You are much better off using equation 14 instead; it is algebraically exact and numerically well-behaved for all z≤1.
The reasoning behind equation 14 is the same as the reasoning behind equation 1, as discussed in section 1.
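Equations 13 and 14 are not reproduced here, so the sketch below assumes equation 13 has the form f(z) = 1 − √(1−z) and equation 14 is the rationalized form f(z) = z/(1 + √(1−z)); the two are algebraically identical, but only the latter is numerically well-behaved for small z:

```python
import math

def f_naive(z):    # assumed form of eq. 13: a small difference
    return 1.0 - math.sqrt(1.0 - z)        # between large numbers

def f_stable(z):   # assumed form of eq. 14: algebraically equal,
    return z / (1.0 + math.sqrt(1.0 - z))  # well-behaved for z <= 1

z = 1.0e-17        # exact answer is 5.0e-18 to ten digits
print(f_naive(z))  # -> 0.0: 1 - z rounds to 1.0, the answer is gone
print(f_stable(z)) # -> 5e-18
```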
Let’s investigate another way of dealing with equation 13. You could expand the square root using a first-order Taylor series, namely:
whenever z is small compared to 1. This would give you a reasonably accurate answer if |z| is small enough. On the other hand, there is no real advantage to equation 15, because equation 14 is just as convenient and is less restricted.
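Continuing with the assumed form f(z) = 1 − √(1−z): the first-order Taylor expansion √(1−z) ≈ 1 − z/2 gives f(z) ≈ z/2 (the assumed form of equation 15), which is accurate only for small |z|, whereas the rationalized form is exact for all z ≤ 1:

```python
import math

def f_exact(z):    # assumed eq. 14: algebraically exact and stable
    return z / (1.0 + math.sqrt(1.0 - z))

def f_taylor(z):   # assumed eq. 15: first-order Taylor approximation
    return z / 2.0

for z in (1e-8, 1e-2, 0.5):
    rel_err = abs(f_taylor(z) - f_exact(z)) / f_exact(z)
    print(z, rel_err)
# The Taylor approximation degrades as z grows (roughly 15% off at
# z = 0.5), while the rationalized form has no such restriction.
```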
If you want more accuracy than is provided by a first-order Taylor series, you should not assume that the best way forward is to use a higher-order Taylor series. Often there are other numerical methods that are better behaved. That is, they converge more quickly, giving higher accuracy with less work.
For an interesting application of these ideas, see reference 1.