Suppose we have a polynomial of the form
P(x) = a·x³ + b·x² + c·x + d        (1)
It is standard good practice to evaluate it using Horner’s method. That is, start out with a. Multiply by x and add b. Then multiply by x and add c. Finally multiply by x and add d. In other words:
P(x) = ((a·x + b)·x + c)·x + d        (2)
This requires only n multiplications, where n is the degree of the polynomial ... in contrast to direct evaluation of equation 1, which requires n(n+1)/2 multiplications. So Horner’s method results in a substantial speedup when n is large.
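As a sketch (the function name and coefficient layout are my own, not from the text), Horner’s rule for a general polynomial looks like:

```cpp
#include <cstddef>

// Evaluate c[0]*x^n + c[1]*x^(n-1) + ... + c[n] by Horner's rule:
// n multiplications and n additions for a degree-n polynomial.
double horner(const double* c, std::size_t degree, double x) {
    double r = c[0];
    for (std::size_t i = 1; i <= degree; ++i)
        r = r * x + c[i];
    return r;
}
```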
Perhaps more importantly, it results in better accuracy. Fewer operations means less accumulation of roundoff errors.
Many late-model desktop computers implement the fma instruction in hardware. That stands for Fused Multiply Add. So to a first approximation we can say:
fma(y, x, a) = y·x + a        (3)
and we can rewrite the Horner method as:
P(x) = fma(fma(fma(a, x, b), x, c), x, d)        (4)
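In C++ the fma form of Horner’s rule for a cubic can be sketched as follows (the function name is my own):

```cpp
#include <cmath>

// Horner's rule for a*x^3 + b*x^2 + c*x + d, one fused multiply-add per step.
double horner3_fma(double x, double a, double b, double c, double d) {
    return std::fma(std::fma(std::fma(a, x, b), x, c), x, d);
}
```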
Here’s where it gets interesting: fma has only about half as much roundoff error.
The explanation is simple: If the multiplication and addition were done separately, there would be two roundoff errors: (a) one after the multiplication, and (b) one after the addition.
In contrast, with fma, the result of the multiplication is never delivered to the user. There is no need to represent it in standard floating-point format, and in particular no need to round it. In fact, the multiplication is carried out with plenty of guard digits, so there is no roundoff error at step (a). The only roundoff occurs at step (b), after the addition, when the final result is delivered to the user.
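One way to see that the product inside fma is exact: compute an ordinary rounded product, then use fma to recover exactly the error that the rounding threw away. This is a standard trick (the function name is mine):

```cpp
#include <cmath>

// Recover the rounding error of a product: p = round(a*b), and
// fma(a, b, -p) computes a*b - p without rounding the product,
// so it returns the discarded error exactly.
double product_residual(double a, double b) {
    double p = a * b;
    return std::fma(a, b, -p);
}
```

For example, with a = b = 1 + 2⁻²⁷, the rounded product discards exactly 2⁻⁵⁴, and `product_residual` returns exactly that.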
As a possible side-benefit, it is sometimes true that the fma instruction is faster than doing the multiplication and addition separately. This is not necessarily the case, because modern cpus are smart about pipelining ... but sometimes it is true. Typically there is a modest speedup: somewhat faster, but not 2× faster.
The fma() function implemented by C++ does not map directly onto the fma hardware instruction. Instead it calls a library function which determines at runtime whether the hardware implements the instruction. If not, the library carries out the required calculation in software. It assumes you want the accuracy and you don’t care about speed. In fact, the last time I checked, the software emulation was excruciatingly slow, about 100× slower than plain old multiplication. There is no excuse for this. Efficient ways of emulating fma are known. However, it is what it is.
If you are sure your hardware supports fma, you can bypass the runtime library entirely. There is a compiler option that tells it to inline the hardware fma instructions. (If you try to run this on a machine without the fma instruction, your program will raise the illegal instruction signal.)
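For concreteness, here is what that looks like with GCC or Clang on x86-64 (flag names differ on other targets; `program.cpp` is a placeholder):

```shell
# Allow the compiler to emit hardware fma instructions directly:
g++ -O2 -mfma program.cpp

# Or target everything the build machine supports:
g++ -O2 -march=native program.cpp
```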
So here’s how I usually code it: For portability, I write FMA(y,x,a). Depending on the situation, that can be defined in terms of the hardware fma instruction, or defined in terms of plain old multiplication and addition:
```cpp
// if you care more about speed than utmost accuracy:
#ifdef FP_FAST_FMA
  #define FMA(y,x,a) fma(y,x,a)
#else
  #define FMA(y,x,a) ((y)*(x)+(a))
#endif
```
The fma idea works for dot products, too. See section 2.
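A sketch of the dot-product version, with one rounding per term instead of two (the function name is mine):

```cpp
#include <cmath>
#include <cstddef>

// Dot product accumulated with one fused multiply-add per element.
double dot_fma(const double* a, const double* b, std::size_t n) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        acc = std::fma(a[i], b[i], acc);
    return acc;
}
```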
Compensation means calculating how big the roundoff error must have been, and then correcting for it. For a single operation, that’s not very exciting, but when dealing with a high degree polynomial it’s a big deal.
The rule of thumb is that compensated fma using doubles has the same accuracy as regular uncompensated fma using double doubles. It’s slower than uncompensated fma, but dramatically faster than double doubles.^{1}
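Compensated fma-based Horner can be sketched with the two well-known error-free transformations: TwoProd via fma, and Knuth’s TwoSum. This is my sketch of the standard algorithm, not code from the text:

```cpp
#include <cmath>

// TwoProd via fma: a*b = p + e exactly.
static void two_prod(double a, double b, double& p, double& e) {
    p = a * b;
    e = std::fma(a, b, -p);          // exact residual of the product
}

// Knuth's TwoSum: a + b = s + e exactly, with no branches.
static void two_sum(double a, double b, double& s, double& e) {
    s = a + b;
    double bb = s - a;
    e = (a - (s - bb)) + (b - bb);
}

// Compensated Horner: run ordinary Horner, capture each step's rounding
// error exactly, propagate the errors in a side channel, and add them
// back at the end.  c[0] is the leading coefficient.
double horner_comp(const double* c, int degree, double x) {
    double r = c[0];
    double err = 0.0;
    for (int i = 1; i <= degree; ++i) {
        double p, pi, s, sigma;
        two_prod(r, x, p, pi);       // r*x    = p + pi, exactly
        two_sum(p, c[i], s, sigma);  // p+c[i] = s + sigma, exactly
        err = err * x + (pi + sigma);
        r = s;
    }
    return r + err;
}
```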
These instructions are older than dirt, and rather widely available.
Also rather widely available.
Suppose we want to evaluate the following definite integral:

I = ∫_{t₁}^{t₂} exp(−α·t) dt        (5)
Here is the “textbook” answer:^{2}^{,}^{3}
I = [exp(−α·t₁) − exp(−α·t₂)] / α        (6)
which is not the formula that I use in my software. There are *multiple* things wrong with it.
The question is perfectly well defined for α=0, with a nontrivial result, and doing the integration correctly is super-simple.
The case where α=0 is perfectly reasonable and can occur in the real world. Examples include:
If you are writing a routine to evaluate the definite integral, you would be wise to handle the case of α=0. I am surprised and disappointed that fancy mathematical systems like Wolfram Alpha and Maxima screw this up.
The lessons here are:
I = τ · exp(−α·m) · sinh(ατ/2)/(ατ/2),  where m = (t₁+t₂)/2 and τ = t₂ − t₁        (7)
This works pretty well; in contrast, the difference of exponentials is a disaster.
Sanity check: Equation 7 treats t_{1} and t_{2} on the same footing.
Sanity check: When α goes to zero, I goes to τ, as it should.
Sanity check: When τ goes to zero, I goes to zero, as it should.
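The contrast with the difference of exponentials can be seen numerically. Here is a sketch comparing the two forms of equation 7’s integral (function names and the test values are my own):

```cpp
#include <cmath>

// Stable evaluation of the integral of exp(-alpha*t) from t1 to t2,
// per equation 7 (assumes alpha and t2-t1 are nonzero here).
double bin_integral_stable(double alpha, double t1, double t2) {
    double tau = t2 - t1;
    double mid = 0.5 * (t1 + t2);
    double x = alpha * tau / 2.0;
    return std::exp(-alpha * mid) * tau * (std::sinh(x) / x);
}

// Naive difference of exponentials, equation 6: catastrophic
// cancellation when alpha*tau is small.
double bin_integral_naive(double alpha, double t1, double t2) {
    return (std::exp(-alpha * t1) - std::exp(-alpha * t2)) / alpha;
}
```

With a narrow bin such as t₁ = 1, t₂ = 1 + 10⁻⁹, α = 1, the naive form subtracts two nearly equal exponentials and loses roughly half its significant digits; the stable form does not.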
Calculating sinh(x)/x is probably good enough for all nonzero x, even when |x| is quite small, but we can do better. Since we have to make a special check for the case of x=0, we might as well be smart about x near zero.
Specifically: For x near zero, we can use the first two terms of the Taylor series.
sinh(x)/x ≈ 1 + x²/6  for |x| ≪ 1        (8)
We set the threshold at 10^{−4} for the following reasons:
```cpp
#include <cmath>

// numerically accurate sinh(arg)/arg
template<class T>
inline T sinhc(T arg) {
    if (std::abs(arg) < 1e-4) return 1 + arg*arg/6;
    return std::sinh(arg)/arg;
}

// ... inner loop ...
{
    double mid((t2 + t1) / 2.);
    double tau(t2 - t1);
    rslt += amp * std::exp(-rate*mid) * tau * sinhc(rate*tau/2.);
}
```
For example, suppose we want to construct models of the current public health emergency. The state and county have the timestamped data, but they won’t release it. Instead they grudgingly release data in bins of size one day, or sometimes two days or three days, depending on their whims. It would be easier for them and everybody else if they would just release the underlying data, but they won’t.
We have to deal with the data that’s available, but we don’t have to like it.
Similarly, if the exponent is hugely positive, the exponential will overflow and return “inf” (which does have a floating-point representation).
Both of these outcomes are probably just fine in practical cases, but if you want mathematical exactitude you can wrap equation 7 in code that checks for underflow or overflow and throws an exception. Or you can turn on the FPU traps and let the hardware do the checking for you. Beware: the hardware might raise exceptions from places you never thought to check.
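A sketch of the check-and-throw approach using standard C++ `<cfenv>` flags (the function name is mine; actually enabling hardware traps is platform-specific, e.g. glibc’s `feenableexcept`):

```cpp
#include <cfenv>
#include <cmath>
#include <stdexcept>

// Evaluate exp(x), throwing if the result overflowed or underflowed.
double checked_exp(double x) {
    std::feclearexcept(FE_OVERFLOW | FE_UNDERFLOW);
    double r = std::exp(x);
    if (std::fetestexcept(FE_OVERFLOW))
        throw std::overflow_error("exp overflowed to inf");
    if (std::fetestexcept(FE_UNDERFLOW))
        throw std::underflow_error("exp underflowed");
    return r;
}
```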