Data Provenance and Graph Interpretation

Contents

1 How to Read the Graphs
2 Objective
3 Caveats
4 Growth Rate in Logarithmic Units: cNp = centineper
5 Data Sources

1 How to Read the Graphs

Figure 1: US Coronavirus Daily Deaths

This diagram shows four main quantities of interest:

The cumulative death toll to date. The raw data is represented by the area under the curve.
The daily death toll, i.e. the number of deaths per day. This is the rate-of-change of the previous item, i.e. the rate at which people are dying. This is represented by the height of the curve, namely the circular points connected by lines.
An 8-parameter fit to the data. There is one parameter for each day of the week (seven in all), plus an overall parameter describing the exponential acceleration. The fit is shown by the shaded region. You can see that the fit agrees decently (albeit not perfectly) with the raw data.
The unmodulated fit, shown by the cyan (light blue) line. This is a pure exponential, treating every day of the week the same.
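As a rough illustration of what an 8-parameter fit of this kind can look like, here is a minimal sketch. It assumes ordinary least squares on the logarithm of the daily counts. The text does not say how the actual fit is computed, so the function name, the synthetic numbers, and the geometric-mean convention for the unmodulated curve are all assumptions made for this example.

```python
import numpy as np

def fit_weekday_exponential(daily_deaths, first_weekday=0):
    """Fit log(deaths) = log_level[weekday] + rate * t by least squares.

    Returns (log_levels, rate): 7 per-weekday log levels plus one growth
    rate (nepers per day), i.e. 8 parameters in total.
    """
    y = np.log(np.asarray(daily_deaths, dtype=float))  # needs counts > 0
    n = len(y)
    weekday = (first_weekday + np.arange(n)) % 7

    # Design matrix: 7 one-hot weekday columns, then one column for time.
    X = np.zeros((n, 8))
    X[np.arange(n), weekday] = 1.0
    X[:, 7] = np.arange(n, dtype=float)

    params, *_ = np.linalg.lstsq(X, y, rcond=None)
    return params[:7], params[7]

# Synthetic, noise-free check (made-up numbers, not real data).
true_rate = -0.02
true_levels = np.log([900.0, 600.0, 1100.0, 1200.0, 1150.0, 1100.0, 800.0])
days = np.arange(28)
synthetic = np.exp(true_levels[days % 7] + true_rate * days)

log_levels, rate = fit_weekday_exponential(synthetic)
print(f"fitted rate: {100 * rate:.2f} cNp/day")

# "Unmodulated" curve: same rate, with the weekday factors replaced by
# their geometric mean (one possible convention, assumed here).
unmodulated = np.exp(log_levels.mean() + rate * days)
```

With real, noisy counts a weighted or Poisson-likelihood fit may be more appropriate; plain least squares is used here only because it keeps the sketch short.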

Note: The just-mentioned items can be normalized in absolute terms (deaths) or in per-capita terms (deaths per million population). The logarithmic derivative is the same no matter what normalization you use.

The acceleration is the rate-of-change of the height of the curve (i.e. the daily death rate). In other words, it is the rate-of-change of the rate-of-change of the area under the curve (i.e. the cumulative death toll). This is calculated by fitting an exponential, so the acceleration reflects not absolute change but rather the relative change, which is very nearly the percentage change. In mathematical terms, the acceleration is the logarithmic derivative of the death rate. It is measured in cNp per day. (One cNp is very nearly one percent. See section 4 for an explanation.)
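To make the logarithmic derivative concrete, the following sketch computes the average acceleration between two daily rates, once in absolute terms and once per million population. The numbers and the function name are hypothetical, chosen only for illustration. The point is that the population factor cancels inside the logarithm.

```python
import math

def growth_cnp_per_day(rate_start, rate_end, days):
    """Average logarithmic derivative between two daily rates, in cNp/day."""
    return 100.0 * math.log(rate_end / rate_start) / days

# Hypothetical numbers, for illustration only.
population_millions = 330.0
deaths_day0, deaths_day7 = 1000.0, 900.0

absolute = growth_cnp_per_day(deaths_day0, deaths_day7, 7)
per_capita = growth_cnp_per_day(deaths_day0 / population_millions,
                                deaths_day7 / population_millions, 7)

print(f"{absolute:.3f} cNp/day (absolute)")
print(f"{per_capita:.3f} cNp/day (per million)")
```

Both lines print the same value, and the value is negative because the daily rate fell over the week.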

A negative acceleration is synonymous with a deceleration.

Curve-fitting is a powerful technique for averaging out the meaningless day-to-day variations. The fitted exponential is easier to interpret than the raw data, as discussed in section 3.2.

2 Objective

The main purpose in looking at graphs like this is to help decide what’s good policy and what’s bad policy. When doing so, keep in mind that (a) there are other sources of information (notably observations of countries that have been successful in suppressing the virus), and (b) this data is imperfect, as we now discuss.

3 Caveats

3.1 The Data is Not Timely

The data is seriously delayed, and arrives in batches. Specifically:

Deaths are a lagging indicator. After people are exposed, it takes them a while before they get sick enough to die.
Deaths are not promptly reported. Most deaths that have occurred in the last week have not been reported at all. This is even worse than noise, because it introduces bias into the data.
Reports are sketchy to nonexistent on weekends. On Mondays, we see the reports from Sunday, which are anomalously low. On Tuesdays and perhaps Wednesdays, the reporting plays catch-up.

3.2 The Data is Noisy

Imperfect data is better than no data. Everybody, scientists included, makes decisions based on imperfect data all day, every day. We should neither over-react nor under-react to the presence of noise in the data.

A certain amount of the noise is Poisson noise, which is inescapable because we are (thank heavens) working with smallish numbers.
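For a sense of scale: a Poisson-distributed count with mean N has a standard deviation of sqrt(N), so the relative fluctuation is 1/sqrt(N). The sketch below tabulates this for a few illustrative daily counts. The counts and the helper name are assumptions for the example, not values from the data.

```python
import math

def poisson_relative_noise(mean_count):
    """Relative standard deviation of a Poisson count: 1 / sqrt(mean)."""
    return 1.0 / math.sqrt(mean_count)

# Illustrative daily counts only.
for mean_count in (25, 100, 400, 1600):
    sigma = math.sqrt(mean_count)  # Poisson standard deviation
    percent = 100.0 * poisson_relative_noise(mean_count)
    print(f"mean {mean_count:5d}: +/- {sigma:5.1f} ({percent:.1f}%)")
```

Quadrupling the count only halves the relative noise, which is why small counties and small states look so much noisier than the national total.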

Fitting an exponential to the data is a very powerful way of averaging out the short-term noise. As long as public-health policy doesn’t change, and other “facts on the ground” don’t change, we expect the daily death rate to follow a more-or-less exponential trend. As a corollary, we expect the acceleration (i.e. the logarithmic derivative of the death rate) to be constant. Similarly, we expect the cumulative death toll to be an exponential offset by some constant.
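The last claim can be checked numerically. The sketch below builds a purely exponential daily series from hypothetical parameters, accumulates it, and confirms that the running total equals an exponential plus a constant offset. All three parameter values are made up for the example.

```python
import numpy as np

rate = 0.03           # hypothetical growth rate, nepers per day (3 cNp/day)
daily0 = 50.0         # hypothetical daily deaths on day 0
prior_total = 2000.0  # hypothetical deaths before day 0

days = np.arange(60)
daily = daily0 * np.exp(rate * days)
cumulative = prior_total + np.cumsum(daily)

# Summing a geometric series gives amplitude * exp(rate * t) + offset.
amplitude = daily0 * np.exp(rate) / np.expm1(rate)
offset = prior_total - daily0 / np.expm1(rate)
model = amplitude * np.exp(rate * days) + offset

print("max deviation from exponential-plus-offset:",
      np.max(np.abs(cumulative - model)))
```

Because of the offset, the logarithmic derivative of the cumulative toll is not constant even when that of the daily rate is. That is one reason to fit the daily rate rather than the cumulative total.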

3.3 Other Types of Data

Data on the number of “cases” is more timely than death data. However, since the beginning of the pandemic I have given very little weight to this, because finding “cases” relies on testing, and the testing is abysmally inadequate.
An infection doesn’t become a “case” until it is diagnosed.
I look forward to the day when the case load is small enough and the testing is vigorous enough that nearly all infections are detected, but we’re not there yet. We’re not even on a path to get there any time soon.
Data on current hospitalizations is more timely than death data. However, it contains a certain amount of arbitrariness. It will underestimate the problem at the worst possible moment, when there is not enough capacity to hospitalize everyone who needs to be hospitalized.

3.4 Extrapolate at Your Own Risk

You could kinda maybe sorta get away with extrapolating the nationwide trend back in early March, when there was more-or-less one big outbreak. But not any more. Now we have hundreds of smaller outbreaks. The only way to make sense of the situation is to model each outbreak separately, and then add the results.

Some of the outbreaks are large but under control, decreasing exponentially.
Some of the outbreaks are small (for now) but out of control, growing exponentially.
Some of the outbreaks are in a stalemate situation, with a near-zero growth rate, neither increasing nor decreasing significantly at the moment.

Let’s be clear: In general, it is difficult to extrapolate the nationwide or even statewide trends.

Partly that’s because there is some uncertainty (i.e. some variance) in the parameters of the best-fit exponential. This is unavoidable, because the data is noisy. This means that an extrapolation might mean something in the short run but not the long run.
Partly it’s because the simple exponential model is biased. It is not safe to assume we have a single exponential rather than a sum of exponentials. Therefore, unless circumstances change, unless the facts on the ground change, a simple exponential extrapolation is likely to be a best-case scenario.
When you have a sum of exponentials, you might have a contribution that starts out small but increasing. It will be negligible until suddenly it isn’t. In contrast, there is virtually never a fairy godmother that comes along and makes things better unexpectedly.
To repeat: There is usually only one way that things will work out better than a simple extrapolation would indicate (better outside the uncertainty band). That’s if there is a beneficial change in behavior, a change in treatment options, or the like.
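The bias is easy to see in a toy model. The sketch below adds one large, shrinking outbreak to one small, growing outbreak, fits a single exponential to the first four weeks of the total, and compares the extrapolation with what the sum actually does later. All amplitudes and rates are hypothetical, picked only to make the effect visible.

```python
import numpy as np

# Two hypothetical outbreaks (made-up numbers, not real data).
def total_daily_deaths(t):
    shrinking = 1000.0 * np.exp(-0.03 * t)  # large, under control
    growing = 20.0 * np.exp(+0.05 * t)      # small for now, out of control
    return shrinking + growing

# Fit one exponential to the first four weeks: log(deaths) = level + rate * t.
t_fit = np.arange(28, dtype=float)
fitted_rate, fitted_log_level = np.polyfit(
    t_fit, np.log(total_daily_deaths(t_fit)), 1)

t_future = 90.0
extrapolated = np.exp(fitted_log_level + fitted_rate * t_future)
actual = total_daily_deaths(t_future)

print(f"fitted rate: {100 * fitted_rate:+.2f} cNp/day")
print(f"day {t_future:.0f} extrapolated: {extrapolated:8.0f} per day")
print(f"day {t_future:.0f} actual:       {actual:8.0f} per day")
```

During the fitting window the total is falling, so the single-exponential fit predicts continued decline. By day 90 the growing outbreak dominates and the actual total is far above the extrapolation.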

4 Growth Rate in Logarithmic Units: cNp = centineper

One centineper (abbreviated cNp) corresponds to 1% when the changes are small. When the changes are large, centinepers behave much better. It’s the difference between compound interest and simple interest. A more detailed discussion with examples and graphs is available.
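As a small worked example, a change by a ratio R corresponds to 100 ln(R) centinepers. The sketch below converts a few ratios and shows the two properties mentioned above: agreement with percent for small changes, and well-behaved compounding for large ones. The helper names are chosen for this example.

```python
import math

def ratio_to_cnp(ratio):
    """Convert a multiplicative change to centinepers: 100 * ln(ratio)."""
    return 100.0 * math.log(ratio)

def cnp_to_ratio(cnp):
    """Convert centinepers back to a multiplicative change."""
    return math.exp(cnp / 100.0)

for label, ratio in [("+1%", 1.01), ("+10%", 1.10),
                     ("doubling", 2.0), ("halving", 0.5)]:
    print(f"{label:>8}: {ratio_to_cnp(ratio):+8.3f} cNp")

# Centinepers add when changes compound: a doubling followed by a
# halving sums to zero and returns to the starting value.
round_trip = cnp_to_ratio(ratio_to_cnp(2.0) + ratio_to_cnp(0.5))
print("round trip ratio:", round_trip)
```

In percentage terms, by contrast, a 100% rise is undone by a 50% fall, so equal and opposite percentages do not cancel.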

5 Data Sources

Global data (countries) and US nationwide data (states and counties) is curated by Hopkins CSSE. I download it from there.
Arizona data comes from scraping the AZDHS site every morning. (It is a few hours more up-to-date than the Hopkins repository.)
The visualization and interpretation are my own work.