Generating a Random Distribution

John S. Denker

ABSTRACT: We discuss how to design a so-called Random Number Generator (RNG). More precisely, we are interested in producing a random distribution of symbols (numerical or otherwise). Such distributions are needed for a wide range of applications, including high-stakes adversarial applications such as cryptography and gaming. You can do a lot better using physics and algorithms together than you could with either one separately.

A fundamental building-block is a so-called Hardware Random Number Generator (HRNG), which closely approximates an ideal True Random Number Generator (TRNG). It starts with a raw signal derived from some fluctuating physical system. It then processes that signal to remove biases and correlations. The hash saturation principle – basically a pigeonhole argument – can be used to prove that the output has the desired statistical properties.

It is essential that the underlying physical system be calibrated. We require a strict lower bound on the amount of thermodynamic unpredictability, not merely an estimate or an upper bound. The lower bound must come from the laws of physics, since statistical tests provide only an upper bound. The details of the calibration process will vary from system to system. A remarkably cost-effective example uses standard audio I/O hardware, as discussed in reference 1.

The output of the HRNG can be used directly, or can be used to seed a so-called Pseudo Random Number Generator (PRNG).


## 1  Introduction; Basic Notions

There is a wide range of important applications that depend on being able to draw samples from a random distribution. See section 2.3 for an overview of some typical applications. Some of the things that can go wrong are discussed in section 6.1.

### 1.1  Outline of the Design

Here is one way to construct a so-called Hardware Random Number Generator (HRNG). We want it to closely approximate an ideal True Random Number Generator (TRNG) – suitable for a wide range of applications, including extremely demanding ones.

We start by introducing some key theoretical ideas:

• The HRNG contains some physical device that exhibits fluctuations. An example is the Johnson noise in an electronic circuit, as discussed in reference 1. This is measured to produce a raw input signal.
• We obtain a reliable lower bound on the raw input’s unpredictability (as defined in appendix B). The lower bound is based on fundamental physics principles plus measurements of the macroscopic properties of the hardware – robust, non-statistical, easily-measured properties. (This stands in stark contrast to other approaches, which obtain a loose upper bound based on statistical tests on the data.)
• We process the signal to remove biases and correlations. The result is an output that is very nearly 100% random, by the most conservative measure, as explained below. This is provably correct under mild assumptions.
• The HRNG does not require any persistent internal state and therefore does not require a seed.
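The processing step can be sketched in a few lines. This is a minimal illustration, not the real design: the function name and parameters are hypothetical, SHA-256 stands in for whatever cryptographically strong hash a real HRNG uses, and `os.urandom` merely simulates the calibrated raw signal.

```python
import hashlib
import os

def condition(raw_bytes: bytes, out_bits: int = 256) -> bytes:
    """Distill raw, possibly-biased samples into nearly-uniform bits.

    Sketch only: it assumes the caller has a calibrated lower bound on
    the adamance of raw_bytes, and feeds in comfortably more than
    out_bits worth of it (hash saturation).  SHA-256 stands in for
    whatever cryptographically strong hash the real design uses.
    """
    assert 0 < out_bits <= 256 and out_bits % 8 == 0
    return hashlib.sha256(raw_bytes).digest()[: out_bits // 8]

# os.urandom() merely simulates the raw physical signal for this demo:
raw = os.urandom(128)   # far more input adamance than output size
out = condition(raw)
print(len(out) * 8, "bits")
```

Note that the conditioner is stateless: each output block depends only on the raw input fed to it, which is what makes the no-persistent-state property in the last bullet possible.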

### 1.2  How to Define Randomness, or Not

It must be emphasized that there is no such thing as a random number.

• If it’s random, it’s not a number.
• If it’s a number, it’s not random.
• You can have a random distribution over numbers, but then the randomness is in the distribution, not in any particular number that might have been drawn from such a distribution.

Figure 1 illustrates this idea. Consider the point at the center of the small black circle. The point itself has no uncertainty, no width, no error bars, and no entropy. The point could have been drawn from the red distribution or the blue distribution. You have no way of knowing. The red distribution has some width and some entropy. The blue distribution has some other width and entropy.

Figure 1: One Point, Two Distributions

Terminology: Therefore a so-called «random number generator» (RNG) must not be thought of as a generator of random numbers; rather, it is a random generator of numbers. Furthermore, it is often useful to generate a random distribution over non-numerical symbols, in which case the term RNG should be interpreted to mean randomness generator or something like that.

Terminology: If you ask three different people, you might get six or seven different definitions of “random”.

At a minimum, we need to consider the following ideas, which are diagrammed in figure 2.

1. In one corner of the parameter space, we can imagine perfect randomness, by which I mean 100% unguessability, such as a source that produces 8 bits of unguessability in each 8-bit byte. There is no pattern whatsoever. Our notion of unguessability is formalized and quantified in section 1.7. It is defined in terms of statistics, without reference to any notion of computational feasibility.
2. In another corner of the parameter space, we can imagine near-perfect pseudo-randomness, which means that there is a pattern, but it is computationally infeasible for the adversary to discover the pattern. For a wide range of purposes – but not all – the adversary finds this to be just as unpredictable as true, perfect randomness.
3. In yet another corner, we can imagine perfect determinism, i.e. no randomness at all, such as a source that produces an endless string of bytes all the same, or perhaps successive digits of π. The pattern is obvious to everybody, including our adversary.
4. In still another corner, there is something I call squish, which is neither reliably predictable nor reliably unpredictable.

For example: Once upon a time I was working on a project that required a random distribution. The pointy-haired boss directed me to use the saved program counter at each interrupt. I had to explain to him that it was an event-driven system, and therefore the saved PC was almost always a constant, namely the address of the null job. Of course it was not reliably a constant, but it was also not reliably unpredictable. There was no positive lower bound on the amount of randomness. In other words, it was squish.

5. We can have combinations of the above. For example, there could be a distribution over 32-bit words having a few bits of unpredictability in each word, with the other bits being deterministic and/or squish. This is represented by the enormous shaded area in figure 2.

Figure 2: Predictability, Unpredictability, and Squish

The terminology is not well settled. I have heard intelligent experts call items 1, 2, 4, and 5 on the previous list “random”. I have also heard them call items 3, 4, and 5 “non-random” or “not really random”. Sometimes even item 2 is included in the category of “not really random”.

• Black-versus-white extremism is unhelpful. I’ve never seen a system that was «perfectly» random, based on physics alone. On the other edge of the same sword, it is infamously impossible to have a system that is «perfectly» pseudo-random, based on algorithms alone, as von Neumann pointed out in 1947. As discussed in section 1.3, you can do a lot better with physics and algorithms together than you can with either one separately.

As you can see in figure 2, there is a continuum of possibilities along the line from ideal TRNG to PRNG ... and a lot of other possibilities besides.

• Something that is “random enough” for one application might be not nearly random enough for another application. It is always good to ask What’s Your Threat Model (WYTM) and on the other side of the same coin, What’s Your Use Case (WYUC).
• We consider the terms «random» and «non-random» to be vague, clumsy, and prone to misunderstanding. We will use such terms only when the details aren’t important.
• For careful work, the best approach is to focus attention on the distribution. We can use the language of probability and statistics to characterize the distribution. We don’t need to know every detail of the distribution; generally a few numbers suffice to tell us what we need to know. For example:
• We would like the samples to be Independent and Identically Distributed (IID). They are never completely so, but that’s OK if we have a low-enough upper bound on the correlations.
• The entropy is one way of summarizing the distribution. It tells us the average unpredictability.
• There are also the Rényi functionals and other ways of measuring the unpredictability, as discussed in section 1.7.

### 1.3  Physics, Computation, or Both

Figure 3 is a close-up, top-down view of the rightmost part of figure 2.

Figure 3: Three Types of RNG

Let’s discuss the three labeled points: A, B and C.

Point A in the diagram represents a hardware-only RNG. An example is a roulette wheel; a few other examples are mentioned in section 1.5. No fancy computer algorithms are involved.

Such things can be made good enough for some purposes. The house take for an ideal roulette wheel is a few percent. If the wheel is skewed by more than that, it is a disaster for the house, as discussed in reference 2. In contrast, a small amount of skew (perhaps a few parts per million) is not worth fixing. Indeed the house will sometimes intentionally give away a percent or two for marketing purposes; in Las Vegas, for example, high-stakes players may be offered a single-zero wheel, or offered partage. From the player’s point of view, a tiny skew in the wheel statistics is not worth exploiting; it would cost a fortune to measure the statistics sufficiently carefully, and doing so would detract from whatever entertainment value the game offers.

Alas, a device that is good enough for gambling may not be good enough for cryptography. Imagine using a roulette wheel to directly generate one-time pads. If you send a large number of messages with such a system, it is possible that your adversary may be able to exploit the statistical flaws.

You could try to make a better hardware-only RNG, moving rightward from point A in the diagram ... but in many cases it is cheaper and better to move diagonally, from point A to point B.

Point B in the diagram represents a hybrid system, relying on physics for most of its unpredictability, but using a cryptographically-strong hash function to hide whatever flaws the hardware has. There is a pattern to the output, but it is computationally infeasible for the attacker to discover the pattern.
Point C in the diagram is similar to point B, except that the system relies on algorithms for most of its unpredictability.

In practice point B differs from point C by an enormous amount, far more than is suggested by the diagram. Point B may rely on algorithms for only 10^−5 of its randomness, while point C may rely on physics for only 10^−9 of its randomness ... so in this scenario the two RNGs differ by 14 orders of magnitude. That’s not infinite, but it’s a lot. It may be enough to cause the attacker to choose a different line of attack in the two cases.

It must be emphasized that an «ideal» PRNG cannot possibly exist. Any real PRNG depends on an internal state that must be initialized; that is, the PRNG needs a seed. That leaves us with a question: where does the seed come from? If it comes from another PRNG, that merely pushes the problem upstream without solving it: where did the upstream PRNG get its seed? The only way to avoid infinite regress is to use physics, i.e. to use a hardware RNG to provide the seed.

 “Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin. For, as has been pointed out several times, there is no such thing as a random number – there are only methods to produce random numbers, and a strict arithmetic procedure of course is not such a method.” – John von Neumann (reference 3)

Let’s be clear about the contrast:

 It is possible to have a RNG that relies completely on physics without needing any cryptologically-strong algorithms. Example: Roulette wheel. It is absolutely not possible to have a RNG that relies completely on algorithms without needing any physical source of randomness. The seed has to come from somewhere.

However, it would be difficult if not impossible to build a hardware-only RNG that is good enough for the most demanding applications. You can get more strength, higher data rates, and lower cost by using a hybrid approach, using both thermodynamics and cryptography.

 You can do better using physics and algorithms together than you could using either one separately.
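The seeding step can be made concrete with a short sketch. The details here are illustrative, not normative: `os.urandom` stands in for a hardware RNG (on most systems the operating system’s pool is fed by physical noise sources), and Python’s Mersenne Twister stands in for the downstream algorithm, even though it is emphatically not cryptographically strong.

```python
import os
import random

# The seed must ultimately come from physics.  os.urandom() draws from
# the operating system's randomness pool, which (on most systems) is
# fed by hardware noise sources.
seed = os.urandom(32)

# Mersenne Twister is shown purely for illustration; it is NOT
# cryptographically strong.  The point is only that the seed must come
# from outside the algorithm.
prng = random.Random(seed)
print(prng.randrange(2 ** 32))
```

The same structure appears in every hybrid RNG: a physical source terminates the regress, and an algorithm stretches the seed into a long output stream.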

### 1.4  Strengths and Weaknesses

1. The great weakness of any PRNG is its internal state variable. If an adversary ever manages to capture a copy of my PRNG (including the internal state), then he can predict the future outputs of my PRNG, until the next time it is re-seeded.
2. For a badly-designed PRNG, if the state is captured, the adversary can ascertain previous outputs, not just future outputs. (A well-designed PRNG uses a strong one-way algorithm to prevent this.)
3. A HRNG can be built in such way that it does not depend on any persistent state. There is nothing for the adversary to capture.
4. A hardware-only HRNG may be random enough for some purposes but not others.
5. The rate at which a RNG produces randomly-distributed symbols is typically slower for a hardware-only HRNG than for a RNG that relies to some extent on computational-feasibility arguments.

This sometimes limits the usefulness of the HRNG. This can be very significant, because on many systems the peak demand for randomly-distributed symbols is very spiky. That is, the peak demand is many orders of magnitude greater than the average demand.

One seemingly-obvious way of dealing with the spiky demand is to buffer the output of the HRNG. However, this undercuts one of the main selling points of the HRNG, since the buffer constitutes a stored state that the adversary can capture. So once again we find that the difference between a practical HRNG and a PRNG is a matter of degree, not a categorical difference.

6. Any PRNG is “easy” to break in the NP sense, and if broken it is completely broken, and it stays broken until the next time it gets re-seeded. The usefulness of the PRNG depends on the assumption that it is computationally infeasible for any adversary to infer the internal state by observing the outputs. (More precisely, the assumption is that the adversary will find such an attack to be not worth the trouble. This depends to some degree on the value of whatever the pseudo-random distribution is protecting.)
7. A decent HRNG cannot be completely broken using NP work or any other amount of work. It might leak “some” information. A strong hybrid RNG will leak much less than one bit in each 256-bit word. Most of the word will remain completely unpredictable.
8. Every PRNG and hybrid RNG depends on assuming the existence of a one-way function that is resistant to cryptanalytic attack.

If the RNG is being used for a cryptographic application, you can presumably always use the same crypto primitives in the RNG that you use in the application. Then you don’t need to worry too much about cryptologic attacks on the RNG, because anything that breaks the RNG will also break the application directly, even worse than it breaks the RNG.

To be clear: There are standard techniques for using a block cipher to create a cryptologically-strong hash function. This is one way of providing the one-way property that the RNG requires.

9. To say the same thing the other way, if you use different primitives in the RNG, it essentially doubles the attack surface. There is the distinct possibility that the attacker can subvert the RNG and thereby break the system as a whole, even if the application crypto could not otherwise have been broken. See for example reference 4 and reference 5.

### 1.5  Hardware-Only HRNGs

There exist a number of “old-school” hardware-only methods for generating random distributions. This includes

• Tossing a coin.
• Rolling dice.
• Shuffling cards.
• Blindly drawing a token out of a well-stirred urn. The so-called “powerball” machines use a variation of this idea.
• Dropping a marble into a roulette wheel or pachinko machine.
• et cetera.

Such methods are supposed to derive their randomness directly from physics, without computing any cryptographically-strong algorithms – indeed without using any computer at all.

A “powerball” machine is supposed to be tamper resistant, but in practice the tamper-resistance is far from perfect; see reference 6. A skilled cardsharp can perform a controlled shuffle that is not random at all; see reference 7. Similarly there is a controlled coin toss; see reference 8.

Even when the hardware has not been subjected to tampering, it might not be random enough for the purpose; see reference 8 and reference 2.

### 1.6  Surprisal

Given a distribution P, the ith possible outcome is observed with probability Pi.

We define the surprisal aka surprise value of the ith possible outcome to be:

 $i := log(1/Pi)              (1)
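In code, the surprisal is a one-liner. A small sketch (the function name is mine; logs are taken base 2, so the answer comes out in bits):

```python
import math

def surprisal(p: float) -> float:
    """Surprisal $ = log(1/p), in bits, of an outcome with probability p."""
    return math.log2(1.0 / p)

print(surprisal(0.5))      # a fair coin flip: 1 bit
print(surprisal(1 / 256))  # one symbol from a uniform byte: 8 bits
```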

### 1.7  Characterizing the Distribution

Given a distribution, there are lots of things we might want to know about it. There is not any one “best” tool for characterizing a distribution. The available tools can be used in combination, and the details depend on what we’re trying to accomplish.

For any distribution P, the available tools include the family of Rényi functionals, denoted Hα[P], where α is called the order of the Rényi functional. The meaning is particularly clear for three special values of α:

 H0[P] = log(number of possible symbols) = log( ∑_i 1 )    (log multiplicity)    (2a)

 S[P] = H1[P] = ∑_i Pi log(1/Pi) = ⟨$⟩    (entropy)    (2b)

 H∞[P] = Min_i log(1/Pi) = Min $    (2c)

where the sums and the Min operations run over all symbols that can be drawn from the distribution with nonzero probability.

The formal, fully-general definition of the Rényi functionals is given in section 1.9.

Remarks:

• H0 is sometimes called the Hartley functional.
• H1 is the plain old entropy, S. It is sometimes called the Boltzmann functional or, equivalently, the Shannon functional. It is by definition the ensemble average of the surprisal, specifically a weighted average, even more specifically the expectation value of the surprisal.
• H∞ is sometimes called the min-entropy, but this is at best a misnomer, for reasons discussed in section 1.9. It is better to call it the adamance, meaning roughly “hardness”, as in hard to guess. We shall be particularly interested in the adamance, as discussed below.
• For any α, the Rényi functional Hα[...] is sometimes called a “generalized” entropy. However, I prefer to reserve the term entropy for the plain old entropy, H1. The Rényi functionals that don’t have more specific names can be called simply Rényi functionals.
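The three special cases are easy to compute directly. A sketch, with the distribution as a plain Python list of probabilities (the function names are mine, chosen to match the terminology above):

```python
import math

def log_multiplicity(p):   # H0: log of the number of possible symbols
    return math.log2(sum(1 for pi in p if pi > 0))

def entropy(p):            # H1: the expectation value of the surprisal
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def adamance(p):           # H∞: the surprisal of the most probable symbol
    return math.log2(1 / max(p))

uniform = [1 / 8] * 8
print(log_multiplicity(uniform), entropy(uniform), adamance(uniform))
```

For a uniform distribution (as here, over 8 symbols) all three functionals agree; the interesting cases below are the skewed distributions where they differ.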

Some example distributions are shown in figure 4 and figure 5.

Figure 4: Slightly Skewed Distribution
The distribution in figure 4 has the following properties:

 H0[P] = 6 bits            (3a)
 S[P] = H1[P] = 5.9 bits   (3b)
 H∞[P] = 5.39 bits         (3c)

Figure 5: Highly Skewed Distribution

Figure 6: More Highly Skewed Distribution

The distribution in figure 5 has the following properties:

 H0[P] = 6 bits          (4a)
 S[P] = H1[P] = 4 bits   (4b)
 H∞[P] = 1.0 bit         (4c)

The distribution in figure 6 has the following properties:

 H0[P] = 8 bits              (5a)
 S[P] = H1[P] = 2.371 bits   (5b)
 H∞[P] = 0.331 bit           (5c)

The relatively simple distributions in figure 5 and figure 6 can be generalized in various ways, including the following: Suppose there are N codes. One of them has probability q, and the remaining codes evenly share the remaining probability, so that they have probability (1−q)/(N−1) apiece. It is then easy to show that the distribution has the following properties:

 H0[P] = log(N)    (6a)

 S[P] = H1[P] = q log(1/q) + (1−q) log(1/(1−q)) + (1−q) log(N−1)    (6b)

 H∞[P] = log(1/q)  or  log((N−1)/(1−q)),  whichever is less    (6c)

The following table shows some illustrative values. We consider the case where the symbols are 256-bit words, so N = 2^256. We examine various values of q, and plug into equation 6.

 q       H0     S          H∞
 0.5     256    129.0000    1.0
 10^−3   256    255.7554   10.0
 10^−6   256    255.9998   19.9

Table 1: Properties of Skewed Distributions (256-bit Words)

This shows that the plain old entropy really doesn’t tell you everything you need to know. In the last line in the table, the entropy is very nearly as large as it possibly could be, yet the outcome is predictable one time out of a million. That’s about 70 orders of magnitude more predictable than it would be if the distribution were truly random.
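The table values can be reproduced by plugging into equation 6. A sketch (the function name is mine; all logs base 2, so the results are in bits):

```python
import math

def skewed_properties(n_bits, q):
    """H0, S, and H∞ (in bits) for the distribution of equation 6:
    one code with probability q, the other N-1 codes sharing the
    remaining 1-q evenly, where N = 2**n_bits."""
    N = 2 ** n_bits
    h0 = math.log2(N)
    s = (q * math.log2(1 / q)
         + (1 - q) * math.log2(1 / (1 - q))
         + (1 - q) * math.log2(N - 1))
    h_inf = min(math.log2(1 / q), math.log2((N - 1) / (1 - q)))
    return h0, s, h_inf

for q in (0.5, 1e-3, 1e-6):
    h0, s, h_inf = skewed_properties(256, q)
    print(f"q={q:g}  H0={h0:.0f}  S={s:.4f}  Hinf={h_inf:.1f}")
```

Running this reproduces table 1 line by line: as q shrinks, the entropy S saturates toward 256 bits while the adamance H∞ remains small.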

The same idea is illustrated in figure 7 (for a 16 bit word) and figure 8 (for a 256 bit word). Knowing that the entropy is very high is not sufficient to guarantee that the outcome is unpredictable.

 Figure 7: Skewed Distribution : 16 Bit Word Figure 8: Skewed Distribution : 256 Bit Word

### 1.8  Infinite Entropy

An even more extreme example is shown in figure 9. The distribution has infinite entropy. This is arranged using techniques described in reference 9.

Figure 9: Distribution with Infinite Entropy (h=2.0)

In more detail, the distribution in figure 9 has the following properties:

 H0[P] = ∞             (7a)
 S[P] = H1[P] = ∞      (7b)
 H∞[P] = 1.95 bits     (7c)

This serves as one more warning that the entropy is not always a good measure of resistance to guessing.

### 1.9  Properties of the Rényi Functionals

The entire family of Rényi functionals can be defined as follows, for any order from α=0 to α=∞:

 Hα[P] := (1/(1−α)) log( ∑_i Pi^α )    (8)

Again, the sum runs over all symbols that can be drawn from the distribution with nonzero probability. Note: In the important special case α=1, you can use l’Hôpital’s rule to get from equation 8 to equation 2b.
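Equation 8 translates directly into code, treating α=1 and α=∞ as the limiting cases. A sketch (the function name is mine):

```python
import math

def renyi(p, alpha):
    """Rényi functional Hα[P] in bits (equation 8).  alpha=1 is the
    entropy (the l'Hôpital limit); alpha=inf is the adamance."""
    p = [pi for pi in p if pi > 0]
    if alpha == 1:
        return sum(pi * math.log2(1 / pi) for pi in p)
    if alpha == math.inf:
        return math.log2(1 / max(p))
    return math.log2(sum(pi ** alpha for pi in p)) / (1 - alpha)

dist = [0.5, 0.25, 0.125, 0.125]
for a in (0, 0.5, 1, 2, math.inf):
    print(a, renyi(dist, a))
```

For this skewed distribution the values decrease monotonically with α, from H0 = 2 bits down to H∞ = 1 bit, in accordance with the inequalities below.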

It is easy to prove the following inequalities:

 H∞ ≤ H1 ≤ H0    (9)

That means that when H∞ approaches its maximum possible value, the entropy is also large, and all three measures are very nearly equal. However, the reverse is not true. When the entropy is large, even when it is very nearly equal to H0, the H∞ value might be quite low ... as we see in table 1.

 High entropy, even saturated high entropy, does not imply high unpredictability.

It is also easy to prove that the Rényi functional (of any given order) is additive when we combine independent probabilities. That is:

 Hα[P] = Hα[Q] + Hα[R]    (10)

for any α, provided the distributions Q and R are statistically independent, where

 P = P[x, y] = joint probability
 Q = P[x]    = marginal probability
 R = P[y]    = the other marginal probability
    (11)

By definition, “independent” means that the probabilities are multiplicative:

 P[x, y](xi, yj) = P[x](xi) × P[y](yj)    (12)

for all i and j. It is easy to prove that if the probabilities are multiplicative the Rényi functional is additive, using little more than the definition (equation 8) plus the distributive law.

We shall make heavy use of this result. In particular, for a string of symbols, where successive symbols are independent and identically distributed (IID), the entropy of the string grows in proportion to the length of the string, and the adamance also grows in proportion to the length of the string.
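Additivity is easy to check numerically. A sketch: build a joint distribution from two independent marginals, so the probabilities are multiplicative per equation 12, then compare Hα of the joint against the sum of the marginals’ Hα (the `renyi` helper here is mine, covering finite α ≠ 1):

```python
import math

def renyi(p, alpha):
    """Rényi functional Hα[P] in bits, for finite alpha != 1."""
    return math.log2(sum(pi ** alpha for pi in p if pi > 0)) / (1 - alpha)

q = [0.7, 0.3]           # marginal P[x]
r = [0.5, 0.25, 0.25]    # marginal P[y]
joint = [qi * rj for qi in q for rj in r]   # independence: multiplicative

for alpha in (0, 0.5, 2, 3):
    lhs = renyi(joint, alpha)
    rhs = renyi(q, alpha) + renyi(r, alpha)
    assert abs(lhs - rhs) < 1e-9
print("additivity holds for each alpha tested")
```

The same check fails for correlated distributions, which is exactly why the IID assumption matters when we add up the adamance of a string of symbols.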

The adamance H∞[P] is a minimax measure of the guessability, since it measures how hard it is to guess the most easily guessable symbol that could be drawn from the distribution.

 (Since H∞[P] is a minimax measure, calling it the «min-entropy» would be a misnomer of the kind that leads to misconceptions. There are lots of other concepts that have at least as good a claim to represent some kind of “min” entropy.)
• If you want to guarantee that the randomness generator is hard to attack, you want it to have a large value of H∞[P], i.e. a large adamance.
• In contrast, if you want to guarantee that the RNG is easy to attack, you want it to have a small value of H0[P], i.e. a small log multiplicity.
• If you want to “split the difference”, you can consider the entropy, i.e. H1[P], which tells us something about the typical case or the average case.

To say the same thing another way:

• Adamance is relevant when the attackers can conduct a massively parallel attack, especially if they can claim victory after breaking only one of many messages.
• In contrast, entropy might be more relevant if the attackers need to break all messages, or break a collection of “typical” messages.

To appreciate the distinction between entropy and adamance, refer again to figure 9. It has infinite entropy but finite adamance. In some sense virtually all of the codewords are infinitely hard to guess, but the most-guessable codewords are not very hard at all.

In figure 9, the adamance is limited by the large “soft spots” i.e. low-surprisal codewords. You can increase the adamance somewhat by deleting the most-probable codewords and renormalizing what’s left, but the results are disappointing. You can delete any finite set of codewords and the distribution will still have infinite entropy and finite adamance.

The concept of entropy as used in classical cryptology and information theory is exactly the same thing as the concept of entropy as used in classical physics. In other words, the Shannon entropy is exactly the same thing as the Boltzmann entropy. Checking the equivalence is simple in some cases and complicated in others, but it has been checked.

In the unlikely event that you are dealing with entangled Schrödinger cat states, a fancier definition of entropy is required ... same idea, more general formula. Even then, the cryptological entropy remains the same as the physics entropy. Unsurprisingly, you need to upgrade both sides of the equivalence.

In all cases, entropy is defined in terms of probability. You can dream up lots of different probability distributions. If you use an unphysical distribution, you will get an unphysical value for the entropy.

Entropy is the answer to some interesting and important questions. It is not, however, the answer to all the world’s questions.

### 1.12  Conditional Randomness

Adamance and entropy are defined in terms of probability. In the same way, we can define conditional adamance and conditional entropy, defining them in terms of conditional probability. The same idea applies to other types of random distributions, including pseudo-random distributions. For more on this, see section 6.4.

### 1.13  Sample versus Distribution

As discussed in section 1.2, there is a world of difference between a distribution and a symbol that may have been drawn from such a distribution.

The surprisal is by definition a property of the symbol. In contrast, the multiplicity, entropy, and adamance are properties of the distribution.

The only way to define entropy or adamance for a single symbol is to define a new distribution containing only that symbol, in which case the entropy and the adamance are both zero. In physical terms: if you thoroughly shuffle a deck in the usual way, the entropy and the adamance are both log(52!). However, if you shuffle it and look at the result afterward, then the adamance and the entropy are both zero.
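The deck-of-cards number is easy to compute. A quick sketch:

```python
import math

# A thoroughly shuffled 52-card deck has 52! equally likely orderings,
# so the entropy and the adamance are both log(52!):
bits = math.log2(math.factorial(52))
print(round(bits, 1))   # about 225.6 bits
```

That is why a single honest shuffle is such a rich source of randomness, and also why looking at the cards afterward collapses both numbers to zero.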

## 2  Goals and Non-Goals

In this section, we give a broad outline of what is possible and what is not possible; a more detailed discussion of objectives and tradeoffs can be found in section 2.2.

The basic objective is to provide a random distribution of symbols, with quality and quantity suitable for a wide range of applications.

### 2.1  Use Case and Threat Models

In the interest of clarity, let’s consider a specific use case and a couple of specific threat models. (Some additional applications are listed in section 2.3. Some additional attacks are discussed in section 6.1 and section 6.3.)

Imagine that we are sending encrypted messages. The RNG is used to create an initialization vector for each message. There are two cases to consider:

• Suppose we send a large number of messages, and the attacker is required to break all the messages, or at least a set of “typical” messages. Then the plain old entropy is a good measure. It reflects the average work per message.
• Suppose we send a great many messages, and the attacker can succeed by breaking any one of them. The attacker can apply his best guess to each message, and then move on to the next message. He never needs to bother with his second-best guess. To defend against this, we need to pursue a minimax strategy. That is, we need to make sure that even our weakest message is hard to guess ... not just the average message. In this case, the adamance H is a more appropriate measure of the quality of the distribution.

Here are some of the criteria on which a random generator can be judged:

1. Efficient use of CPU cycles.
2. Efficient use of the available raw entropy and adamance.
3. Ability to meet short-term peak demand.
4. Resistance to outside attack, i.e. cryptanalytic attack on the outputs.
5. Anti-backtracking. That is, if the attacker captures the internal state of the RNG, this does not allow him to determine previous outputs. We cannot possibly prevent him from predicting future outputs (unless and until the RNG is reseeded) but we need not and should not let him have the previous outputs.

In particular, hypothetically, an encrypted counter would make a fine PRNG if we disregarded the anti-backtracking requirement.
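For illustration, here is a sketch of that counter idea, using SHA-256 of (key, counter) in place of an actual block cipher; the structural point about anti-backtracking is the same either way. The class name is hypothetical.

```python
import hashlib

class CounterPRNG:
    """Hypothetical hashed-counter generator.  It illustrates the
    weakness described in the text: anyone who captures (key, counter)
    can regenerate not only all future outputs but also all previous
    ones, simply by counting backward."""
    def __init__(self, key: bytes):
        self.key = key
        self.counter = 0

    def next_block(self) -> bytes:
        out = hashlib.sha256(
            self.key + self.counter.to_bytes(8, "big")).digest()
        self.counter += 1
        return out

g = CounterPRNG(b"demo key")
a, b = g.next_block(), g.next_block()

# An attacker who captures the key can replay the whole history:
leaked = CounterPRNG(b"demo key")
assert leaked.next_block() == a
```

A design with anti-backtracking would instead destroy the information needed to recompute earlier outputs, e.g. by overwriting the state with a one-way function of itself after each output.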

6. Rapid recovery from compromise, i.e. from capture of the internal state. The state might be captured by cryptanalysis of the outputs, or (more commonly) by bypassing the crypto and peeking at the state variables.

This is a tricky requirement. In some sense rapid recovery is desirable, but it is not the main requirement, and there are limits to what can be accomplished. It’s like a seatbelt in an airliner. It protects you against injury during routine turbulence, but if the aircraft explodes in midair the seatbelt is not going to do you a bit of good. Spending precious resources on «better» seatbelts would make the overall system less safe. The point is, the primary defense against compromise is to design the overall system so that compromises are extraordinarily rare.

Secondarily, it may be that cryptanalysis allows the attacker to ascertain one or two bits of the internal state. The RNG should make it computationally infeasible for the attacker to exploit this information, and the RNG should be re-seeded often enough to ensure that this information is lost before it accumulates to any significant level.

Let’s be clear: Constantly re-seeding the PRNG is a bad idea. It wastes CPU cycles, and more importantly it wastes the randomness that is being taken from the upstream HRNG that is providing the seed material. If you are constantly worried about compromise, the solution is not more re-seeding; the solution is to redesign the system to make it more resistant to compromise.

In particular, if an attacker can cause the PRNG to re-seed itself frequently, it becomes a denial-of-service attack against the upstream HRNG.

These criteria conflict in complex ways:

• An ideal TRNG has no internal state, so it recovers immediately from an inside attack. On the other hand, it has limited ability to respond to peak demand.
• Adding a fifo buffer to the output of the TRNG greatly improves its response to peak demand, but now there is considerable internal state that can be captured by an inside attack.
• There exist “hybrid” designs where the TRNG output buffer is combined with a PRNG. The idea is to provide a measure of error concealment, by making it computationally infeasible for the attacker to gain much benefit from capturing the internal state. In other words, the supposed TRNG functions temporarily as a PRNG. This does not improve its ability to return to TRNG functionality, but it may confer some advantage in practice, following a compromise. On the other hand, it comes at a cost in terms of extra CPU cycles during normal (non-compromised) operation.
• Re-seeding a PRNG extra-often can greatly accelerate the recovery from compromise. However, it comes at a cost in terms of extra consumption of CPU cycles and extra consumption of entropy during normal (non-compromised) operation.

These conflicts and tradeoffs leave us with more questions than answers. There are as-yet unanswered questions about the threat model, the cost of CPU cycles, the cost of obtaining raw entropy, et cetera. The answers will vary wildly from machine to machine.

### 2.3  Applications of Random Distributions

Here is a list of typical applications where some degree of randomness is needed, followed by a discussion of the demands such applications place on the random generator:

1. Monte Carlo integration, for solution of problems in physics, decision analysis, et cetera.
2. Simulated annealing, e.g. for solution of combinatorial optimization problems.
3. Assigning members to the “test” and “control” groups in a statistical study.
4. Implementation of games and gambling, such as deciding the winner of a lottery.
5. Decision-making for business, military, and gaming applications, such as choosing what lottery number to play.
6. Cryptography.
7. Generating seeds for Pseudo-Random Generators.

The first three items require a random distribution with good statistical uniformity and independence, but do not usually require resistance against cryptanalytic attack. The molecules in a Monte Carlo simulation are not going to attack the Random Generator. Similarly, the shuffle used in a low-stakes game of Crazy Eights does not require a cryptographically strong RNG, since nobody is going to bother attacking it.

As the stakes get higher, the demands on the RNG get stricter. Implementing a multi-million dollar lottery requires the RNG to have excellent tamper-resistance as well as good statistical properties.

Game players need randomness, too. In the simple scissors/paper/stone game, choosing a move at random is the minimax strategy. More sophisticated games like poker require a mixed strategy, i.e. a mixture of deterministic and non-deterministic considerations. You must study the cards, study the other players, and sometimes make an unpredictable decision about whether to bluff or not.

Cryptography is far and away the predominant high-volume consumer of high-grade entropy. This is what motivated our search for better generators. Even a smallish secure-communications gateway can draw millions of bits from the random distribution in a one-minute period.

Item #7 requires special comment: Suppose you need to initialize the internal state in a PRNG. As discussed in section 1.3, at some point this requires a HRNG.

## 3  High-Quality Random Generator

### 3.1  Overview

Figure 10 shows a block diagram of a High-Quality Random Generator. The symbols it generates are very nearly as random as possible.

Figure 10: High-Quality Random Generator

The starting point for an HRNG is the unpredictability associated with one or more physical processes, such as radioactive decay or thermal noise. Such physical processes are partially random, in the sense that they contain contributions which are believed to be extraordinarily difficult for anyone to predict, perturb, or replay.

The raw symbols (as sampled from the data-acquisition apparatus) will, in general, be partly predictable and partly unpredictable, as discussed in section 6.4. The raw symbols are essentially 100% resistant to the attacks that plague PRNGs. That’s because those attacks are directed toward capturing the internal state, and the HRNG doesn’t have any internal state worth capturing.

### 3.2  Linear Stage

Let’s consider an example. As always, the raw data comes from a physical process that produces some randomness. For simplicity, let’s toss a bent coin. It comes up heads 90% of the time and tails 10% of the time.

We consider one toss to be one symbol. We can encode such a symbol using one bit, or using 7-bit ASCII, or using an 8-bit byte, or using a 32-bit word, or whatever. It doesn’t matter, because the multiplicity is 2 possibilities per symbol. In more detail:

 H∞ = 0.152 bits per symbol = adamance
 H1 = 0.469 bits per symbol = entropy
 H0 = 1 bit per symbol = log multiplicity
(13)

Next, consider a string consisting of three symbols from the same source. There will be 8 possible strings. We assume the symbols are independent and identically distributed (IID). We can characterize the statistics of the strings as follows:

 H∞ = 0.456 bits per string = adamance
 H1 = 1.407 bits per string = entropy
 H0 = 3 bits per string = log multiplicity
(14)

Similarly for a string of seven symbols, there will be 128 possible strings. In more detail:

 H∞ = 1.064 bits per string = adamance
 H1 = 3.283 bits per string = entropy
 H0 = 7 bits per string = log multiplicity
(15)

Even finer details of the calculation are shown in the following table:

Throughout, P[heads] = 0.9 and P[tails] = 0.1, and k counts the number of tails in the string.

For n = 1 symbol per string (2 possible strings):

| | k=0 | k=1 | per string | per symbol |
|---|---|---|---|---|
| multiplicity (n choose k) | 1 | 1 | | |
| prob per state | 0.9 | 0.1 | | |
| prob per level (total: 1) | 0.9 | 0.1 | | |
| log improb | 0.152 | 3.32 | | |
| adamance | | | 0.152 | 0.152 |
| entropy | | | 0.469 | 0.469 |
| hartley | | | 1 | 1 |

For n = 3 symbols per string (8 possible strings):

| | k=0 | k=1 | k=2 | k=3 | per string | per symbol |
|---|---|---|---|---|---|---|
| multiplicity (n choose k) | 1 | 3 | 3 | 1 | | |
| prob per state | 0.729 | 0.081 | 0.009 | 0.001 | | |
| prob per level (total: 1) | 0.729 | 0.243 | 0.027 | 0.001 | | |
| log improb | 0.456 | 3.63 | 6.8 | 9.97 | | |
| adamance | | | | | 0.456 | 0.152 |
| entropy | | | | | 1.41 | 0.469 |
| hartley | | | | | 3 | 1 |

For n = 7 symbols per string (128 possible strings):

| | k=0 | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 | k=7 | per string | per symbol |
|---|---|---|---|---|---|---|---|---|---|---|
| multiplicity (n choose k) | 1 | 7 | 21 | 35 | 35 | 21 | 7 | 1 | | |
| prob per state | 0.478 | 0.0531 | 0.0059 | 0.000656 | 7.29e-05 | 8.1e-06 | 9e-07 | 1e-07 | | |
| prob per level (total: 1) | 0.478 | 0.372 | 0.124 | 0.023 | 0.00255 | 0.00017 | 6.3e-06 | 1e-07 | | |
| log improb | 1.06 | 4.23 | 7.4 | 10.6 | 13.7 | 16.9 | 20.1 | 23.3 | | |
| adamance | | | | | | | | | 1.06 | 0.152 |
| entropy | | | | | | | | | 3.28 | 0.469 |
| hartley | | | | | | | | | 7 | 1 |
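The per-string numbers above are easy to reproduce with a short calculation. Here is a minimal sketch (an illustration for this discussion, not part of turbid) that computes the adamance, entropy, and log multiplicity for n IID tosses of the bent coin:

```python
import math

def coin_stats(p_heads, n):
    """Adamance (H-infinity), entropy (H1), and log multiplicity (H0),
    in bits per string, for n IID tosses of a coin with P[heads] = p_heads."""
    p_max = max(p_heads, 1.0 - p_heads) ** n        # probability of the most probable string
    adamance = -math.log2(p_max)                    # set by the worst (most probable) case
    per_toss = -(p_heads * math.log2(p_heads)
                 + (1.0 - p_heads) * math.log2(1.0 - p_heads))
    entropy = n * per_toss                          # entropy is additive for IID symbols
    log_mult = n                                    # log2 of the 2**n possible strings
    return adamance, entropy, log_mult

for n in (1, 3, 7):
    print(n, coin_stats(0.9, n))
```

Running this reproduces equations 13 through 15: the adamance grows by only 0.152 bits per bent-coin toss, even though the entropy grows by 0.469 bits per toss.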

### 3.3  Simplification: Adamance = Entropy

Note that the seven-symbol strings have approximately one bit of adamance per string.

At this point we have the opportunity to introduce a remarkable simplification: We replace the strings. Rather than using seven tosses of a bent coin, we use a single toss of a fair coin. We can describe this by saying the bent coin is a bare symbol, while the fair coin is a dressed symbol. Seven bare symbols are replaced by one dressed symbol. The dressed symbols have the nice property that the adamance, entropy, and log multiplicity are all the same, namely one bit per dressed symbol.

This means that you can (to a limited extent) use your intuition about entropy to help understand what’s going on. Also it allows us to use counting arguments rather than adding up analog quantities.

### 3.4  Tail Risk

Suppose we have a source of raw data where there are 17 possible strings, all equally probable. Then the adamance, entropy, and log multiplicity of the ensemble of codewords are all the same, namely 4.1 bits, i.e. log2(17).

Now suppose we run these strings through a hash function. We start by considering a hash function that can only put out 169 possible codewords. This is very different from a real-world hash function that puts out 2^256 different codewords, but it serves as a useful warm-up exercise.

Here is a map of all 169 codewords, as a 13×13 array, with one entry per codeword. The words that are actually produced, in response to the 17 raw data strings, are marked with a 1. The adamance, entropy, and log multiplicity of the ensemble of codewords are all the same, as given in equation 16.

 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 1 0 1 1 0 0 1 1 0 0
 1 0 0 0 0 1 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 0 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 1 1 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0

 Raw Data = 4.09 bits
 Hartley = 4.09 bits
 Entropy = 4.09 bits
 Adamance = 4.09 bits
(16)

Here is another example of the same sort of thing, but using either a slightly different hash function, or a slightly different ensemble of raw data strings. This time there is a hash collision. That is to say, there are two different raw data strings that hash to the same codeword. This is indicated by a boldface 2 in the array. We say that 15 of the cells are singly occupied, while one of the cells is doubly occupied. The details are given in equation 17.

 0 0 0 1 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 1 0 0 0
 0 1 0 0 0 1 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 0 0 0 0 0
 2 1 1 0 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0

 Raw Data = 4.09 bits
 Hartley = 4 bits
 Entropy = 3.97 bits
 Adamance = 3.09 bits
(17)

In the two previous examples, we say that the fill factor was very nearly 0.1, because we have 17 raw data strings hashed into 169 cells.

We now replace the raw data source with something that produces more adamance, namely 169 possible strings, all equally probable. That corresponds to 7.4 bits. The hash map is below. The details are given in equation 18.

 2 1 2 2 1 2 1 3 1 2 0 0 1
 1 1 0 1 2 0 1 3 0 0 0 0 4
 1 0 2 1 0 0 0 0 0 1 0 0 0
 1 1 0 2 1 1 2 1 0 2 0 1 0
 1 3 0 1 0 0 3 0 0 1 1 0 0
 1 1 1 1 2 1 2 0 1 3 1 0 1
 1 1 0 3 0 2 1 3 1 2 5 0 0
 0 0 0 0 1 1 0 0 2 1 2 3 1
 0 3 0 0 0 1 0 1 2 0 1 2 1
 1 0 1 0 0 1 1 0 1 1 0 0 1
 1 0 0 1 0 1 0 1 1 1 0 1 1
 2 2 2 1 2 2 2 0 1 0 1 2 3
 0 1 0 1 3 4 2 1 1 0 3 4 0

 Raw Data = 7.4 bits
 Hartley = 6.73 bits
 Entropy = 6.54 bits
 Adamance = 5.08 bits
(18)

In all such maps, the entries have units of pips. One pip is equal to the probability of one raw data string. In the last example, one pip is 1/169th of the total probability.
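The Hartley, entropy, and adamance figures quoted for these maps can be computed directly from the occupancy counts. Here is a minimal sketch (illustrative, not turbid code), where `occupancy` is the list of per-cell counts in pips and `n_strings` is the total number of raw data strings:

```python
import math

def map_stats(occupancy, n_strings):
    """Statistics of the hash-output ensemble, given per-cell occupancy
    counts in pips (one pip = probability of one raw data string)."""
    probs = [c / n_strings for c in occupancy if c > 0]
    hartley = math.log2(len(probs))          # log2 of the number of occupied cells
    entropy = -sum(p * math.log2(p) for p in probs)
    adamance = -math.log2(max(probs))        # set by the most probable cell
    return hartley, entropy, adamance

# The map of equation 17: 15 singly-occupied cells plus one doubly-occupied cell
print(map_stats([1] * 15 + [2], 17))
```

With 17 singly-occupied cells this gives the 4.09/4.09/4.09 bits of equation 16; with one hash collision it gives the 4/3.97/3.09 bits of equation 17, showing how a single collision costs a full bit of adamance.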

We now move up to a fill factor of 100.

 102 97 100 121 107 77 97 93 103 90 107 107 93
 90 117 93 97 100 98 99 112 91 105 99 106 102
 88 109 92 103 110 94 105 94 104 101 97 107 113
 94 110 93 109 87 105 108 94 87 92 110 109 90
 90 111 108 102 90 88 106 107 106 100 92 89 102
 95 89 93 120 90 104 104 98 95 112 100 105 101
 100 79 106 119 98 93 83 99 115 114 85 99 116
 92 94 99 111 105 99 91 105 109 85 92 99 92
 112 92 92 109 101 109 105 107 127 95 105 106 91
 119 99 97 96 105 104 126 119 98 99 90 98 91
 90 100 104 111 99 94 107 83 93 119 90 101 96
 133 117 103 80 100 110 108 86 95 84 87 95 97
 85 95 88 86 107 82 119 86 91 111 109 96 87

 Raw Data = 14 bits
 Hartley = 7.4 bits
 Entropy = 7.39 bits
 Adamance = 6.99 bits
(19)

There will always be outliers, although distant outliers will be very improbable. Let’s set a bound of 2^−r on the tail risk. Later we will set r to some huge number, like 256. That means it is far more likely that your computer will be destroyed by a meteor than for this approximation to cause trouble.

We can model the structure of the map by saying that each cell very nearly follows a binomial distribution. For large fill factors, which is the case we care about, this is well approximated by a Gaussian. The mean is exactly equal to the fill factor Φ, while the standard deviation is √Φ.

The tail risk is given by

 risk = 2^−r
      = 1 − [½ + ½ erf((x−µ)/(σ√2))]
      = ½ − ½ erf((x−µ)/(σ√2))
      = ½ erfc((x−µ)/(σ√2))
(20)

So we can write

 e^−m := 2/2^r = erfc((x−µ)/(σ√2))
 m = −ln(erfc(y))
(21)

where we have introduced

 y := (x−µ)/(σ√2)
 m = (r−1) ln(2)
(22)

The inverse relationship is:

 y = erfc^−1(e^−m)
(23)

For m less than 25, we can evaluate equation 23 directly. The function is shown in figure 11.

Figure 11: Inverse Error Function of Exponentially Small Argument

For m greater than 12, we can approximate the erfc to better than 0.1% via:

 erfc(y) = 2 exp(−y²) / [√π (y + √(y² + 4/π))]
(24)

Taking the logarithm, we get:

 ln(erfc(y)) = −y² + ln(2/√π) − ln(y + √(y² + 4/π))
(25)

We can approximately solve for y in terms of m:

 y1 = √m
 yi+1 = √[m + ln(2/√π) − ln(yi + √(yi² + 4/π))]
(26)

You can see from figure 11 that the y1 approximation is fairly decent. The next iteration (y2) is good enough for all practical purposes.

Combining some more:

 x − µ = y σ √2   (measured in pips)
 x = µ + y σ √2
   = Φ + y √(2Φ)
   = Φ (1 + y √(2/Φ))
(27)

where we have made use of the fill factor Φ:

 Φ = 2^A / 2^W   i.e.   Φ 2^−A = 2^−W
(28)

where A is the adamance (in bits) of the raw data, and W is the width (in bits) of the hash codeword.

 Px = x 2^−A   (normalized as a probability)
    = (Φ/2^A) (1 + y √(2/Φ))
    = 2^−W (1 + y √(2/Φ))
(29)

where y is still considered known as a function of m, hence as a function of r, i.e. as a function of the risk we have decided to tolerate. It’s not a very big number. When the risk is 2^−256, i.e. r=256, y is only about 13.3.

So ...

 log2(1/Px) = −log2[2^−W (1 + y √(2/Φ))]
            = W − log2(1 + y √(2/Φ))
            = W − ln(1 + y √(2/Φ)) / ln(2)
            ≈ W − y √(2/Φ) / ln(2)
(30)

where W is the best the RNG could possibly do. Plugging in an estimate for y gives us

 log2(1/Px) ≈ W − [2 (r−1) / (Φ ln(2))]^½

 deficit = [2 (r−1) log2(e) / Φ]^½
(31)

Solve for Φ in terms of the deficit:

 Φ = 2 (r−1) log2(e) / deficit²
(32)

If we say we can tolerate a deficit of 0.01 bits, that means there is 255.99 bits of adamance in a 256-bit word. That means the hash-function codeword has 99.996% of the maximum possible adamance. This can be achieved using a fill-factor of about 7.33 million.

Note that 8 million is about 23 bits, so the raw data string needs to have an adamance of at least 256 + 23 = 279 bits. This means the RNG is making reasonably efficient use of the available randomness.

Furthermore, unless the hash function is spectacularly broken, we expect it to be computationally infeasible for the attacker to find or exploit the missing 0.01 bits.

If you add another 7 bits of adamance to the raw data string, it increases the fill-factor by a factor of 128, which knocks the deficit down by an order of magnitude.
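Equations 26 and 32 are easy to evaluate numerically. The following sketch (an illustration for this discussion, not turbid’s implementation) reproduces the numbers quoted above:

```python
import math

def y_estimate(r, iterations=2):
    """Solve m = -ln(erfc(y)) for y, with m = (r-1) ln 2, using the
    iteration of equation 26; y1 = sqrt(m) is the first approximation."""
    m = (r - 1) * math.log(2.0)
    y = math.sqrt(m)
    for _ in range(iterations - 1):
        y = math.sqrt(m + math.log(2.0 / math.sqrt(math.pi))
                        - math.log(y + math.sqrt(y * y + 4.0 / math.pi)))
    return y

def fill_factor(r, deficit):
    """Fill factor needed so the adamance deficit stays below `deficit`
    bits, at tail risk 2**-r (equation 32)."""
    return 2.0 * (r - 1) * math.log2(math.e) / deficit ** 2

print(y_estimate(256))           # about 13, cf. the y ~ 13.3 quoted in the text
print(fill_factor(256, 0.01))    # in the vicinity of 7 million, cf. the text
```

Note how forgiving the fill-factor formula is: it depends only linearly on r, and the deficit shrinks like 1/√Φ.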

## 4  The Role of Measurement and Testing

### 4.1  Issues of Principle; Cat and Mouse

By way of analogy, let us consider the relationship between code-breakers and code-makers. This is a complex ever-changing contest. The code-breakers make progress now and then, but the code-makers make progress, too. It is like an “arms race” but much less symmetric, so we prefer the term cat-and-mouse game.

At present, by all accounts, cryptographers enjoy a very lopsided advantage in this game.

There is an analogous relationship between those who make PRNGs and those who offer tools to test for randomness. The PRNG has a hidden pattern. Supposedly the tester “wants” the pattern to be found, while the PRNG-maker doesn’t.

We remark that in addition to the various programs that bill themselves as randomness-testers, any off-the-shelf compression routine can be used as a test: If the data is compressible, it isn’t random.
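For instance, a general-purpose compressor such as zlib gives a crude test along these lines. This is a sketch, not a recommended tool; as section 4.2 explains, failing the test proves non-randomness, while passing it proves nothing:

```python
import os
import zlib

def looks_incompressible(data: bytes) -> bool:
    """Necessary-but-not-sufficient check: if the data compresses,
    it is certainly not random; if it doesn't, we have learned very little."""
    return len(zlib.compress(data, 9)) >= len(data)

print(looks_incompressible(os.urandom(4096)))  # True (almost surely)
print(looks_incompressible(b"A" * 4096))       # False: grossly compressible
```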

To deepen our understanding of the testing issue, let’s consider the following scenario: Suppose I establish a web server that puts out pseudo-random bytes. The underlying PRNG is very simple, namely a counter strongly encrypted with a key of my choosing. Each weekend, I choose a new key and reveal the old key.

The funny thing about this scenario is the difference between last week’s PRNG and this week’s PRNG. Specifically: this week’s PRNG will pass any empirical tests for randomness that my adversary cares to run, while last week’s PRNG can easily be shown to be highly non-random, by means of a simple test involving the now-known key.

As a modification of the previous scenario, suppose that each weekend I release a set of one billion keys, such that the key to last week’s PRNG is somewhere in that set. In this scenario, last week’s PRNG can still be shown to be highly non-random, but the test will be very laborious, involving on the order of a billion decryptions.

Note that to be effective against this PRNG, the randomness-testing program will need to release a new version each week. Each version will need to contain the billions of keys needed for checking whether my PRNG-server is “random” or not. This makes a mockery of the idea that there could ever be an all-purpose randomness tester. Even a tester that claims to be “universal” cannot be universal in any practical sense.1 An all-purpose randomness tester would be tantamount to an automatic all-purpose encryption-breaking machine.

### 4.2  Necessary but Not Sufficient

To paraphrase Dijkstra: Measurement can prove the absence of entropy, but it cannot prove the presence of entropy. More specifically, a “randomness testing” program will give an upper bound on the entropy density of a random generator, but what we need is a lower bound, which is something else entirely.

The turbid calibration procedure calculates a lower bound on the adamance density, based on a few macroscopic physical properties of the hardware. It is entirely appropriate to measure these macroscopic properties, and to remeasure them occasionally to verify that the hardware hasn’t failed. This provides a lower bound on the entropy, which is vastly more valuable than a statistically-measured upper bound.

As discussed in section 8.5, tests such as Diehard and Maurer’s Universal Statistical Test are far from sufficient to prove the correctness of turbid. They provide upper bounds, whereas we need a lower bound.

When applied to the raw data (at the input to the hash function) such tests report nothing more than the obvious fact that the raw data is not 100% random. When applied to the output of the hash function, they are unable to find patterns, even when the adamance density is 0%, a far cry from the standard (100%) we have set for ourselves.

A related point: Suppose you suspect a hitherto-undetected weakness in turbid. There are only a few possibilities:

• If you think there is a problem with the hash function that turbid uses, it would make more sense to attack the hash function by conventional means, including artfully chosen inputs, as opposed to haphazardly probing it with whatever raw data is coming off the data-acquisition system.
• If you think the problem is upstream of the hash function, you should instrument the code and look there (at or before the input to the hash function). Running the allegedly problematic data through the hash function and looking at the output is like looking through the wrong end of a really long telescope; it greatly reduces your ability to see anything of interest.
• Probably the best chance of getting a non-null result from a “randomness test” is if there is a gross bug in the control logic, such as forgetting to write the result to the output. For this to work, you need to turn off the error concealment as discussed in section 5.

The foregoing is a special case of a more general rule: it is hard to discover a pattern unless you have a pretty good idea what you’re looking for. Anything that could automatically discover general patterns would be a universal automatic code-breaker, and there is no reason to believe such a thing will exist anytime soon.

There are some tests that make sense. For instance:

• The program checks the supposed LSB (Least Significant Bit) of the raw data to make sure it is in fact significant – not stuck or easily predictable from the higher bits.
• If you don’t believe the theoretical arguments, you could test the theory by (temporarily!) substituting a crippled hash function (producing, say, 8-bit hashcodes rather than 160-bit hashcodes). Then the statistical tests might have a chance of detecting problems upstream of the hash function, e.g. a miscalculation of the entropy density of the raw data.

On the one hand, there is nothing wrong with making measurements, if you know what you’re doing. On the other hand, people have gotten into trouble, again and again, by measuring an upper bound on the entropy and mistaking it for a lower bound.

The raison d’etre of turbid is that it provides a reliable lower bound on the adamance density, relying on physics, not relying on empirical upper bounds.

### 4.3  Actual Measurement Results

We ran Maurer’s Universal Statistical Test a few times on the output of turbid. We also ran Diehard. No problems were detected. This is totally unsurprising, and it must be emphasized that we are not touting this as a serious selling point for turbid; we hold turbid to an incomparably higher standard. As discussed in section 4.2, we consider it necessary but not sufficient for a high-entropy random generator to be able to pass such tests.

### 4.4  Summary

To summarize this subsection: At runtime, turbid makes specific checks for common failures. As discussed in section 4.3, we occasionally but not routinely apply general-purpose tests to the output.

We believe non-specific tests are very unlikely to detect deficiencies in the raw data (except the grossest of deficiencies), because the hash function conceals a multitude of sins. You, the user, are welcome to apply whatever additional tests you like; who knows, you might catch an error.

## 5  Error Detection versus Error Concealment

### 5.1  Basics

We must consider the possibility that something might go wrong with our high entropy random generator. For example, the front-end transistor in the sound card might get damaged, losing its gain (partially or completely). Or there could be a hardware or software bug in the computer that performs the hashing and control functions. We start by considering the case where the failure is detected. (The other case is considered in section 5.2.)

At this point, there are two major options:

• Throttling: We can make the appropriate (partial or complete) reduction in the rate at which turbid emits symbols, so that whatever symbols are emitted continue to have the advertised high adamance density.
• Concealment: We can conceal the error and continue to emit symbols at the same old rate. This means that turbid has been degraded from a high-entropy random generator to some kind of SRNG or PRNG.

We can ornament either of those options by printing an error message somewhere, but experience indicates that such error messages tend to be ignored.

If the throttling option is selected, you might want to have multiple independent generators, so that if one is down, you can rely on the other(s).

The choice of throttling versus concealment depends on the application. There are plenty of high-grade applications where concealment would not be appropriate, since it is tantamount to hoping that your adversary won’t notice the degradation of your random generator.

In any case, there needs to be an option for turning off error concealment, since it interferes with measurement and testing as described in section 4.

### 5.2  Combining Generators

We can also try to defend against undetected flaws in the system. Someone could make a cryptanalytic breakthrough, revealing a flaw in the hash algorithm. Or there could be a hardware failure that is undetected by our quality-assurance system.

One option would be to build two independent instances of the generator (using different hardware and different hash algorithms) and combine the outputs. The combining function could be yet another hash function, or something simple like XOR.
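A sketch of the simple combining step, assuming the two generators produce equal-length blocks; XOR preserves the unpredictability of the stronger input, provided the two generators are independent:

```python
def combine(a: bytes, b: bytes) -> bytes:
    """XOR-combine the outputs of two independent generators. If either
    input is unpredictable, so is the result (given independence)."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))

print(combine(b"\x0f\xf0", b"\xff\x00").hex())  # f0f0
```

The independence requirement is what does the work here: if the two generators share hardware or seed material, XOR provides no such guarantee.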

### 5.3  Hashing and/or Whitening

Figure 10 can be simplified to

 Source → Digitizer → Hash                              (33)

A number of commentators have suggested that the HRNG needs to use a cipher such as DES to perform a “whitening” step, perhaps

 Source → Digitizer → Hash → Whitener              (34)

or perhaps simply

 Source → Digitizer → Whitener              (35)

We consider such whitening schemes to be useless or worse. It’s not entirely clear what problem they are supposed to solve.

Specifically: There are only a few possibilities:

• If the hash function is cryptologically strong, its output is as unpredictable as it possibly could be, and adding an additional encryption step is just a waste of CPU cycles.
• If the hash function is wasting entropy (due to unnecessary hash collisions), then scheme 34 is a defective Random Generator. The whitener can at best conceal (not eliminate) the defects, and this is not even a very good way to perform error-concealment.

In general, people are accustomed to achieving reliability using the belt-and-suspenders approach, but that only works if the contributions are in parallel, not in series. A chain is only as strong as its weakest link.

In this case, it is a fallacy to think that a whitener can compensate for a weakness in the hash function. As an extreme example, consider the simplest possible hash function, one that just computes the parity of its input. This qualifies as a hash function, because it takes an arbitrary-length input and produces a constant-length output. Now suppose the data-acquisition system produces an endless sequence of symbols with even parity. The symbols have lots of variability, lots of entropy, but always even parity. The output of the hash function is an endless sequence of zeros. That’s not very random. You can run that through DES or any other whitener and it won’t get more random.
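The parity example is easy to demonstrate concretely. A minimal sketch (illustrative only):

```python
def parity(data: bytes) -> int:
    """The 'simplest possible hash function': the XOR of all input bits.
    Arbitrary-length input, constant-length (1-bit) output."""
    p = 0
    for byte in data:
        p ^= bin(byte).count("1") & 1
    return p

# All 128 one-byte symbols of even parity: plenty of variability in the
# input, yet the hash output is identically zero -- and it stays zero no
# matter what whitener is applied downstream.
even_parity = [bytes([b]) for b in range(256) if bin(b).count("1") % 2 == 0]
print(len(even_parity), {parity(s) for s in even_parity})  # 128 {0}
```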

As a less extreme example, consider the WEAKHASH-2 function described in section 8.2. The whitener can conceal the obvious problem that all hashcodes have odd parity, but it cannot remedy the fundamental weakness that half the codes are going unused. The problem, no matter how well concealed, is still present at the output of the whitener: half the codes are going unused.

This is easy to understand in terms of entropy: once the entropy is lost, it cannot be recreated.

It is sometimes suggested that there may exist relatively-simple hash functions that have good mixing properties but require a whitener to provide other cryptologic strengths such as the one-way property. We respond by pointing out that the HRNG does not require the one-way property. We recognize that a Pseudo-Random Generator is abjectly dependent on a one-way function to protect its secret internal state, but the HRNG has no secret internal state.

We recognize that the HRNG may have some internal state that we don’t know about, such as a slowly-drifting offset in the data-acquisition system. However, we don’t care. The correctness of the final output depends on the variability in the raw data, via the hash-saturation principle. As long as there is enough unpredictable variability in the raw data, the presence of additional variability (predictable or not) is harmless.

As a practical matter, since the available hash functions are remarkably computationally efficient, it is hard to motivate the search for “simpler” hash functions, especially if they require whitening or other post-processing.

Finally, we point out that the last two steps in scheme 34 are backwards compared to scheme 36. If you really want whiteness, i.e. good statistical properties, it is better to put the crypto upstream of the contraction.

## 6  Threats and Attacks Against Random Generators

### 6.1  Threats and Attacks against Pseudo-Random Generators

For some applications, a well-designed PRNG may be good enough. However, for really demanding applications, at some point you might well throw up your hands and decide that implementing a good HRNG is easier.

A typical PRNG, once it has been seeded, depends on a state variable, a deterministic update function, and some sort of cryptologic one-way function. This allows us to classify typical PRNG attacks and failures into the following categories:

1. Improper seeding, i.e. internal state never properly initialized.
2. Leakage of the internal state over time. This rarely involves direct cryptanalytic attack on the one-way function, leading to leakage through the PRNG’s output channel. More commonly it involves side-channels.
3. Improper stretching of limited entropy supplies, i.e. improper reseeding of the PRNG, and other re-use of things that ought not be re-used.
4. Bad side effects, such as gobbling up more entropy from the upstream supply than necessary.
5. Gross defects, such as outputs that are somewhat predictable even if you don’t know the internal state.

The historical record contains many examples of failed PRNGs. For example:

1. As described in reference 10, early versions of the Netscape browser tried to initialize their PRNG using the time-of-day clock, which provides an example of failure 1. You have to consider the possibility that an adversary knows how to tell time.
2. As described in reference 11, the VENONA project provides a remarkable example of failure 3. The Soviets evidently suffered a shortage of high-grade randomly-distributed numbers, so they started re-using their One Time Pads. This was detected and exploited.
3. OpenSSL got bitten, too; see reference 12.
4. Kerberos got bitten, too; see reference 13, which also contains a good general discussion of why randomness is important.
5. The Taiwanese Citizen Digital Certificate program (reference 14) is an important widely-deployed system with a fatally-flawed RNG.
6. The Android Java Cryptography Architecture did not properly initialize its PRNG. This was exploited to steal stuff. See reference 15.
7. Reference 16 contains a section “How to cheat in online gambling” that provides additional examples of what can go wrong with PRNGs.
8. Another compendium of broken PRNGs can be found in reference 17.

See reference 18 for an overview of ways in which PRNGs can be attacked.

The following scenario serves to illustrate a few more of the ideas enumerated above: Suppose you are building a network appliance. It has no mouse, no keyboard, no disk, and no other easy sources of randomness (unless you use the audio system, or something similar, as described in this document). You want to sell millions of units. They start out identical, except possibly that each one has a network interface with a unique MAC address.

Seeding the PRNG with the MAC address is grossly inadequate. So the only way to have a halfway acceptable PRNG is to configure each unit by depositing into it a unique PRNG state vector. This means that each unit needs a goodly amount of writable but non-volatile storage; it can’t execute out of ROM and volatile RAM alone. Also, the stored state-vector needs to be updated from time to time; otherwise every time the machine is rebooted it will re-use the exact same numbers as last time, which would be an example of failure 3. Note that the standard Linux distributions only update the stored seed when the system is given a shutdown command – not if it goes down due to sudden power failure or software crash – which is unacceptable. You have to protect the state vector for all time. In particular, if a rogue employee sneaks a peek at the state vector (either on the machine itself, or on a backup tape, or whatever) and sells it to the adversaries, they can break all the traffic to and from the affected unit(s) from then on, which would be an example of failure 2. All these problems can be dealt with, but the cost might be so high that you give up on the PRNG and use a HRNG instead.

A subtle type of improper reseeding or improper stretching (failure 3) is pointed out in reference 19. If you have a source of entropy with a small but nonzero rate, you may be tempted to stir the entropy into the internal state of the PRNG as often as you can, whenever a small amount of entropy (ΔS) becomes available. This alas leaves you open to a track-and-hold attack. The problem is that if the adversaries had captured the previous state, they can capture the new state with only 2^ΔS work by brute-force search, which is infinitesimal compared to brute-force capture of a new state from scratch. So you ought to accumulate quite a few bits, and then stir them in all at once (“quantized reseeding”). If the source of entropy is very weak, this may lead to an unacceptable interval between reseedings, which means, once again, that you may be in the market for a HRNG with plenty of throughput, as described in this document.
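The quantized-reseeding idea can be sketched as follows. This is a toy illustration, not anyone’s production code: the class name and quantum size are made up, and SHA-256 merely stands in for whatever mixing function the PRNG actually uses:

```python
import hashlib

class QuantizedReseeder:
    """Accumulate entropy and stir it into the PRNG state only in large
    quanta, so that an attacker who holds the old state cannot cheaply
    brute-force each small increment (track-and-hold resistance)."""
    QUANTUM_BITS = 128   # arbitrary choice, for illustration

    def __init__(self, state: bytes):
        self.state = state
        self.pool = hashlib.sha256()
        self.pool_bits = 0.0

    def add_entropy(self, raw: bytes, est_bits: float):
        self.pool.update(raw)
        self.pool_bits += est_bits
        if self.pool_bits >= self.QUANTUM_BITS:
            # Stir in the whole quantum at once: recovering the new state
            # now costs a 2**QUANTUM_BITS search, not 2**est_bits.
            self.state = hashlib.sha256(self.state + self.pool.digest()).digest()
            self.pool = hashlib.sha256()
            self.pool_bits = 0.0
```

Until a full quantum has accumulated, the visible state is deliberately left untouched; stirring in dribs and drabs would be worse than waiting.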

The Linux kernel random generator /dev/random (reference 20) accumulates entropy in a pool and extracts it using a hash function. It is associated with /dev/urandom which is the same, but becomes a PRNG when the pool entropy goes to zero. Therefore in some sense /dev/urandom can be considered a stretched random generator, but it has the nasty property of using up all the available entropy from /dev/random before it starts doing any stretching. Therefore /dev/urandom provides an example of bad side effects (failure 4). Until the pool entropy goes to zero, every byte read from either /dev/random or /dev/urandom takes 8 bits from the pool. That means that programs that want to read modest amounts of high-grade randomness from /dev/random cannot coexist with programs reading large amounts of lesser-grade randomness from /dev/urandom. In contrast, the stretched random generator described in this document is much better behaved, in that it doesn’t gobble up more entropy than it needs.

It is certainly possible for PRNG failures to be found by direct analysis of the PRNG output, for example by using statistical tools such as reference 21.

More commonly, though, PRNGs are broken by attacking the seed-storage and the seed-generation process. Here are some examples:

a) On a system that relies on /dev/urandom in the absence of sources of real entropy, the attackers capture a backup tape containing the /var/lib/.../random-seed file. Then they provoke a crash and restart, perhaps by interrupting the power. Then they know the output of /dev/urandom for all time thereafter.
b) A related point: Suppose we are running a high-stakes lottery. Who provided the original seed for the random generator we will use? Even assuming (!) we can protect the seed for all times greater than t=0, where did the t=0 seed come from? If you provided it, how do I know you didn’t keep a copy?

If a PRNG contains N bits of internal state, it must repeat itself with a period no longer than 2^N. Obviously, N must be large enough to ensure that repetition never occurs in practical situations. However, although that is necessary, it is far from sufficient, and making the period longer is not necessarily a good way to make the PRNG stronger. By way of example: A 4000-bit Linear Feedback Shift Register (LFSR) has a period of 2^4000 or so, which is a tremendously long period ... but the LFSR can easily be cryptanalyzed on the basis of only 4000 observations (unless it is protected by a strong one-way function installed between the LFSR and the observable output). For a Stretched Random Generator (section 9), it is necessary to have a period long compared to the interval between reseedings, but it is not necessary to make it much longer than that. So, for a SRNG, a huge period is neither necessary nor sufficient. For a PRNG that is not being reseeded, a huge period is necessary but not sufficient.
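The LFSR point can be demonstrated at small scale. The following is a toy 16-bit register, not any real generator, and the tap constant is merely illustrative; the attack works for any taps, because a plain LFSR leaks its state directly.

```python
# Toy illustration: a 16-bit LFSR has a period of at most 2^16, yet
# anyone who knows the feedback taps can clone it after observing just
# 16 output bits, because those bits ARE the initial state. A long
# period by itself confers no cryptologic strength.

TAPS = 0x002D  # illustrative tap set (bits 0, 2, 3, 5)

def lfsr_stream(state: int, n: int):
    """Run the LFSR n steps; return the output bits and final state."""
    out = []
    for _ in range(n):
        out.append(state & 1)                  # output the LSB
        fb = bin(state & TAPS).count("1") & 1  # parity of tapped bits
        state = (state >> 1) | (fb << 15)      # shift; feed back at top
    return out, state

def clone_from_output(bits):
    """The first 16 output bits reveal the state, LSB first, because
    feedback bits take 16 shifts to reach the output position."""
    state = 0
    for i, b in enumerate(bits[:16]):
        state |= b << i
    return state
```

Once cloned, the register's entire future output stream is predictable, despite the "long" period.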

This is worth emphasizing: One key difference between a genuinely entropic random generator and a pseudo-random generator is that for the PRNG you have to worry about where you get the initial seed, how you recover the seed after a crash/restart, and how you protect the seed for all time, including protecting your backup tapes. For the HRNG you don’t.

See section 1.5.

### 6.3  Threats and Attacks Against the Turbid Design

Attacks against turbid and similar systems are very different from the usual sort of attack against a PRNG (such as are discussed in section 6.1).

The question is sometimes asked whether thermal noise is really “fundamentally” random. Well, it depends. Obviously, the magnitude of thermal noise depends on temperature, and theoretically an adversary could reduce the temperature of your computer to the point where the input signal calibration was no longer valid. In contrast, there are other processes, such as radioactive decay and quantum zero-point motion, that are insensitive to temperature under ordinary terrestrial conditions. This makes thermal noise in some sense “less fundamental”. However, the distinction is absolutely not worth worrying about. If somebody can get close enough to your computer to radically change the temperature, attacks against the HRNG are the least of your worries. There are other attacks that are easier to mount and harder to detect.

This is a special case of a more-general observation: The most-relevant attacks against the HRNG don’t attack the fundamental physics; they attack later stages in the data acquisition chain, as we now discuss.

Suppose you choose radioactive decay, on the theory that it is a “more fundamentally” random process. The decay process is useless without some sort of data acquisition apparatus, and the apparatus will never be perfect. For one thing, the detector will presumably have live-time / dead-time limitations and other nonidealities. Also, an adversary can interfere with the data acquisition even if the fundamental decay process is beyond reach. Last but not least, the raw signal will exhibit a Poisson distribution, which does not match the flat distribution (all symbols equally likely, i.e. 100% entropy density) that we want to see on the final HRNG output. Therefore the acquired signal will not be one whit more useful than a signal derived from thermal noise. The same sort of postprocessing will be required.

Similar remarks apply to all possible sources of hardware randomness: we do not expect or require that the raw physics will be 100% random; we merely require that it contains some amount of guaranteed randomness.

One very significant threat to the HRNG is the possibility of gross failure. A transistor could burn out, whereupon the signal that contained half a bit of entropy per sample yesterday might contain practically nothing today.

Another very significant threat comes from the possibility of bugs in the software.

Tampering is always a threat; see reference 4 and reference 5.

Even if everything is working as designed, we have to be careful if the noise is small relative to Q, the quantum of the digitizer, i.e. the voltage represented by the least-significant bit.

A more subtle threat arises if the digitizer has “missing codes” or “skimpy codes”. A soundcard that is advertised as having a 24-bit digitizer might really only have a 23-bit digitizer. Most customers wouldn’t notice.

An adversary could mount an active attack. The obvious brute-force approach would be to use a high-power electromagnetic signal at audio frequencies (VLF radio), in the hopes that it would couple to the circuits in the sound card. However, any signal of reasonable size would just add linearly to the normal signal, with no effect on the process of harvesting entropy.

A truly huge signal might overload and saturate the audio input amplifier. This would be a problem for us, because thermal noise would have no effect on an already-saturated output. That is, the small-signal gain would be impaired. An attack of this kind would be instantly detectable. Also, it is likely that if attackers could get close enough to your computer to inject a signal of this magnitude, they could mount a more direct attack. Remember, if we are using a 24-bit soundcard we have almost 16 bits of headroom (a factor of 10,000 or so) between the normal signal levels and the saturation level. I consider an attack that attempts to overload the amplifier so implausible that I didn’t write any code to detect it, but if you feel motivated, you could easily write some code to look at the data as it comes off the data-acquisition system and verify that it is not saturated. Having 16 bits (or more) of resolution is a great luxury. (It would be quite tricky to verify correct operation on other systems that try to get by with only one-bit or two-bit digitizers.)

Another possibility is interference originating within the computer, such as a switching power supply or a video card, that inadvertently produces a signal that drives the sound card into saturation. This is unlikely, unless something is badly broken. Such interference would cause unacceptable performance in ordinary audio applications, at levels very much lower than we care about. A decent soundcard is, by design, well shielded against all known types of interference. That is, a soundcard must provide good performance at low audio levels (little if any discernible additive noise, no matter what interference sources are present) and also good performance at high audio levels (little if any saturation). The possibility that an interference source that is normally so small as to be inaudible would suddenly become so large as to saturate a 24-bit ADC seems quite insignificant, unless there is a gross hardware failure. The odds here are not dominated by the statistics of the noise processes; they are dominated by the MTBF of your hardware.

Even if there is no interference, it may be that the sound card, or the I/O bus, radiates some signal that is correlated with the raw data that is being provided to the program. However, if you are worried about this sort of passive attack you need to be worried about compromising emanations (TEMPEST) from other parts of your system also. See reference 22. There is no reason to believe the audio hardware or HRNG software unduly increases your vulnerability in this area.

The least-fundamental threats are probably the most important in practice. As an example in this category, consider the possibility that the generator is running on a multiuser machine, and some user might (inadvertently or otherwise) change the mixer gain. To prevent this, we went to a lot of trouble to patch the ALSA system so that we can open the mixer device in “exclusive” mode, so that nobody else can write to it.

### 6.4  Partially Independent Samples

We do not need the samples to be completely independent. All we need is some independence.

Here’s an example that may help explain how this works.

Suppose I select 2500 completely random hex digits and write them on a page of blue paper. The ensemble of similarly-prepared pages has 10,000 bits of entropy. I copy the digits onto a page of red paper, and send you one of the pages.

The page gives you 10,000 bits of information that you didn’t have previously. After you look at the page, the entropy of the ensemble goes to zero.

If I now send you the other page, it conveys zero additional information. This demonstrates that information and entropy are not extensive variables. This is illustrated in figure 12. A lot of introductory chemistry books will tell you that energy and entropy are extensive, but reality is more complicated – for both energy and entropy – especially for smallish systems where the surface makes a nontrivial contribution.

Figure 12: Information and Entropy are Non-Extensive

What’s even more amusing is that it doesn’t matter which page I send first (red or blue) – the first one conveys information but the second one does not.

It is possible for a source to be partially dependent and partially independent. For example, suppose we shuffle a deck of cards. The ensemble of such decks has slightly more than 225 bits of entropy. That’s log2(52!). As is customary, this assumes we pay attention only to suit and rank, not orientation or anything like that.

Now we take ten copies of that deck, all ordered alike. At this stage they are 100% dependent. Then we “cut the deck” randomly and independently. “Cutting” means applying one of the 52 different full-length permutations. Now, the first deck we look at will provide 225 bits of entropy, and each one thereafter will provide 5.7 bits of additional entropy, since log2(52)=5.7. So in this situation, each deck can be trusted to provide 5.7 bits of entropy.

In this situation, requiring each deck to have no dependence on the others would be an overly strict requirement. We do not need full independence; we just need some independence, as quantified by the provable lower bound on the entropy. To repeat: we don’t need to quantify how much dependence there is; we just need to quantify how much independence there is. In our example, there is provably at least 5.7 bits of usable entropy per deck.

If you wanted, you could do a deeper analysis of this example, taking into account the fact that the entropy we are relying on, 5.7 bits, is not the whole story. We can safely use 5.7 bits as a lower bound, but under some conditions more entropy is available, as can be quantified by considering the joint probability distribution and computing the entropy of that distribution. Meanwhile the fact remains that we are not required to extract every last bit of available randomness. Under a wide range of practical conditions, it makes sense to engineer a random generator based on provable lower bounds, since that is good enough to get the job done, and a deeper analysis would not be worth the trouble.
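The entropy figures quoted in the deck example are easy to check directly:

```python
import math

# Checking the entropy bounds quoted for the shuffled-deck example.
deck_entropy = math.log2(math.factorial(52))  # a full shuffle
cut_entropy = math.log2(52)                   # one random cut

# A fully shuffled deck carries about 225.58 bits; each independently
# cut copy of it contributes only log2(52), about 5.70 bits, more.
```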

## 7  Magnitude of Supply and Demand for Entropy

For a typical personal workstation, the demand for high-grade randomness is quite nonuniform. The average demand is rather modest, but the peak demand can be considerably higher. Uses include:

• Seeding and reseeding PRNGs.
• Generating long-term keys and certificates for PGP, ssh, et cetera.
• Generating ephemeral session keys and initialization vectors for SSL, ssh, IPsec, et cetera.

For such uses, I/O timing typically provides a perfectly adequate supply of entropy. The Linux kernel random generator is an example: it collects a few bits per second from timing mouse, keyboard, and disk events.

Servers are more problematic than workstations. Often they have a much greater demand for entropy, and less supply. An example is a server which terminates hundreds of IPsec tunnels. During start-up it requires more than a hundred thousand high-quality random bits in a short time, which is far more than can be buffered in /dev/random, and far more than can be collected by the kernel in the available time.

An even worse situation arises in small networked controllers, which have no mouse, no keyboard, no disk, nor any other good sources of entropy. If we want to establish secure IPsec or ssh connections to such boxes, we simply must come up with a source of entropy.

Similarly, virtual machines need randomness but might not have any good way of obtaining it.

At this point, one often hears the suggestion that we send the box some new entropy over the network. This is not a complete solution, because although it helps defend against certain threats, it leaves others undefended. One problem is that if security is compromised once (perhaps because somebody stole the PRNG seed) then security is compromised more-or-less forever, because the adversary can read the network traffic that carries the new entropy.

The sound card can produce quite a bit of entropy. Even a low-end sound card can produce 44,000 bits per second. A Delta-1010 (which has a higher sample rate, higher resolution, and more channels) can produce a couple million bits per second.

## 8  Hash Function Requirements

### 8.1  Some Counterexamples

In order for the HRNG to work, the hash function must be reasonably well-behaved. To see what could go wrong, consider a few badly-behaved examples:

• Function WEAKHASH-0 divides the input into 160-bit blocks, and then XORs the blocks together. This would create the risk of rampant hash collisions, since any permutation of the input blocks would produce the same hash. Also a “stuck bit” in the input would result in a stuck bit in the output.
• Function WEAKHASH-1 divides the input into 160-bit blocks, computes the hash of each block separately using a good hash function, and then XORs these blockwise results together. This is better than WEAKHASH-0, because it solves the stuck bit problem, but it still suffers from the fact that any permutation of the input blocks would produce the same hash.
• Function WEAKHASH-2 only puts out odd-parity output words. Any particular bit is entirely predictable from the other 159 bits. That means that half of the C cells in the output-histogram are forever empty. The other cells will receive, on average, twice as many snippets as we would normally have expected. The output of such a hash function would saturate at 159 bits or less, not 160, no matter how much entropy is present at the input.

Weak hash functions have the property that they waste entropy. That is, their outputs are in some way more predictable than the outputs of a strong hash function would be. Obviously for our purposes we want to use a strong hash function.
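The first two counterexamples can be sketched concretely. The block size is shrunk to 4 bytes for readability (the text uses 160-bit blocks), and the use of truncated SHA-256 as the per-block "good hash" in WEAKHASH-1 is an illustrative choice.

```python
import hashlib

BLOCK = 4  # 4-byte blocks for readability; the text uses 160-bit blocks

def weakhash0(data: bytes) -> bytes:
    """WEAKHASH-0: XOR the input blocks together."""
    out = bytearray(BLOCK)
    for i in range(0, len(data), BLOCK):
        for j, byte in enumerate(data[i:i + BLOCK]):
            out[j] ^= byte
    return bytes(out)

def weakhash1(data: bytes) -> bytes:
    """WEAKHASH-1: hash each block separately, then XOR the results."""
    out = bytearray(BLOCK)
    for i in range(0, len(data), BLOCK):
        h = hashlib.sha256(data[i:i + BLOCK]).digest()[:BLOCK]
        for j, byte in enumerate(h):
            out[j] ^= byte
    return bytes(out)
```

Both functions collide whenever the input blocks are permuted, and WEAKHASH-1 has the further embarrassment that any input consisting of two identical blocks hashes to all-zeros.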

### 8.2  General Requirements

The hash function cannot possibly create entropy, and we do not need it to do so. Instead, what we need is a hash function that does not waste entropy through unnecessary hash collisions. In the unsaturated regime, there should be very few collisions, so any entropy present at the input is passed to the output without loss. In the saturated regime, when the output approaches 100% entropy density, there will necessarily be collisions and loss of entropy.

Hash functions are used in a wide range of applications. As a first example, let us consider how hash functions are used in compilers: the input to the hash function is a symbol-name, and the output is used as an index into the symbol table. It is often much more efficient to manipulate the hash than to manipulate the corresponding input string.

When two different inputs produce the same hash, it is called a hash collision. This is not desirable from the compiler’s point of view. So, the essential properties of a compiler-grade hash function are:

1) It takes as input a variable-length string, and returns a fixed-length string, called the hash.
2) It is a function; that is, the same input produces the same output every time.
3) Vague anti-collision property: There should not be more hash collisions than necessary.

A cryptologic hash function has several additional properties:

4) Noninvertibility: For any given hash, it is computationally infeasible to find an input that generates that hash.
5) Cryptologic anti-collision property: It is computationally infeasible to construct two different inputs that produce the same hash.

A typical application for a cryptologic hash is to compute a Hashed Message Authentication Code (HMAC), which can be used to detect and deter tampering with the message. See reference 23.

Property #4 (noninvertibility) only applies if there is a sufficient diversity of possible inputs. If the only possible inputs are “yes” and “no” then the adversary can easily search the input space; this is called a brute-force attack. When computing HMACs one prevents such a search by padding the message with a few hundred bits of entropy. The hash function itself has property #2 but not property #4, while the HMAC has property #4 but not property #2, because HMAC(message) = HASH(padding+message).

When building a High-Entropy Random Generator, we do not require noninvertibility at all. If the adversary knows the current hash-function output, we don’t care if the hash-function input can be inferred, because that has no bearing on any past or future outputs. The generator has no memory. This is one of the great attractions of this HRNG design, because nobody has ever found a function that is provably noninvertible.

In stark contrast, noninvertibility is crucial for any Pseudo-Random Generator. It is easy to arrange that the state variable (the input to the hash function) has plenty of bits, to prevent brute-force searching; that’s not the problem. The problem lies in initializing the state variable, and then defending it for all time. The defense rests on the unproven assumption that the hash function is noninvertible. What’s worse, the burden on the hash function is even heavier than it is in other applications: The hash function must not leak even partial information about its input. Otherwise, since the hash function is called repeatedly, the attacker would be able to gradually accumulate information about the generator’s internal state.

In discussions of this topic, the following notion often crops up:

6) Strict avalanche property: whenever one input bit is changed, every output bit must change with probability ½. See reference 24.

This statement is not, alas, strict enough for our purposes. Indeed, WEAKHASH-1 and WEAKHASH-2 satisfy this so-called strict avalanche criterion. To be useful, one would need to say something about two-bit and multi-bit changes in the input. We dare not require that every change in the input cause a change in the output, because there is a simple pigeon-hole argument: There are V = 256^2100 input states and only C = 2^160 output states. There will be hash collisions; see section 3.4.

### 8.3  Constructing Uniform Hash Functions

To see what criteria are really needed, we will need a more sophisticated analysis. The most general representation of a hash function (indeed any function) is as a function table. Table 2 shows a rather weak hash function.

| Row | Input | Output |
|----:|-------|--------|
| 1   | ()    | (01)   |
| 2   | (0)   | (10)   |
| 3   | (1)   | (11)   |
| 4   | (00)  | (00)   |
| 5   | (01)  | (01)   |
| 6   | (10)  | (10)   |
| 7   | (11)  | (11)   |
| 8   | (000) | (00)   |
| 9   | (001) | (01)   |
| 10  | (010) | (10)   |
| 11  | (011) | (11)   |
| 12  | (100) | (00)   |
| 13  | (101) | (01)   |
| 14  | (110) | (10)   |
| 15  | (111) | (11)   |
Table 2: Function Table: Weak Hash Function

The input symbols are delimited by parentheses to emphasize that they are variable-length strings; row 1 is the zero-length string. It should be obvious how to extend the table to cover arbitrary-length inputs. It should also be obvious how to generalize it to larger numbers of hashcodes.

The hashcodes in the Output column were constructed by cycling through all possible codes, in order. All possible codes occur, provided of course that there are enough rows, i.e. enough possible inputs. What’s more, all possible codes occur with frequencies as nearly uniform as can be. This construction produces exactly uniform frequency if and only if the number of rows is a multiple of the number of possible codes, which we call the commensurate case. The example in the table is non-commensurate, so even though the codes are distributed as evenly as possible, complete evenness is not possible. In this example, the (00) code is slightly underrepresented.
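The cycling construction matches table 2 exactly: with C = 4 codes, row r (counting from 1) simply gets code r mod C, so row 1 (the zero-length input) gets code (01), and in the 15-row non-commensurate case the (00) code comes out one short.

```python
from collections import Counter

def cyclic_outputs(rows: int, code_bits: int):
    """Output code for row r (1-indexed) is simply r mod C."""
    C = 2 ** code_bits
    return [r % C for r in range(1, rows + 1)]
```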

Table 2 is a hash function in the sense that it maps arbitrary-length inputs to fixed-length outputs, but it has very poor collision-resistance, and it has no vestige of one-wayness.

Starting from table 2, we can construct a better hash function by permuting the rows in the Output column, as shown in table 3. If there are V rows in the table and C possible codes, there are V! / ((V/C)!)^C distinct permutations. This is exact in the commensurate case, and a lower bound otherwise. This is a huge number in practical situations such as V = 256^2100 and C = 2^160.

| Row | Input | Output |
|----:|-------|--------|
| 1   | ()    | (01)   |
| 2   | (0)   | (10)   |
| 3   | (1)   | (10)   |
| 4   | (00)  | (00)   |
| 5   | (01)  | (11)   |
| 6   | (10)  | (01)   |
| 7   | (11)  | (01)   |
| 8   | (000) | (11)   |
| 9   | (001) | (10)   |
| 10  | (010) | (00)   |
| 11  | (011) | (11)   |
| 12  | (100) | (00)   |
| 13  | (101) | (11)   |
| 14  | (110) | (01)   |
| 15  | (111) | (10)   |
Table 3: Function Table: Improved Hash Function

If we temporarily assume a uniform probability measure on the set of all possible permutations, we can choose a permutation and use it to construct a hash function. Statistically speaking, almost all hash functions constructed in this way will have very good anti-collision properties and one-way properties (assuming V and C are reasonably large, so that it makes sense to make statistical statements).

This way of constructing hash functions is more practical than it might initially appear. Note that any cipher can be considered a permutation on the space of all possible texts of a given length. Hint: use a pigeon-hole argument: The ciphertext is the same length as the plaintext, and the fact that the cipher is decipherable guarantees that there is a one-to-one correspondence between ciphertexts and plaintexts.

This suggests a general method of constructing hash functions: Pad all inputs to some fixed length. Append a representation of the original length, so that we don’t have collisions between messages that differ only in the amount of padding required. Encipher. If it’s a block cipher, use appropriate chaining so that the resulting permutation isn’t a block-diagonal matrix. Pass the ciphertext through a contraction function such as WEAKHASH-0 to produce a fixed-length output. That is:

 Pad → Encipher → Contract                              (36)

For starters, this serves to answer a theoretical question: it is certainly possible for a hash function to generate all codes uniformly.
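The Pad → Encipher → Contract scheme can be sketched as follows. The "cipher" here is a toy 4-round Feistel network on 16-byte blocks; by the Feistel construction it is provably a permutation, which is the property the argument relies on. The key, round function, and contraction are all illustrative choices, not anyone's production design.

```python
import hashlib

KEY = b"illustrative-key"
FIXED = 16  # padded length in bytes

def feistel_permute(block: bytes) -> bytes:
    """A toy keyed permutation: 4 Feistel rounds on a 16-byte block."""
    left, right = block[:8], block[8:]
    for rnd in range(4):
        f = hashlib.sha256(KEY + bytes([rnd]) + right).digest()[:8]
        left, right = right, bytes(a ^ b for a, b in zip(left, f))
    return left + right

def cipher_hash(msg: bytes) -> bytes:
    assert len(msg) < FIXED
    # Pad with zeros and append the original length, so that messages
    # differing only in the amount of padding cannot collide.
    padded = msg + b"\x00" * (FIXED - 1 - len(msg)) + bytes([len(msg)])
    ct = feistel_permute(padded)                         # Encipher
    return bytes(a ^ b for a, b in zip(ct[:8], ct[8:]))  # Contract
```

Because the middle stage is a genuine permutation, every ciphertext occurs exactly once, and the uniformity of the contraction stage is easy to reason about.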

By the way, a hash constructed this way will have the one-way property, if the adversaries don’t know how to decipher your cipher. You could ensure this by using a public-key cipher and discarding the private key. (For what it’s worth, if you don’t need the one-way property, you could even use a symmetric cipher without bothering to keep the key secret.)

Another way to add the one-way property, for instance for a PRNG or SRNG, is to tack on a one-way function downstream of the cipher-based hash function. This may be more computationally efficient than using asymmetric crypto. A cipher with a random key will suffice. If the one-way function requires a key, consider that to be part of the keying and re-keying requirements of the PRNG or SRNG.

In any case, keep in mind that turbid does not require its hash to be one-way.

We remark that many modern hash functions, including SHA-256, follow the general plan Pad → Scramble → Contract, which is rather similar to scheme 36. A key difference is that the scramble function is not a normal cipher, because it is not meant to be deciphered. The Feistel scheme is conspicuously absent. Therefore we can’t easily prove that it is really a permutation. The pigeon-hole argument doesn’t apply, and we can’t be 100% sure that all hashcodes are being generated uniformly.

### 8.4  Assigning Probability to the Inputs

In section 8.3 we temporarily assumed a uniform distribution on all possible truth-tables. That’s not quite realistic; the mixing performed by any off-the-shelf hash function is not the same as would be produced by a randomly-chosen function table. We need to argue that it is, with virtual certainty, impossible for anyone to tell the difference, even if the adversaries have unlimited computing power.

Rather than just assuming that the hash function has the desired property, let’s look a little more closely at what is required, and how such a property can come about.

In section 8.3 each row in the Input column was treated pretty much the same as any other row. It is better to assign measure to each input-state according to the probability with which our data-acquisition system generates that state.

We then sort the truth-table according to the probability of each input. (This sort keeps rows intact, preserving the Input/Output relationship, unlike the permutation discussed previously which permuted one column relative to the others.) In the Input column we list all the states (all 256^2100 of them) in descending order of probability. In the Output column we find the corresponding hash. This allows us to see the problem with WEAKHASH-1: Because we are imagining that the data is an IID sequence of samples, with much less than 100% adamance density, a huge preponderance of the high-probability data strings will have the same hash, because of the permutation-invariance. Certain hashcodes will be vastly over-represented near the top of the list.

So we require something much stronger than the so-called strict avalanche property. Mainly, we require that anything that is a symmetry or near-symmetry of the raw data must not be a symmetry of the hash function. If single-bit flips are likely in the raw data, then flipping a single bit should move the data-string into a new hashcode. If flipping a few bits at a time is likely, or permutations are likely, then such things should also move the data-string to a new hashcode.

The input buffer will typically contain some bits that carry little or no entropy. In the space of all possible function tables, there must exist some hash functions that look at all the wrong bits. However, these must be fantastically rare. Consider the cipher-based hash functions described previously. For almost any choice of cipher key, the function will permute the hashcodes so that a representative sample falls at the top of the list. The key need not be kept secret (since the adversaries have no power to change the statistics of the raw data).

Unless we are extraordinarily unlucky, the symmetries of natural physical processes are not symmetries of the hash function we are using.

Let us contrast how much a cryptologic hash function provides with how little we need:

 A cryptologic hash function advertises that it is computationally infeasible for an adversary to unmix the hashcodes. What we are asking is not really very special. We merely ask that the hashcodes in the Output column be well mixed.

 A chosen-plaintext (chosen-input) attack will not discover inputs that produce hash collisions with any great probability. We ask that the data acquisition system will not accidentally produce an input pattern that unmixes the hashcodes.

We believe that anything that makes a good pretense of being a cryptologic hash function is good enough for our purposes, with a wide margin of safety. If it resists attack when the adversary can choose the inputs, it presumably resists attack when the adversary can’t choose the inputs. See also section 8.5.

Note that we don’t care how much the adversary knows about the distribution from which the input samples are drawn. We just need to make sure that there is enough sample-to-sample variability. If there is some 60-cycle hum or other interference, even something injected by an adversary, that cannot reduce the variability (except in ultra-extreme cases).

Once noise is added to a signal, it is very hard to remove — as anyone who has ever tried to design a low-noise circuit can testify.

### 8.5  A Simple Test of the Hash Function

It is often suggested that we test whether the output of turbid is random, using packages such as Diehard (reference 25) and Maurer’s Universal Statistical Test (“MUST”, reference 21). We have several things to say in response:

1) SHA-256 does pass those tests.
2) SHA-256 passes an even stricter test, namely the counter test described below.
3) Being able to pass such a test is necessary but far from sufficient as a criterion for the correctness of the turbid program. See section 4.

We note in passing a trivial weakness: There is a wide category of hash functions (including SHA-256), each of which operates on blocks of input data, and are Markovian in the sense that they remember only a limited amount of information about previous blocks. It has been shown that any hash in this category will be more vulnerable to multicollisions than an ideal hash would be. However, this is a chosen-plaintext attack, and is not even remotely a threat to turbid.

SHA-256 has been cryptanalyzed by experts; see e.g. reference 26 and reference 27. To date, no attacks have been reported that raise even the slightest doubts about its suitability for use in turbid.

However, bowing to popular demand, we performed some tests, as follows: We constructed a 60-bit counter using the LSB from each of 60 32-bit words; the higher 31 bits of each word remained zero throughout. These sparsely-represented 60-bit numbers were used, one by one, as input to the hash function. The resulting hashcodes were taken as the output, analogous to the output of turbid. We then applied Diehard and MUST.
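The sparse-counter input is easy to reproduce: a 60-bit counter whose bit i is stored in the LSB of 32-bit word i, with the upper 31 bits of every word held at zero. The exact word layout and byte order used in the original test are not specified, so the choices below are assumptions.

```python
import hashlib
import struct

def sparse_counter_block(n: int) -> bytes:
    """Spread the 60 counter bits across the LSBs of 60 32-bit words."""
    words = [(n >> i) & 1 for i in range(60)]
    return struct.pack(">60I", *words)  # 60 big-endian 32-bit words

def hash_stream(count: int):
    """Hashcodes for successive counter values, as fed to Diehard/MUST."""
    for n in range(count):
        yield hashlib.sha256(sparse_counter_block(n)).digest()
```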

As expected, SHA-256 passed these tests. Even the older, weaker SHA-1 passed these tests. Such tests are more likely to catch coding errors than to catch weaknesses in the underlying algorithm, when applied to professional-grade algorithms.

In relative terms, we consider this counter-based approach to be in some ways stricter than testing the output of turbid under normal operating conditions, because the counter has zero entropy. It has a goodly number of permanently-stuck bits, and the input to the hash function usually changes by only one or two bits each time.

However, in absolute terms, we consider all such tests ludicrously lame, indicative of the weakness of the tests, not indicative of the strength of SHA-256. We don’t want turbid to be judged by such low standards. See section 4 for more on this.

## 9  Stretched Random Generator

The turbid program comes bundled with a Stretched Random Generator. Its adamance density is strictly greater than zero but less than 100%. To use it, read from /dev/srandom.

The SRNG seeds itself, and reseeds itself every so often, using entropy from the HRNG. Reading from /dev/srandom consumes entropy 10,000 times more sparingly than reading from /dev/hrandom or /dev/random or /dev/urandom. It is also less compute-intensive than /dev/urandom.

The principle of operation is as follows: In addition to the seeding/reseeding process, there is a state variable, a deterministic update function, and a hash function. Between reseedings, at each iteration we deterministically update the state variable and feed it to the hash function. The output of the hash function is the output of the SRNG.

The update function treats the state variable as three 32-bit registers. Each register has a random starting point. The first register is updated by adding a random odd addend. The second is a Linear Feedback Shift Register. The third is another LFSR, with a different period. (This design was chosen because each register can be separately tested, in contrast to one big 96-bit LFSR which would be harder to test.) Each register separately has a period on the order of 2^32 and combined they give a period on the order of 2^96. The period is billions upon billions of times longer than it needs to be, because after 10,000 iterations the state variable is strongly perturbed using 128 bits of pure entropy from the HRNG.
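A minimal sketch of such an update-and-hash generator follows. The register widths and the reseed interval are taken from the text; the specific addend, the LFSR tap masks, the use of SHA-256, and os.urandom standing in for the HRNG are all illustrative assumptions, not the actual turbid parameters.

```python
import hashlib
import os

MASK32 = 0xFFFFFFFF

def lfsr_step(reg, taps):
    """One Galois-LFSR step on a 32-bit register."""
    lsb = reg & 1
    reg >>= 1
    if lsb:
        reg ^= taps
    return reg

class StretchedGenerator:
    RESEED_INTERVAL = 10_000   # iterations between reseedings

    def __init__(self):
        self.reseed()

    def reseed(self):
        # Stand-in for 128 bits of pure entropy from the HRNG.
        seed = os.urandom(16)
        self.counter = int.from_bytes(seed[0:4], "little")
        self.lfsr_a = int.from_bytes(seed[4:8], "little") | 1   # nonzero
        self.lfsr_b = int.from_bytes(seed[8:12], "little") | 1  # nonzero
        self.addend = int.from_bytes(seed[12:16], "little") | 1 # odd
        self.iterations = 0

    def next_block(self):
        if self.iterations >= self.RESEED_INTERVAL:
            self.reseed()
        # Deterministic update of the three 32-bit registers.
        self.counter = (self.counter + self.addend) & MASK32
        self.lfsr_a = lfsr_step(self.lfsr_a, 0xA3000000)  # taps: assumption
        self.lfsr_b = lfsr_step(self.lfsr_b, 0xC0000401)  # different taps
        self.iterations += 1
        state = b"".join(r.to_bytes(4, "little")
                         for r in (self.counter, self.lfsr_a, self.lfsr_b))
        # The hash output is the output of the SRNG; the one-way
        # property hides the state variable from observers.
        return hashlib.sha256(state).digest()

srng = StretchedGenerator()
block = srng.next_block()   # 32 bytes of output
```

Note that the one-way function sits between the easily-analyzed state variable and the observable output, which is the whole point of the next paragraph.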

It is commonly assumed that having a super-long period is a hallmark of a good PRNG. However, that’s neither necessary nor sufficient. A 4000-bit LFSR has a period of 2^4000 or so, but it can easily be cryptanalyzed on the basis of only 4000 observations (unless it is protected by a strong one-way function installed between the LFSR and the observable output). It is necessary to have a period long compared to the interval between reseedings, but it is not necessary to make it much longer than that. In summary, if the period is long enough, making it longer confers no advantage.

Unlike the HRNG, the SRNG must assume that the hash function has the one-way property. That is, we assume an adversary (even after ∼10,000 observations of the output of the hash function) cannot infer anything useful about the state variable.

The defining property of a SRNG is that it has an average entropy density that is greater than zero but less than 100%. Let’s take a moment to discuss where in the output-stream this entropy is to be found. Suppose you have 1024 bits of entropy available for seeding your generator, which you then use to put out a very long output string.

• For one typical class of PRNGs, which includes Linux /dev/urandom and excludes yarrow, the first 1024 bits of output will use up all the entropy, in the sense that the mapping from seeds to output-strings will be very nearly one-to-one. To say the same thing another way, suppose the adversaries are trying to find the seed by brute-force search of seed-space. If they’ve guessed the wrong seed they’ll know it with very high probability after seeing the first 1024 bits of output. They could also wait and know the same thing from the second 1024 bits of output, so we can have metaphysical arguments about just where the entropy “really” resides ... but since the adversaries get to see the output stream in order, our minimax strategy is to assume they are not stupid and have used the first 1024 bits to reduce their entropy. To summarize: when /dev/urandom decrements its “pool entropy” number by one bit per bit of output, that is the sensible minimax thing to do.
• In contrast, yarrow might well be configured to use only 128 bits of the available entropy to generate the first 1,600,000-bit block of output, and then use the second 128 bits for the second 1,600,000-bit block of output. This leads to a nice notion of average adamance density. However, this is only valid as a long-term average; by the previous minimax argument you could consider the entropy to be front-loaded in the first 128 bits of each output block.

## 10  Context, History and Acknowledgments

For a discussion of the Fortuna class of random generators, and how that contrasts with turbid, see reference 28.

For a discussion of the Linux device driver for /dev/random and /dev/urandom, see reference 20.

There is a vast literature on hash functions in general. Reference 24 is a starting point. Reference 29 is a useful literature survey. Reference 30 was an influential early program for harvesting entropy from the audio system; it differs from turbid in not having calibration features. Reference 31 uses the physics of a spinning disk as a source of entropy.

Methods for removing bias from a coin-toss go back to von Neumann; see reference 32 and references therein; also reference 33.
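Von Neumann’s trick is worth stating: examine successive non-overlapping pairs of tosses; output 0 for the pair 01, output 1 for 10, and discard 00 and 11. For a coin of bias p, both surviving outcomes have probability p(1−p), so the output is exactly unbiased (assuming independent tosses), at the cost of throwing away most of the input. A sketch:

```python
def von_neumann_debias(bits):
    """Map each non-overlapping pair of input bits to at most one
    output bit: 01 -> 0, 10 -> 1, 00 and 11 -> nothing.
    Exactly unbiased for independent tosses, whatever the bias."""
    out = []
    for a, b in zip(bits[0::2], bits[1::2]):
        if a != b:
            out.append(a)   # the first bit of the pair: 01->0, 10->1
    return out

print(von_neumann_debias([0, 1, 1, 0, 1, 1, 0, 0]))  # -> [0, 1]
```

Note the yield: for a fair coin, only a quarter of the input bits survive on average; for a heavily biased coin, far fewer.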

Reference 34 suggests hashing 308 bits (biased 99% toward zero) and taking the low-order 5 bits of the hash to produce an unbiased result. This is correct as far as it goes, but it is inefficient, wasting 80% of the input entropy. Presumably that reflects an incomplete understanding of the hash-saturation principle, in particular the convergence properties discussed in section 3.4, which would have led immediately to consideration of wider hashcodes. There is also no attempt to quantify how nearly unbiased the result is.

Thanks to Steve Bellovin, Paul Honig, and David Wagner for encouragement, incisive questions, and valuable suggestions.

## 11  Conclusions

Practical sources of randomness can be characterized in terms of the adamance density. The raw sources have an adamance density greater than zero and less than 100%. The methods disclosed here make it possible to produce an output that approaches 100% adamance density as closely as you wish, producing symbols that are random enough for any practical purpose. It is important to have a calculated lower bound on the adamance density. In contrast, statistical tests provide only an upper bound, which is not what we need.

It is possible to generate industrial-strength randomness at very low cost, for example by distilling the randomness that is present in ordinary audio interfaces.

## A  The Definition of Randomness and Surprisal

### A.1  Introduction and Intuitive Examples

In order to refine our notions of what is random and what is not, consider the following million-character strings. We start with

Example E1
: “xxxxxxxx...xx”      (a million copies of “x”)

That seems distinctly non-random, very predictable, not very complex. Next, consider the string

Example E2
: “xyxyxyxy...xy”      (half a million copies of “xy”)

That seems also non-random, very predictable, not very complex.

Example E3
: “31415926...99”      (the first million decimal digits of π/10)

That is somewhat more interesting. The digits pass all known tests for randomness with one narrow exception (namely tests that check for digits of π). However, they must still be considered completely predictable. Finally, consider the two strings

Example E4
: “AukA1sVA...A5”
Example E5
: “Aukq1sVN...G5”

The million-character string represented by example E5 is, as far as anybody knows, completely random and unpredictable. Example E4 is very similar, except that the letter “A” appears more often than it would by chance. This string is mostly unpredictable, but contains a small element of nonrandomness and predictability.

### A.2  Definitions

Following Solomonoff (reference 35 and reference 36) and Chaitin (reference 37) we quantify the surprisal of a string of symbols as follows:

Let z be a string of symbols. The elements of z are denoted zk and are drawn from some alphabet Z. The number of symbols in the alphabet is denoted #Z.

Let PC(z) be the probability that computer programs, when run on a given computer C, will print z and then halt. The surprisal is defined to be the negative logarithm of this probability, denoted \$C(z) := − logPC(z). In this document we choose to use base-2 logarithms, so that surprisal is measured in bits. Surprisal is also known as the surprise value or equivalently the unexpectedness. It is intimately related to entropy, as discussed below.

Note that we are using one probability distribution (the probability of choosing a computer program) to derive another (the probability PC(z) of a symbol string). To get this process started, we need to specify a method for choosing the programs. Better yet, using the notions of measure theory, we realize that probability need not involve choosing at all; instead all we really need is a method for assigning weight (i.e. measure) to each of the possible programs. Here is a simple method:3 Consider all possible programs of length exactly L* (measured in bits, since we are representing all programs as bit-strings). Assign each such program equal measure, namely 2^−L*. Then test each one and count how many of them produce the desired output string z and then halt, when run on the given computer C. The measure of a string is the total measure of the set of programs that produce it. A string that can be produced by many different programs accumulates a large probability.

In this construction, a short program X (i.e. one which has a length L(X) less than L*) is not represented directly, but instead is represented by its children, i.e. programs formed from X by padding it with comments to reach the required length L*. There are at least 2^(L*−L(X)) ways of doing the padding, and each way contributes 2^−L* to the total measure. This means that a string z that can be produced by a short program X will have a probability at least 2^−L(X), no matter how large L* is. We take the limit as L* becomes very large and use this in the definition of PC(z).

Usually the probability PC(z) is dominated by (the children of) the shortest program that produces z. Therefore some people like to use the length of this shortest program as an estimate of the surprisal. This is often a good estimate, but it should not be taken as the definition.

The explicit dependence of PC(z) on the choice of computer C calls attention to an element of arbitrariness that is inherent in any definition of surprisal. Different people will assign different values to “the” surprisal, depending on what resources they have, and on what a priori information they have about the situation.

In an adversarial situation such as cryptography, we suggest that probabilities be defined in terms of the adversary’s computer. If you have multiple adversaries, they can be treated as a single adversary with a parallel computer.

### A.3  Properties of the Surprisal

In this section, for conciseness, we drop the subscript C ... but you should remember that the probabilities and related quantities are still implicitly dependent on the choice of C.

Upper bound:   If z is a string of bits, the surprisal \$(z) can never exceed the length L(z) by more than a small amount, because we can write a program that contains z and just prints it verbatim. More generally, if z is a string of symbols, \$(z) can never exceed L(z) log(#Z) by more than a small amount. Again, that’s because we can write a program that contains a literal representation of z and just prints it verbatim.
Tighter Upper Bound:   The surprisal cannot exceed the length of the compressed representation4 of z by more than a bounded amount, because we can write a program that contains the compressed representation and calls the uncompress utility.
Lower Bound:   With very few exceptions, the surprisal \$(z) of any string z cannot be much less than L(z) log(#Z). You can see why this must be so, using a simple pigeon-hole argument: Consider all bit-strings of length 500, and suppose 1% of them have surprisal less than 490 bits. Well, we are talking about 1% of 2^500 strings, each having probability greater than 2^−490. That adds up to more than 100% probability. Oops.
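The arithmetic behind the pigeon-hole argument is easy to check:

```python
# 1% of the 2^500 strings, each (by hypothesis) carrying
# probability at least 2^-490, would account for a total of:
total = 0.01 * 2**500 * 2**-490
print(total)   # 10.24, i.e. more than 100% probability: contradiction
```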

### A.4  Limitations

A relatively minor objection to this definition of surprisal is that PC(z) includes contributions from arbitrarily-long programs. That is a problem in theory, but in practice the sum is dominated by relatively short programs, and the limit converges quickly.

A much more serious objection is that even for modest-sized programs, the definition runs afoul of the halting problem. That is, there may well be programs that run for a million years without halting, and we don’t know whether they would eventually produce string z and then halt. This means the surprisal, as defined here, is a formally uncomputable quantity.

We will duck both these issues, except for the following remarks.

• Any compressed representation that you happen to find for z puts a lower bound on PC(z) and an upper bound on the surprisal \$(z).
• There do exist compression algorithms that are computable (indeed efficiently computable) and are effective on certain types of signal. These algorithms, as a rule, make use of specialized knowledge about the type of signal (text, voice, music, video, et cetera).

### A.5  Surprise Density

In this section we shift attention from the unpredictability of strings to the unpredictability of the individual symbols making up the string.

Let us re-examine the examples given at the beginning of this section. Example E5 has surprisal \$(z) very close to L(z) log(#Z). We classify strings of this sort as absolutely-random, by which we mean algorithmically-random.

Examples E1, E2, and E3 all have surprisal much less than their length. These strings are clearly not absolutely-random.

The interesting case is example E4. Intuitively, we think of this as “somewhat unpredictable” but more predictable than E5. To make this notion quantitative, we introduce the notion of surprise density. The following quantity tells us the surprise density of the string z in the region from i to j:

σj|i(z) := [\$(Front(z, j)) − \$(Front(z, i))] / (j − i)              (37)

where Front(z,i) is the substring formed by taking the first i symbols of the string z.

The surprise density for examples E1, E2, and E3 is zero for any region not too near the beginning. The surprise density for example E5 is as large as it possibly could be, namely 100% of log(#Z). Example E4 illustrates the interesting case where the surprise density is quite a bit greater than zero, but not quite 100% of log(#Z).

As mentioned in section 1.2, we consider the unadorned term “random” to be ambiguous, because it means different things to different people. Some people think “random” should denote 100% surprise density, and anything less than that is “non-random” even if it contains a great deal of unpredictability. Other folks think that “random” refers to anything with a surprise density greater than zero, and “non-random” means completely predictable. Yet other folks extend “random” to include pseudo-random strings, as long as they are “random enough” for some application of interest. Even some professionals in the field use the word this way; reference 38 and the more recent reference 39 speak of “deterministic random number generators”, although it could be argued that it is impossible in principle for a process to be both deterministic and random. The randomness of a PRNG comes from the seed. Calling a PRNG “deterministic” seriously understates the importance of the seed. Compare reference 40.

In the space of all possible strings, almost all strings are absolutely-random, i.e. are algorithmically-random, i.e. contain 100% surprise density. However, the strings we find in nature, and the strings we can easily describe, all have very little surprise density. We can summarize this by saying: “All strings are absolutely-random, except for the ones we know about”.

We can use similar ideas to describe PRNGs and contrast them with HRNGs.

Most of modern cryptography revolves around notions of computational intractability. (I’m not saying that’s good or bad; it is what it is.) In contrast, there is another set of notions, including entropy and the related notion of unicity distance, that have got nothing to do with computational intractability.

 The long strings produced by a good PRNG are conditionally compressible in the sense that we know there exists a shorter representation, but at the same time we believe them to be conditionally incompressible in the sense that the adversary has no feasible way of finding a shorter representation. The long strings produced by a HRNG are unconditionally, absolutely incompressible. Specifically, the set of strings collectively cannot be compressed at all. As for each string individually, it is not compressible either, except possibly for trivially rare cases and/or trivial amounts of compression.

## B  Measures of Randomness

### B.1  Basics

In this section we climb down a bit from the formality of the previous section.

Let C be a machine that endlessly emits symbols from some alphabet Z. We assume the symbols are IID, that is, independent and identically distributed. That means we can calculate things on a symbol-by-symbol basis, without worrying about strings, the length of strings, or any of that. Let PC(i) denote the probability of the ith symbol in the alphabet.

Note: Nothing is ever exactly IID, but real soundcards are expected to be very nearly IID. At the least, we can say that they are expected to have very little memory. What we require for good random generation is a very mild subset of what is required for good audio performance.

The surprisal of a given symbol is

 \$C(i) := log(1/PC(i))              (38)

By extension we can talk about the probability of a string of symbols. Since the symbols are IID, the probability of a string z is:

PC(z) := ∏k PC(zk)              (39)

where zk is the symbol in the kth position in the string. As an immediate corollary,

\$C(z) := ∑k \$C(zk)              (40)

which just says that the surprisal is additive in a nice way.

The source entropy of the machine C is defined to be:

SC := ∑i PC(i) log(1/PC(i))              (41)

Let’s be clear: The surprisal is a property of the symbol, while the entropy is a property of the source. In particular, the entropy is the appropriately-weighted average surprisal.

For a long string, we expect the surprisal of the string to converge to the entropy of the source times the length of the string. The probability of this happening converges to 1, as the string gets longer and longer.
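These definitions can be exercised numerically. The two-symbol source below is an illustrative assumption; the sketch checks that surprisal is additive over an IID string (equation 40) and computes the source entropy as the weighted average surprisal (equation 41).

```python
from math import log2

# An IID source over the alphabet {'x', 'y'}; the probabilities
# are illustrative assumptions.
P = {'x': 0.9, 'y': 0.1}

def surprisal_symbol(sym):
    """Equation 38: surprisal of one symbol, in bits."""
    return log2(1 / P[sym])

def surprisal_string(z):
    """Equation 40: for IID symbols, surprisal adds up."""
    return sum(surprisal_symbol(s) for s in z)

# Equation 41: entropy is the probability-weighted average surprisal.
entropy = sum(p * log2(1 / p) for p in P.values())

print(surprisal_string('xxyx'))  # 3*log2(1/0.9) + log2(10) bits
print(entropy)                   # about 0.469 bits per symbol
```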

Beware: In the literature it is common to see the word “entropy” used in places where some other quantity would make more sense. In terms of Rényi functionals, sometimes H1 (the plain old entropy) is appropriate, sometimes H∞ is appropriate, and sometimes both or neither.

As a partially-related point: The Linux /dev/random driver keeps track of the “amount of entropy” in the “pool” (i.e. buffer). That idea is in the right general ballpark, but some remarks are in order:

• As always, entropy is a property of the distribution, in this case, an ensemble of similarly-prepared pools. The code is not at all clear about this crucial conceptual point. Adding the word “ensemble” would be an improvement. The attacker doesn’t know the bit pattern in the pool, so he has to worry about the ensemble.
• Do not think you can look at the bit pattern in the pool and figure out how much entropy there is. The code doesn’t do this, and you should not either. In other words, surprisal is not entropy. Really, really not.
• What the code actually does is different from what the comments say it does ... and from what the symbol-names imply it does.
• The driver is completely at the mercy of upstream entropy providers. It depends on them to provide entropy, and to correctly say how much entropy is being provided.
• It’s not at all clear that entropy is the only thing that matters. In particular, depending on the threat model, the Rényi functional H∞ might be more appropriate than the plain old entropy, as discussed in section B.2. Some people call H∞ a form of “generalized entropy” but that is misleading in the extreme, and completely undocumented in the /dev/random driver. It is wildly unsafe to assume that the upstream “entropy” providers are using H∞ when they calculate the required “entropy” estimate.

### B.2  Skewed Distributions

In general the attacker can work in a two-dimensional space. The variables are how much does it cost to break a message of a particular type, and the probability of seeing such a message. The attacker can sit wherever he likes in this space, depending on whether he wants a quicker break or a cheaper break. He can make the tradeoff however he likes.

Let’s analyze this from the sender’s point of view, to see how he would defend against this. We make the task more challenging by assuming that a great many messages are sent. The attacker can apply his best guess to each message, and then move on to the next message. He never needs to bother with his second-best guess.

The sender can draw samples from the original (skewed) distribution, and can concatenate them to produce a new (less-skewed) distribution. Subject to the given threat model, the thing we care about is entirely dominated by the most-likely element of the distribution. That is, for the distribution in figure 5, there is only 1 bit of adamance per sample, even though there are 4 bits of entropy per sample.

The source entropy S in equation 2b is a single number that characterizes the “average” behavior of the source. For a small sample from a highly-skewed distribution, S doesn’t tell you everything you need to know. This doesn’t change the definition of entropy; entropy is always defined by equation 2b. It just means that equation 2b by itself doesn’t solve all the world’s problems.

Within the given threat model, we are better off looking at H∞, the Rényi functional of order ∞. It is numerically equal to the surprisal of the most-probable element of the distribution.
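The gap between plain entropy and H∞ is easy to exhibit. The distribution below is constructed for illustration (it is not taken from figure 5): one heavy element plus 64 equally light ones, arranged so that the most-probable element has probability ½ while the entropy is 4 bits per sample, echoing the 1-bit-versus-4-bit contrast above.

```python
from math import log2

# One element with probability 1/2, plus 64 elements of 1/128 each:
# an illustrative skewed distribution.
probs = [0.5] + [0.5 / 64] * 64

entropy = sum(p * log2(1 / p) for p in probs)   # plain old entropy, H1
adamance = -log2(max(probs))                    # min-entropy, H-infinity

print(entropy, adamance)   # 4.0 bits versus 1.0 bit per sample
```

An attacker who always guesses the heavy element succeeds half the time, no matter how many light elements there are; that is what the 1 bit of adamance is telling us.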

In practice, turbid doesn’t have too much of a problem with skewed distributions, for two reasons: First of all, the raw data from a typical soundcard is not terribly skewed. It could perhaps be skewed if the card is producing much less than 1 bit of entropy per sample ... but such cards produce so little randomness that your best strategy is to go buy a better card.

## C  Adamance and Entropy of a Digitized Gaussian

### C.1  Two-Level Digitizer

As a warm-up exercise, let’s consider the case of a one-bit digitizer – i.e. a comparator – applied to Gaussian noise. The probabilities are shown in figure 13. Assuming the threshold of the comparator is fixed at zero, any input in the reddish-shaded region will result in a “low” output, while any input in the bluish-shaded region will result in a “high” output.

Figure 13: Shifted Gaussian with Comparator
Figure 14: Adamance and Entropy from a Comparator

Assuming the samples are IID, the entropy per sample coming out of the comparator is

 Scomparator(µ/σ) = −(P) log(P) − (1−P) log(1−P)
(42)

where

P = ½ + ½erf(−µ/(σ√2))
(43)

A plot of equation 42 is shown by the blue curve in figure 14. It should be obvious by symmetry that the entropy is one bit per sample when the Gaussian is centered at zero. In the case where µ/σ = 0.5 (the case illustrated in figure 13), the entropy is 0.89 bits.
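Equations 42 and 43 are easy to evaluate with the standard-library erf; a sketch:

```python
from math import erf, log2, sqrt

def comparator_entropy(mu_over_sigma):
    """Entropy per sample from a zero-threshold comparator applied
    to a Gaussian of mean mu and standard deviation sigma
    (equations 42 and 43)."""
    p = 0.5 + 0.5 * erf(-mu_over_sigma / sqrt(2))   # P(output is low)
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(comparator_entropy(0.0))   # 1.0 bit when the Gaussian is centered
print(comparator_entropy(0.5))   # about 0.89 bits
```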

Minor tangential remark: We know by looking at its definition that the entropy-versus-offset curve cannot possibly be a Gaussian, but it is astonishingly close.

As always, turbid is designed around provable lower bounds on the entropy. Since we cannot be certain where the Gaussian sits relative to the nearest riser, we assume the worst case, namely that it sits in the middle of a tread. In practice this doesn’t cost us much, because it produces a fairly tight lower bound. The bound stops being tight when σ/Q is quite small (less than 0.2 or so), which only happens when the soundcard has too little sensitivity for practical use.

Example: The Line-In port on the Thinkpad 600 is woefully lacking in sensitivity, and its impedance and bandwidth are unimpressive also. As a consequence, it has a rather narrow Gaussian which could easily sit entirely (except for negligible wings) in the middle of a single tread, so we cannot derive a useful lower bound on the entropy. Although intuition suggests that the signal actually contains some entropy, we can’t prove it, so we give up on this port. We use the Mike-In port instead, which has vastly more sensitivity.

### C.2  Three-Level Digitizer

Let’s take the next step in developing the ideas: now the A-to-D system has three output codes, not just two. Let Q denote the spacing between codes. In other words, Q is the width of a tread.

Terminology: Figure 15 shows the transfer function for an ideal voltmeter. A function that looks like this is called a staircase function, for obvious reasons. In this example, the resolution is 1 microvolt. Any analog input in the open interval (−2.5, 2.5) microvolts will produce a digital reading of 2 microvolts. This corresponds to a horizontal piece of the transfer function. Any such horizontal piece is called a tread, borrowing a term from the architecture of real staircases.

Meanwhile, in the graph, the vertical line that connects one tread to the next is called a riser. The riser is not part of the definition of the function, but we can draw it anyway, as a guide to the eye. More precisely, we will use the same word – riser – to refer to the input voltage where the output changes discontinuously from one code to another. At such points, we care more about the input voltage than the output code. No matter how (or whether) you define the transfer function at this point, it will be discontinuous at each riser.

Figure 16 shows the probabilities, assuming Q=1 for simplicity. (In practice Q is on the order of a nanovolt or a microvolt, not a whole volt.)

Figure 16: Gaussian with 3 Codes
Figure 17: Adamance and Entropy from 3 Codes

The equation for the entropy now has three terms, not just two:

 S3(µ, σ) = −a log(a)   −b log(b)   −c log(c)
(44)

where

a := ½erf((−0.5 − µ)/(σ√2)) − ½erf(−∞/(σ√2))
b := ½erf((+0.5 − µ)/(σ√2)) − ½erf((−0.5 − µ)/(σ√2))
c := ½erf((+∞)/(σ√2)) − ½erf((+0.5 − µ)/(σ√2))
(45)

For a reasonably wide Gaussian, using three codes allows us to obtain more than one bit per sample. The best case is when the Gaussian is centered in the middle of a tread, as you can see in figure 17.

We now consider the case of a narrow Gaussian. Being centered on a tread is the worst case, from the defenders’ point of view, as you can see in figure 19. Only one code gets generated with any appreciable probability. If we shift the narrow Gaussian so that it straddles a riser, we know we can get one bit per sample, as discussed in section C.1.

Figure 18: Gaussian with 3 Codes, Narrow
Figure 19: Adamance and Entropy from 3 Codes, Narrow

Nobody in the real world uses a three-code digitizer. However, it was worth deriving equation 44, because it provides a three-term approximation that is useful whenever the voltage distribution is narrow. That’s because even if the digitizer has many levels, when the distribution is narrow, only three of the codes make any significant contribution.

We can examine this as a function of σ. When the Gaussian is centered on a tread and σ is near 1.1608, the three-code digitizer does as well as it could possibly do. That’s the point where erf(0.5/(σ√2)) is equal to one third, so all three codes are generated with equal probability. The horizontal black reference line in figure 20 represents one trit. That’s the same as log2(3) bits, i.e. just less than 1.6 bits.

At this special point, when all three codes are equiprobable, the adamance is equal to the entropy.
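Equation 44 can be checked at this special point. The sketch below centers the Gaussian on a tread (µ = 0, Q = 1, risers at ±0.5) and confirms that near σ ≈ 1.1608 the entropy reaches one trit, while a narrow Gaussian (σ = Q/8) yields almost nothing.

```python
from math import erf, log2, sqrt

def phi(x):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 + 0.5 * erf(x / sqrt(2))

def s3(mu, sigma, q=1.0):
    """Equation 44: entropy of a three-code digitizer with tread
    width q, risers at -q/2 and +q/2, Gaussian mean mu, width sigma."""
    a = phi((-0.5 * q - mu) / sigma)        # lowest code
    b = phi((+0.5 * q - mu) / sigma) - a    # middle code
    c = 1.0 - a - b                         # highest code
    return sum(-p * log2(p) for p in (a, b, c) if p > 0)

print(s3(0.0, 1.1608))   # close to log2(3) = 1.585 bits: one trit
print(s3(0.0, 0.125))    # about 0.001 bits: hardly worth the trouble
```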

Figure 20: Adamance and Entropy versus σ

When σ is half of Q, we are still getting about 1 bit per sample, which is not great, but not terrible. Even when σ is 1/4 of Q, and the Gaussian is centered on a tread, we get 0.3 bits per sample, but things go rapidly to pot after that. When σ is 1/8th of Q, we are getting only 0.001 bits per sample, which is usually not worth the trouble.

In principle, if the Gaussian were centered on a riser, it would produce one bit of entropy and one bit of adamance per sample, even when the width σ was very small. However, it would not be easy to guarantee that the Gaussian was always sufficiently well centered.

Fortunately, it is easy to find soundcards where the RMS voltage is very large compared to Q. In that case, the three-term approximation discussed in this section is not valid and not needed.

### C.3  Multi-Level Digitizer

Let’s consider the case where the digitizer produces thousands of different codes, and the input distribution is very wide compared to the width of a tread. Let the width of each tread be denoted Q. Equation 44 generalizes as follows:

s = ∑i [−log(Pi)] Pi      (sum over all i from −∞ to +∞)

where

Pi = ½erf((i+½)Q/(σ√2)) − ½erf((i−½)Q/(σ√2))
   = ∫ from (i−½)Q to (i+½)Q of [1/√(2πσ²)] exp(−V²/(2σ²)) dV
(46)

There will be a discrete amount of probability Pi associated with each tread, i.e. with each digitizer code. The situation is shown in figure 21. The highlighted area represents one of the Pi terms that contributes to the sum in equation 46.

Figure 21: Gaussian With Many Codes

In the limit where the width of the distribution σ is large compared to the digitizer step Q, the Pi inside the logarithm in equation 46 can be simplified. We approximate the integral by height times width.

s = ∑i [ log(σ√(2π)/Q) + (i²Q²/(2σ²)) log(e) ] Pi
(47)

The explicit factor of log(e) is necessary, because the logarithms in these equations could use any base. [If I had wanted to specify the natural logarithm, I would have written ln() instead of log().] In particular, you might well want to use base-2 logarithms, so that the entropy comes out measured in bits.

The first term inside the square brackets in equation 47 is independent of i, so it can be summed immediately.

To handle the other term, we make another approximation. In the limit, the V² inside the integral is never very different from i²Q², so we can replace one by the other inside the integral. So at this point we have

s = log(σ√(2π)/Q) + log(e) [ (1/√(2πσ²)) ∑i ∫ from (i−½)Q to (i+½)Q of (V²/(2σ²)) exp(−V²/(2σ²)) dV ]

  = log(σ√(2π)/Q) + log(e) [ (1/√(2πσ²)) ∫ from −∞ to +∞ of (V²/(2σ²)) exp(−V²/(2σ²)) dV ]

(48)

The definite integral inside the big square brackets is easy to do. It is equal to 1/2. So, all that work leaves us with a remarkably simple expression for the entropy per sample:

 s = log(σ / Q) + ½ log(2πe)
(49)

The first term is easy to understand in terms of dimensional analysis – just ask how many bins fall under the high-probability part of the Gaussian. The second term is just a constant (2.05 bits).

I checked the accuracy of the analytic approximation in equation 49 by comparison to the exact result, summing the series given by equation 46. It was, as expected, extremely accurate when σ/Q is large. It is even surprisingly decent when σ/Q is not large. Also it has the nice property of erring to the low side, in keeping with the turbid design philosophy of not promising more than we can deliver.

| σ/Q | Direct Sum: s / bits | Analytic Approx: s / bits | Δ |
|---|---|---|---|
| 1 | 2.10 | 2.05 | <3% |
| 1.5 | 2.66 | 2.63 | <1% |
| 2 | 3.06 | 3.05 | <0.5% |
| 4 | 4.051 | 4.047 | <0.1% |

For larger σ/Q, the accuracy gets better and better. The situation is shown in figure 22. The solid blue curve is the exact result, obtained by direct evaluation of many terms of equation 46. You can easily tell when you have enough terms by checking that ∑Pi=1.
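The comparison between equation 46 and equation 49 can be reproduced in a few lines; a sketch, assuming a tread-centered Gaussian with Q = 1:

```python
from math import erf, log2, sqrt, pi, e

def s_exact(sigma_over_q, nterms=200):
    """Direct evaluation of equation 46: entropy of a tread-centered
    digitized Gaussian, summing enough terms that sum(P_i) ~ 1."""
    s = 0.0
    total = 0.0
    for i in range(-nterms, nterms + 1):
        p = 0.5 * erf((i + 0.5) / (sigma_over_q * sqrt(2))) \
          - 0.5 * erf((i - 0.5) / (sigma_over_q * sqrt(2)))
        total += p
        if p > 0:
            s -= p * log2(p)
    assert abs(total - 1.0) < 1e-9   # sanity check on the tail
    return s

def s_approx(sigma_over_q):
    """The analytic approximation, equation 49."""
    return log2(sigma_over_q) + 0.5 * log2(2 * pi * e)

print(s_exact(2.0), s_approx(2.0))   # about 3.06 versus 3.05
```

Note that the approximation errs to the low side, as the text says it should.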

Figure 22: Adamance and Entropy of a Digitized Gaussian

The dotted blue line in the figure is the analytic approximation, i.e. equation 49. For σ/Q larger than about 1, it is hard to distinguish from the exact result. For smaller values of σ/Q, a few terms of the exact expression will give you a good approximation. Note that by symmetry, when evaluating the so-called N-term approximation, you only need to evaluate (N+1)/2 terms.

The adamance calculation is even easier. We assume the Gaussian is centered on a tread, which is the worst case (from the defenders’ point of view). That gives us:

 H∞ = −log(P)
(50)

where

P = [½ + ½erf(0.5/(σ√2))] − [½ + ½erf(−0.5/(σ√2))] = erf(1/(2√2 σ))
(51)

For not-very-large σ, it is best to use equation 51 directly. For large σ, the argument to the erf is small, and we can obtain some insight by using a Taylor series. That’s easy, because the derivative of erf is well known. This is equivalent to approximating the probability of the most-probable code by multiplying its height times its width (on the probability density distribution plot). Either way, we get

P ∼ [1/√(2π)] (1/σ) exp(0)
(52)

so

 H∞ ∼ log(σ) + ½log(2π)
(53)

where the logarithm can have whatever base you prefer. This is shown by the dotted red line in figure 22.

We see that the adamance curve is offset below the entropy curve by ½log2(e) = 0.72 bits. This means that we can characterize the raw input in terms of adamance with no regrets. Using entropy instead would give us slightly higher throughput, but only slightly, at the cost of a less trustworthy claim of correctness.
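As a quick numerical cross-check (not part of turbid), the sketch below evaluates the tread-centered adamance exactly via equation 51 and compares it with the large-σ approximation of equation 53, working in units where Q = 1:

```python
from math import erf, log2, sqrt, pi, e

def adamance_exact(sigma_over_q):
    """Equations 50 and 51: minus the log of the probability of the
    most-probable code, for a Gaussian centered on a tread."""
    p = erf(1.0 / (2.0 * sqrt(2) * sigma_over_q))
    return -log2(p)

def adamance_approx(sigma_over_q):
    """Equation 53, valid for large sigma."""
    return log2(sigma_over_q) + 0.5 * log2(2 * pi)

print(adamance_exact(8.0), adamance_approx(8.0))  # both about 4.33 bits

# Comparing equation 49 with equation 53: the entropy exceeds the
# adamance by 0.5*log2(e) per sample in the large-sigma limit.
offset = 0.5 * log2(e)
print(offset)   # about 0.72 bits
```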

Keep in mind that when σ/Q is small, your efforts would probably be better spent getting a better soundcard, rather than micro-analyzing a lousy card.

## D  Noise in Voltage Dividers

For some perspectives on Johnson noise and the Nyquist formula, including what happens when there is a frequency-dependent gain, see reference 41.

This has some interesting implications and ramifications for our randomness generator. There are a couple of regimes to consider:

• Suppose you have a sound card where the input impedance consists of a nice small capacitance (high Z) and a not-so-nice smallish resistance (low Z), such that the corner frequency 1/(2πRC) is high compared to your digitizer’s usable bandwidth. In this regime you don’t care about the capacitance value; the impedance of the input is real (i.e. non-reactive) across the relevant part of the spectrum. The amount of noise on the input is determined by the resistance. The noise is not large (because the resistance is small), but at least it’s nice and white, i.e. independent of frequency.

In this regime, assuming the goal is to collect noise samples, you don’t want to add any additional resistance in parallel with what’s already there. That would just move the flat part of the curve to a lower voltage. (It would also push the corner frequency out to a higher frequency, but that doesn’t do any good if the digitizer doesn’t go that high.)

• In contrast, now suppose you have a sound card where the input impedance consists of a significant capacitance (low Z) and a relatively high resistance (high Z), such that corner frequency is low compared to your digitizer’s usable bandwidth. (I’ve seen examples of this.) The impedance is reactive across a big part of the relevant spectrum. This is tricky, because the noise is not white. It is pink. Successive samples of the voltage will be somewhat correlated.

There are various ways of dealing with this.

• One possibility is to decorrelate the signal by postprocessing it. You could do a deconvolution using Fourier methods, or you could build a digital filter that implements the inverse of the RC filter.
• Another possibility is to accept the signal as-is, and use its conditional entropy as our source of entropy. That is, recognize that each sample has some conditional probability distribution – conditioned on all of history – and as long as that distribution has enough spread, we can use it.
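The first possibility can be sketched as follows. This is a minimal illustration, not turbid's actual postprocessing: a one-pole IIR low-pass stands in for the RC filter, and its exact algebraic inverse recovers the uncorrelated samples.

```python
# Minimal sketch of decorrelation by inverse filtering.
# A one-pole IIR low-pass stands in for the RC filter:
#     y[n] = a*x[n] + (1-a)*y[n-1]
# Its exact inverse is:
#     x[n] = (y[n] - (1-a)*y[n-1]) / a
import random

def rc_lowpass(x, a):
    y, prev = [], 0.0
    for xn in x:
        prev = a * xn + (1 - a) * prev
        y.append(prev)
    return y

def inverse_rc(y, a):
    x, prev = [], 0.0
    for yn in y:
        x.append((yn - (1 - a) * prev) / a)
        prev = yn
    return x

random.seed(0)
white = [random.gauss(0, 1) for _ in range(1000)]
pink = rc_lowpass(white, a=0.1)      # successive samples now correlated
recovered = inverse_rc(pink, a=0.1)  # whitened again
err = max(abs(w - r) for w, r in zip(white, recovered))
print(err)   # tiny, limited only by floating-point roundoff
```

In practice the filter pole would have to be measured (or computed from the known R and C) rather than assumed, which is one more reason calibration matters.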

In this regime it makes sense to add some resistance. This whitens the noise. It lowers the peak noise per unit bandwidth. However, remember that the area under the curve is independent of the resistance. So as long as the corner frequency is not large compared to what your digitizer can handle, you’re not losing anything by lowering the resistance.

If the added external resistance dominates the internal resistance, that provides a nice convenience: It means you know what the resistance value is. This simplifies the calibration.
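The claim that the area under the curve is independent of the resistance is easy to verify. In the following sketch (the temperature and capacitance values are merely illustrative), the flat-band Johnson noise density 4kTR is rolled off by the pole at 1/(2πRC); integrating over all frequencies gives kT/C no matter what R is.

```python
from math import pi

k = 1.380649e-23   # Boltzmann constant, J/K
T = 300.0          # temperature, K (assumed)
C = 100e-12        # input capacitance, F (illustrative value)

def corner_freq(R, C):
    return 1.0 / (2 * pi * R * C)

def total_noise_variance(R, C):
    # Integral of 4kTR / (1 + (f/fc)^2) df from 0 to infinity
    #   = 4kTR * fc * pi/2 = kT/C, independent of R.
    fc = corner_freq(R, C)
    return 4 * k * T * R * fc * pi / 2

for R in (1e3, 10e3, 100e3):
    print(f"R = {R:8.0f} ohm   fc = {corner_freq(R, C):12.1f} Hz   "
          f"<v^2> = {total_noise_variance(R, C):.3e} V^2")
# Every row shows the same <v^2> = k*T/C, even though the corner
# frequency and the flat-band density each change by factors of 10.
```

What changes with R is where the noise power sits in frequency, which is exactly why lowering R (whitening) costs nothing as long as the corner stays within the digitizer's bandwidth.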

## E  ALSA Timing

Three concepts:

• The ALSA period is the interval between interrupts. The driver catches interrupts from the hardware. For example, for the capture direction, the driver has no idea that data is available until the interrupt arrives. Reference: http://alsa.opensrc.org/Period
• Turbid defines an iblock and an oblock. These refer to the amount of stuff transferred between the userland program and the driver. Normally such things are called i/o buffers, but I call them blocks, to avoid confusion with the following item.

The block size can plausibly be as small as the period, and as large as the buffer. On input, a small block size seems to have little or no effect on efficiency. In contrast, output block size seems to have a significant effect on efficiency. Larger blocks are more efficient. I have no clue how to explain the contrast.

• The ALSA buffer is internal to the driver. It is typically a few times larger than the ALSA period. The buffer allows the capture hardware to get a bit ahead of the userland program, without causing an overrun. In the other direction, the buffer allows the playback program to get somewhat ahead of the game, if it can.

Some observations:

• I’ve got it set up so that FrAppBuf is less than the ALSA buffer size.
• I’ve got it set up so that appbuf__fr is greater than the period; anything less would make no sense at all.
• The first readi blocks for 0.02 seconds. At this point the state transitions from 2 (prepared) to 3 (running).
• The first writei does not block at all. At this point the state transitions from 2 (prepared) to 3 (running).
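The 0.02-second figure is consistent with the driver waiting for one period's worth of frames before the first readi can return. As a back-of-the-envelope check (the 48 kHz rate and 1024-frame period are assumed for illustration; they are not necessarily turbid's settings):

```python
rate = 48000     # frames per second (assumed)
period = 1024    # frames per ALSA period (assumed)

# The first readi cannot return until the hardware has captured a
# period's worth of frames, so it blocks for roughly period/rate:
block_time = period / rate
print(f"{block_time:.4f} s")   # about 0.021 s
```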

Some ALSA timing is shown in figure 23. This is peculiar in a couple of ways. For one thing, the capture i/o buffer is a different size from the playback i/o buffer. Also, the first readi has been artificially delayed by 10 milliseconds relative to the first writei, just to clarify the relationship between the two streams.

Figure 23: Odd Timing

Each rectangle represents an i/o call. The bottom of the rectangle is the number of frames (input or output) before the call, and the top is the number of frames (input or output) after the call. Similarly the left edge is the time at the start of the call, and the right edge is the time upon return from the call.

In figure 23 the command was turbid show timing comb 0 freq 1000 ochan 0. You can see that the playback process has a lot of work to do, so the corner of one red rectangle does not meet the corner of the next. In contrast, the capture process has very little work to do, so the rectangles very nearly meet corner-to-corner.

Now consider what happens when we use snd_pcm_link(...) to synchronize the two streams. At the time of the first write to the playback device, the playback stream starts running (obviously) ... and the capture stream also starts running (because it is linked). The artificial 10 ms delay is still there, but it affects only the left edge of the first rectangle, not the right edge.

## F  References

1.
John Denker,
“Turbid, a High-Quality Randomness Generator – User’s Manual”
www.av8n.com/turbid/paper/turbid.htm
2.
“Court backs gambler”
http://www.theage.com.au/articles/2004/06/24/1087845018860.html
3.
John von Neumann,
“Various Techniques Used in Connection With Random Digits”
page 36 in Monte Carlo Method
proceedings of a symposium held June 29, 30, and July 1, 1949
edited by A.S. Householder, G.E. Forsythe, and H.H. Germond
Institute for Numerical Analysis (published 1951)
4.
“NIST Removes Cryptography Algorithm from Random Number Generator Recommendations”
http://www.nist.gov/itl/csd/sp800-90-042114.cfm
5.
Wikipedia article, “Dual_EC_DRBG”
https://en.wikipedia.org/wiki/Dual_EC_DRBG
6.
Wikipedia article, “1980 Pennsylvania Lottery scandal”
https://en.wikipedia.org/wiki/1980_Pennsylvania_Lottery_scandal
7.
Wikipedia article, “Faro shuffle”
www.av8n.com/turbid/paper/faro.htm
8.
Persi Diaconis, Susan Holmes, and Richard Montgomery,
“Dynamical Bias in the Coin Toss”
SIAM Rev., 49(2), 211–235 (2004)
http://dx.doi.org/10.1137/S0036144504446436
https://statistics.stanford.edu/sites/default/files/2004-32.pdf
9.
John Denker,
www.av8n.com/physics/thermo/entropy-more.html
10.
Ian Goldberg and David Wagner,
“Randomness and the Netscape Browser”
http://www.cs.berkeley.edu/~daw/papers/ddj-netscape.html
(Break of Netscape SSL PRNG.)
11.
The VENONA project
http://www.nsa.gov/docs/venona/index.html
especially http://www.nsa.gov/docs/venona/venona_remember.html
12.
Common Vulnerabilities and Exposures,
“OpenSSL 0.9.8c-1 up to versions before 0.9.8g-9 on Debian-based operating systems uses a random number generator that generates predictable numbers, which makes it easier for remote attackers to conduct brute force guessing attacks against cryptographic keys.”
http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2008-0166
13.
Bryn Dole, Steve Lodin, and Eugene Spafford
“Misplaced Trust: Kerberos 4 Session Keys”
in The Proceedings of the 1997 Symposium on Network and Distributed Systems Security
https://www.cerias.purdue.edu/tools_and_resources/bibtex_archive/archive/97-01.pdf
14.
Dan Goodin,
“Fatal crypto flaw in some government-certified smartcards makes forgery a snap”
Ars Technica (16 Sept 2013)
http://arstechnica.com/security/2013/09/fatal-crypto-flaw-in-some-government-certified-smartcards-makes-forgery-a-snap/
15.
Dan Goodin,
“Google confirms critical Android crypto flaw used in $5,700 Bitcoin heist”
Ars Technica (14 Aug 2013)
16.
John Viega and Gary McGraw,
See especially the section “How to cheat in online gambling” in chapter 10.
17.
Kai Michaelis, Christopher Meyer, and Jörg Schwenk,
“Randomly Failed! The State of Randomness in Current Java Implementations”
http://www.nds.rub.de/media/nds/veroeffentlichungen/2013/03/25/paper_2.pdf
18.
John Kelsey, David Wagner, Bruce Schneier, and Chris Hall
“Cryptanalytic Attacks on Pseudorandom Number Generators”
https://www.schneier.com/paper-prngs.pdf
19.
J. Kelsey, B. Schneier, and N. Ferguson,
“Yarrow-160: Notes on the Design and Analysis of the Yarrow Cryptographic Pseudorandom Number Generator”,
in Sixth Annual Workshop on Selected Areas in Cryptography, Springer Verlag, August (1999).
http://www.schneier.com/paper-yarrow.pdf
20.
Theodore T’so,
“random.c – A strong random number generator”
http://www.cs.berkeley.edu/~daw/rnd/linux-rand
21.
Ueli Maurer,
“A Universal Statistical Test for Random Bit Generators”.
Advances in Cryptology - CRYPTO ’90,
Lecture Notes in Computer Science, Springer-Verlag, 537, pp. 409-420 (1990).

Final version in J. of Cryptology (1992).
Abstract: http://www.crypto.ethz.ch/cgi-bin/show?what=abstractlabel=Maurer90b
Code: http://matrix.irb.hr/~mario/ftp/pub/random/utest.tgz
22.
National Security Agency
“TEMPEST Fundamentals”.
http://cryptome.org/nacsim-5000.htm
23.
FIPS 198 : HMAC - Keyed-Hash Message Authentication Code
(March 2002).
http://csrc.nist.gov/publications/fips/fips198/fips-198a.pdf
24.
Alfred J. Menezes, Paul C. van Oorschot and Scott A. Vanstone,
Handbook of Applied Cryptography,
CRC Press (1996). http://www.cacr.math.uwaterloo.ca/hac/
(The strict avalanche property is discussed on page 277.)
25.
George Marsaglia, “A battery of tests for random number generators”.
http://stat.fsu.edu/~geo/diehard.html
26.
Dmitry Khovratovich, Christian Rechberger, and Alexandra Savelieva,
“Bicliques for Preimages: Attacks on Skein-512 and the SHA-2 family”
IACR Cryptology ePrint 2011/286 (revised Feb. 2012)
http://eprint.iacr.org/2011/286.pdf
27.
Mario Lamberger and Florian Mendel, “Higher-Order Differential Attack on Reduced SHA-256”
IACR Cryptology ePrint 2011/037 (Jan., 2011)
http://eprint.iacr.org/2011/037.pdf
28.
John Denker,
“Discussion of the Fortuna Randomness Generator”
www.av8n.com/turbid/paper/fortuna.htm
29.
Terry Ritter,
“Random Number Machines: A Literature Survey”
http://www.ciphersbyritter.com/RES/RNGMACH.HTM
30.
Peter Gutman,
“Software Generation of Practically Strong Random Numbers”,
http://www.cs.auckland.ac.nz/~pgut001/pubs/random.pdf
31.
Don Davis, Ross Ihaka, and Philip Fenstermacher,
“Cryptographic Randomness from Air Turbulence in Disk Drives”,
in Proceedings of Crypto 94, Springer-Verlag Lecture Notes in Computer Science, No. 839 (1994).
32.
Yuval Peres,
“Iterating von Neumann’s Procedure for Extracting Random Bits”
The Annals of Statistics pp 590-597 (1992).
http://www.stat.berkeley.edu/~peres/mine/vn.pdf
33.
Michael Mitzenmacher,
“Tossing a Biased Coin”
http://www.fas.harvard.edu/~libcs124/CS/coinflip3.pdf
34.
D. Eastlake, 3rd, S. Crocker, J. Schiller,
“Randomness Recommendations for Security” (1994)
http://www.ietf.org/rfc/rfc1750.txt
35.
R.J. Solomonoff,
“A Preliminary Report on a General Theory of Inductive Inference”,
Report V-131, Zator Co., Cambridge, Mass (4 Feb 1960).
36.
R.J. Solomonoff,
“A Formal Theory of Inductive Inference I and II”,
Information and Control 7, 1-22 & 224-254 (1964).
37.
Gregory J. Chaitin,
Algorithmic Information Theory,
Cambridge University Press (1987). http://www.cs.auckland.ac.nz/CDMTCS/chaitin/cup.pdf
38.
FIPS 140-1 : SECURITY REQUIREMENTS FOR CRYPTOGRAPHIC MODULES
http://www.itl.nist.gov/fipspubs/fip140-1.htm
39.
Elaine Barker and John Kelsey,
NIST Special Publication: “Recommendation for Random Number Generation Using Deterministic Random Bit Generators”
http://csrc.nist.gov/publications/nistpubs/800-90A/SP800-90A.pdf
40.
Meltem Sönmez Turan, Elaine Barker, John Kelsey, Kerry A. McKay, Mary L. Baish, Mike Boyle
“Recommendation for the Entropy Sources Used for Random Bit Generation”
http://csrc.nist.gov/publications/drafts/800-90/draft-sp800-90b.pdf
http://csrc.nist.gov/publications/drafts/800-90/sp800-90b_second_draft.pdf
41.
John Denker,
“Perspectives on Johnson Noise”
42.
NIST, “Secure Hash Standard” (August 2015)
http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf
43.
Feynman, Leighton and Sands,
The Feynman Lectures on Physics

All three volumes are online. Volume I is at:
http://www.feynmanlectures.caltech.edu/I_41.html

44.
Horowitz and Hill
The Art of Electronics
45.
Charles Surya,
“Network Noise and Intermodulation Distortion”
Chapter 3 of Lecture Notes,
http://www.eie.polyu.edu.hk/~ensurya/lect_notes/commun_cir/Ch3/Chapter3.htm
46.
J. Johnson,
“Thermal Agitation of Electricity in Conductors”
Phys. Rev. 32, 97 (1928) http://prola.aps.org/abstract/PR/v32/i1/p97_1
47.
H. Nyquist,
“Thermal Agitation of Electric Charge in Conductors”
Phys. Rev. 32, 110 (1928) http://prola.aps.org/abstract/PR/v32/i1/p110_1
48.
B. Yurke and J.S. Denker, “Quantum network theory”
Physical Review A - General Physics, 3rd Series 29, 1419–1437 (March 1984)
http://pra.aps.org/abstract/PRA/v29/i3/p1419_1


49.
http://alsa-project.org
50.
Intel Corp.,
“High Definition Audio Specification” Revision 1.0a (17 June 2010)
http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/high-definition-audio-specification.pdf

1
I do not wish to discuss other less-practical more-technical notions of universality, which are very widely misunderstood.
2
The metaphor here is that a small change in the input-conditions (throwing a snowball) provokes a 100% change in the output-conditions (knocking loose all the snow on the mountainside). In meteorology and complex-systems theory, the same notion is called the “butterfly effect”.
3
Other methods are possible. It can be shown that the choice of method doesn’t matter much.
4
As always, we have to choose some units for measuring length and surprisal, and use those units consistently. This is tantamount to choosing a consistent base for our logarithms and exponentials. A conventional and convenient unit is the bit.