
I find it disingenuous to see properties such as equidistribution or low discrepancy being described as 'quality' of a generator. Yes, they make certain Monte Carlo algorithms converge quicker. No, they are not closer to a truly random sequence than a CSPRNG is. The goal of a CSPRNG is to be indistinguishable from true randomness, and a predictor is quite obviously a distinguisher. But so is a statistical test, or any 'nice' property that makes some algorithm behave better.

By the way, I think I've said this before, but I'll say it again. I really dislike the usage of the term 'CSPRNG' in these discussions. A CSPRNG is typically understood to be /dev/urandom---an algorithm that not only generates a long sequence of numbers, but also harvests entropy and tries to achieve forward and backward security. But the term is also sometimes overloaded to mean a stream cipher, which is a conceptually simpler construct and can also be much faster. Stream ciphers can definitely compete in performance with things like Xorshift.



If, by quality, you mean "indistinguishable from true randomness" then there are many valid metrics. One important metric that many non-CS PRNGs have, but many CSPRNGs do not have, is a provable deterministic cycle length for all seeds.

These characteristics don't just make Monte Carlo simulations converge quicker; they also do things like guarantee that your 52-card deck shuffle can actually produce the one shuffle that gives player two a royal flush on the flop in a round of Texas hold 'em. Or, more generally, they guarantee that your Array.shuffle() method is unbiased for a reasonably sized array. In general you need true randomness and/or a long-cycle linear-transformation PRNG to make those sorts of guarantees. It's also nice that they're fast. That said, I do also agree that the scenarios in which these things matter are rather rare, and most people should probably just use a CSPRNG by default.

Do you really think a CSPRNG is fast enough for your standard library's array shuffle, though? Wouldn't you rather have something like xorshift1024* that's probably more than an order of magnitude faster? Maybe a CSPRNG is fast enough... I don't have all the numbers. So this is an honest question.
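For what it's worth, a quick-and-dirty comparison is easy in Python, whose random module is MT19937-based while SystemRandom pulls from the OS CSPRNG. The exact slowdown varies by platform and workload; this is a sketch, not a rigorous benchmark:

```python
import random
import timeit

data = list(range(1000))

mt = random.Random(42)          # MT19937, a non-CS PRNG
csprng = random.SystemRandom()  # backed by os.urandom (/dev/urandom)

t_mt = timeit.timeit(lambda: mt.shuffle(data), number=100)
t_cs = timeit.timeit(lambda: csprng.shuffle(data), number=100)

print(f"MT19937:      {t_mt:.4f}s")
print(f"SystemRandom: {t_cs:.4f}s")
```

On typical systems the CSPRNG-backed shuffle is noticeably slower, but whether that matters depends entirely on how often you shuffle.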


CSPRNGs are a type of pseudo random number generator, which are defined to be deterministic. I have no problem with calling /dev/urandom a CSPRNG for practical purposes, but very strictly speaking, it's not.


Do you mean there cannot be a CSPRNG? Most of cryptography is based on security in practice, not in absolute theory (i.e. everything except for one time pads and maybe some quantum crypto). That's like saying there are no cryptographically secure encryption algorithms other than one time pads, since you can break all of them in theory. That's a pretty useless definition of cryptographically secure.

edit: By "in theory" I mean that with enough computational resources you could break them even if you didn't find some new, clever weakness. Not that you could break them in theory because a weakness could always be found.


No, I mean that the fact that /dev/urandom "reseeds" mid-stream means it is not, strictly speaking, pseudo-random, since it is not completely deterministic. Maybe I'm wrong, but the comment I was replying to was arguing that /dev/urandom was the only CSPRNG, and things like stream cipher algorithms are not.


> But so is a statistical test, or any 'nice' property that makes some algorithm behave better.

Not really. In general, the statistical tests are chosen such that when the law of large numbers comes into force, a perfectly random sequence would pass them with flying colors. If repeated runs of a perfect random number generator were failing these with a frequency that was improbable (i.e. far outside what the central limit theorem would predict), I think it would be safe to say the universe was playing a trick on you. Many of these are very closely related to predictability in practice as well. Uniformity is a statistical test, and I think it's probably the lowest bar you can set for "unpredictable".
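As a concrete illustration, here is a minimal chi-squared uniformity test in Python. The ~30.6 threshold is the standard table value for 15 degrees of freedom at p < 0.01, and the biased generator is a made-up example for contrast:

```python
import random

def chi_squared_uniformity(samples, bins=16):
    """Chi-squared statistic for uniformity over `bins` residue classes."""
    counts = [0] * bins
    for s in samples:
        counts[s % bins] += 1
    expected = len(samples) / bins
    return sum((c - expected) ** 2 / expected for c in counts)

rng = random.Random(1)
good = [rng.randrange(256) for _ in range(100_000)]    # 256 % 16 == 0: unbiased bins
biased = [rng.randrange(200) for _ in range(100_000)]  # 200 % 16 != 0: bins 0-7 slightly favored

# With 15 degrees of freedom, a statistic above ~30.6 rejects uniformity
# at p < 0.01; the biased stream lands far beyond that.
print(chi_squared_uniformity(good))
print(chi_squared_uniformity(biased))
```

A truly random source passes this with the expected statistical fluctuation, which is exactly the point: the test is built so that passing is what randomness predicts.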

The fundamental issue with measuring "randomness", especially for a PRNG, is that there is no set definition of information entropy suitable for calculation. Shannon entropy requires you to have a (possibly numerically sampled) PDF. The problem is that if you want to find the entropy of a sequence, you can't use the PDF of all the bytes you've gotten out of the generator, since that PDF will assign probability 1 to the exact sequence you obtained, and zero everywhere else. If you do it by multiple runs, there is no way you'd get enough data to construct an accurate PDF with a suitably large sample size. You have to use something based on a highly subdivided sequence of values, but if you only look at each value in isolation you will only determine the uniformity at some small bin size, which ignores patterns due to proximity of values in the output.
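A toy example of why single-byte histograms are a weak entropy measure: the sequence below has a perfectly uniform byte PDF, and so scores the maximum 8.0 bits/byte, while being completely predictable:

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits/byte of the single-byte empirical PDF."""
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

# Uniform byte histogram, zero actual unpredictability:
patterned = bytes(range(256)) * 100
print(byte_entropy(patterned))  # 8.0 bits/byte, yet fully predictable
```

Any proximity structure (here, bytes simply counting upward) is invisible to a measure that looks at values in isolation.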

So generally entropy is calculated using a more complex probability kernel, like the ones used in these statistical tests. But none of these can possibly be perfect. The "perfect" measure of entropy is Kolmogorov complexity[1], which captures the minimum length of a program you would need to be able to print the given output. On top of being undecidable in general, and also unobtainable in practice, you can show an upper bound of the Kolmogorov complexity on a CSPRNG that is actually quite low. Just take the algorithm the CSPRNG uses, plus any external entropy it got (from hardware interrupts, etc.). So in the end, various kinds of statistical tests are the only ones that make sense.

The actual goal of a CSPRNG is far more specific than just simulating a truly random sequence, which is why it differs from those used in simulations. It is unpredictability, defined by trying to make it so this short program which can generate all the output is impossibly hard to obtain from just the output in practice. In a vague sense, it is like (some of) the desired properties of a hash or cipher: we can't use a perfect random source (resp. one-time pad), so let's get something that will take an impossible amount of computational power/measured output to break (resp. ciphertext using the same key), mix it in with a seed that's as close to truly random as we can get (resp. IV), and then mix some more in from time to time to make it even harder (resp. changing keys). That is the reason stream ciphers, while generally not as good as /dev/urandom-style CSPRNGs, are actually closely related enough that I think it's fair to call them CSPRNGs as well. They're just ones with a far worse algorithm/entropy source.

[1]: https://en.wikipedia.org/wiki/Kolmogorov_complexity


> Not really.

A statistical test _is_ a distinguisher! This seems self-evident to me. When the test repeatedly fails (with an extreme p-value or whatever criteria you want to apply here for 'failure'), you have successfully distinguished the sequence from random. Also you seem to misunderstand what I meant by equidistribution---uniformity is a different property, and of course crucial. I am puzzled, then, as to why you claim some generators are more 'uniform' than cryptographic ones.

Any pseudorandom bit generator whose next bit cannot be predicted in under 2^n "operations" also passes _all_ statistical tests that require less than this computational effort. This is a well-known result by Yao. In other words, unpredictability and indistinguishability are equivalent. In the case of, e.g., AES-CTR, n = 64 (birthday-bound distinguishers force it to be 64 instead of 128).
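Yao's equivalence cuts both ways: a generator whose next output is predictable is trivially distinguishable. A toy Python sketch with a full-state LCG makes this concrete (the constants are Knuth's MMIX parameters, used purely for illustration):

```python
def lcg(seed, n, a=6364136223846793005, c=1442695040888963407, m=2**64):
    """Full-state LCG: each output IS the internal state, so a single
    observed output lets anyone replay the entire future stream."""
    out, x = [], seed
    for _ in range(n):
        x = (a * x + c) % m
        out.append(x)
    return out

stream = lcg(seed=12345, n=10)
observed = stream[3]                 # attacker sees just one output...
predicted = lcg(seed=observed, n=6)  # ...and regenerates everything after it
assert predicted == stream[4:]       # perfect prediction => perfect distinguisher
```

A CSPRNG is designed precisely so that no efficient version of this attack exists: recovering a state that replays the stream should cost the full 2^n effort.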

Your philosophizing about what true randomness really is does not strike me as very useful for practical purposes. Given unbounded computational power, obviously no such algorithm is actually indistinguishable. But who cares?


Unpredictability is irrelevant to a use case like shuffling an array. Moreover, your conjectured polynomial-time perfect CSPRNG may actually produce a less useful stream of entropy for this use case than a good non-CS PRNG because its cycle length may well be shorter, and all that matters is uniformity (which can actually be guaranteed, not just conjectured and measured empirically) and enough seed entropy to kick things off. Top it all off with the non-CS PRNG being perhaps orders of magnitude faster.

I'll repeat once again that I do think people should generally default to CSPRNGs, but I really don't understand why everyone involved in crypto insists that there are no valid use cases for non-CS PRNGs / that CSPRNGs are uniformly better for all use cases when it seems very obvious to me that they are not. I don't see any reason why a good PRNG and a good CSPRNG shouldn't exist in every standard library. Am I missing something?


A short cycle would be a catastrophic outcome for any serious cipher. Although not every generator has a provably guaranteed period---some rely instead on the properties of the 'average' random function---it is easy to guarantee one. AES in counter mode, where key=seed and block=counter, has a guaranteed period of 2^128.
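A sketch of the counter-mode construction in Python. The stdlib has no AES, so SHA-256(seed || counter) stands in here; note that with a real block cipher (a keyed permutation) distinct counters give guaranteed-distinct blocks, whereas with a hash they are merely overwhelmingly likely to be distinct:

```python
import hashlib

def ctr_block(seed: bytes, counter: int) -> bytes:
    """Counter-mode PRNG sketch. Block i of the output stream is
    PRF(seed, i); the counter, not the luck of the seed, drives the
    period, since the counter provably never repeats before wrapping."""
    return hashlib.sha256(seed + counter.to_bytes(16, "big")).digest()

seed = b"\x00" * 32
blocks = [ctr_block(seed, i) for i in range(4)]
assert len(set(blocks)) == 4  # distinct counters -> distinct blocks
```

With AES the same structure gives exactly 2^128 distinct 16-byte blocks per key, because encryption under a fixed key is a bijection on blocks.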

As to why we insist that cryptographic primitives are uniformly better? Well, that is more subjective. I personally believe that they have gotten fast enough that the set of cases where they are unacceptably slow is vanishingly small, and the trend is clearly on the side of crypto getting faster. And I can sleep at night knowing that my results are not wrong due to some unknown, unexpected, property of my generator.


That's all fair-ish, and I think I need to learn more about CS alternatives. Necessary cycle length is relative. For crypto it seems like the definition is "big enough that it generates keys that are infeasible to predict," or "big enough that it can be used to stretch entropy a long ways for a cipher" for all practical purposes. But even the AES generator you cited sounds like it would be incapable of generating an unbiased shuffle of a deck of cards, which has 52! or ~2^225 unique shuffles... For a use case like this it's hard for me to wrap my head around using anything other than a good non-CS PRNG or true randomness. If you believe otherwise I'm truly interested in your thoughts, links, etc.
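The ~2^225 figure above is easy to check: 52! needs 226 bits to write down in binary, so a generator needs at least that much seed/state entropy before every shuffle is even reachable.

```python
import math

shuffles = math.factorial(52)  # number of distinct orderings of a 52-card deck
print(shuffles.bit_length())   # 226, i.e. 52! is about 2^225.6
# A generator seeded with only 128 bits can reach at most 2^128 of these
# shuffles from any fixed shuffling procedure.
```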

Another case where a long cycle length is useful would be the sort of scenario in my post, at much larger scale. If you have thousands of machines re-seeding at every service restart and then generating sequences of millions or billions of random numbers, and you want a vanishingly small probability that any of those sequences will ever overlap, could you use a CSPRNG?
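For a rough feel for the numbers: treating each machine as landing at a random start point on one shared cycle, the standard birthday-style heuristic P[overlap] ≈ machines² × draws / period gives an estimate. This is an approximation I'm assuming here, not a rigorous bound:

```python
import math

def overlap_probability(machines: int, draws: int, log2_period: int) -> float:
    """Birthday-style heuristic: k random start points on a cycle of period
    P, each consuming L values, collide with probability on the order of
    k^2 * L / P (computed in log space to avoid overflow)."""
    return 2.0 ** (2 * math.log2(machines) + math.log2(draws) - log2_period)

# 10,000 machines each drawing a billion values from a 2^128-period stream:
p = overlap_probability(10_000, 10**9, 128)
print(p)  # on the order of 1e-22
```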

Re: the generator properties, do you _really_ believe that? I mean, good non-CS PRNGs usually have fairly rigorous presentations and pass statistical tests. They aren't as sexy as crypto stuff, and maybe don't get the eyeballs, but they're relatively straightforward proven math whereas all of the CSPRNG results are based on conjecture. It seems at least as likely that some unexpected property of a CSPRNG will be discovered, a la RC4.


The shuffle question ties in with the overlap question. Let's stick with the AES in counter mode example. AES is a block cipher, or in other words a keyed permutation. What this means is that each key (resp. seed) determines a distinct permutation that we use to generate the blocks AES_K(0), AES_K(1), etc. In effect, each seed will result in an entirely distinct and independent 2^128-block sequence, instead of simply the same sequence starting at a different place, which is usually the case for non-cryptographic generators. So 2^128 possible keys (or 2^256, if you go AES-256) times 2^128 output blocks per key is probably enough to cover all possible card shuffles.
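The contrast with non-cryptographic generators is easy to demonstrate: for something like an LCG, two different seeds can simply be two positions on the same cycle, so the streams are shifted copies of each other, whereas each AES key selects an entirely different permutation. A small Python sketch of the LCG case (glibc-style constants, for illustration only):

```python
def lcg_stream(seed, n, a=1103515245, c=12345, m=2**31):
    """Generate n outputs from a simple LCG starting at `seed`."""
    out, x = [], seed
    for _ in range(n):
        x = (a * x + c) % m
        out.append(x)
    return out

s1 = lcg_stream(42, 10)
# "Seeding" with a later output just lands on the same cycle, shifted:
s2 = lcg_stream(s1[2], 7)
assert s2 == s1[3:]  # same sequence, different starting point
```

With a keyed permutation, no analogous relationship between two keys' output streams is (known to be) efficiently detectable.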

I do believe that. RC4 is a good example, actually; the unexpected properties that make RC4 considered completely broken for cryptography are considered irrelevant for simulations (I won't swear by this, but I'll bet that RC4 passes all of TestU01's tests). The standards are just so much higher for cryptographic primitives. I'm sure unsexy generators can be suitable to this or that application. But due to the ad hoc nature of the statistical tests used, you can't be sure that your generator is not oddly correlated with a concrete application. I wrote a comment a few days ago---on a thread very similar to this one---about a real-world case of this happening [1].

[1] https://news.ycombinator.com/item?id=10549024


Ok, I get the seed/state space vs. cycle length bit. So it can theoretically generate all of the permutations of a shuffle if you re-key/re-seed periodically. But you can't guarantee that it actually will, right? It is possible that, for all 2^128 keys, and in all 2^128 cycles there are some shuffles that AES will never generate (in other words, some sequences of 52 numbers that will never be generated) so you effectively have a 0% chance of hitting them.

From a cryptanalysis standpoint I think that's consistent -- the probability of generating any particular single shuffle is astronomically low, so the difference between the actual probability and 0% is probably outside the realm of what's computationally feasible. But with something like MT19937 I can show that it's at least possible for every outcome to have a non-zero probability. Perhaps this is just interesting academically, but maybe not if you're actually playing a card game, for instance?


No, that cannot be guaranteed. Equidistribution in 52 dimensions necessarily requires a minimum state size of 52 log_2(52) bits, which we don't even have. Even if we did, it could not be guaranteed anyway.

You have already covered the distinction between guaranteed equidistribution and not, so I don't think there is much to add there. I do think the concern is mostly academic, since the difference between a subsequence that doesn't exist and one that will never be reached before the universe collapses is irrelevant to the user. I will note that Richard Brent, someone who is much better qualified than I to discuss this, has argued against the usefulness of equidistribution: http://maths-people.anu.edu.au/~brent/pub/pub240.html


I think you have to hit every permutation, at least for AES, because AES is reversible.

In particular, if there were some 0 <= y < 2^128 such that AES_k(x) != y for all 0 <= x < 2^128, then by pigeonhole two different values x1 and x2 would have to encrypt to the same value, which can't happen if AES is reversible.
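The pigeonhole argument can be illustrated with a toy invertible "block cipher": an affine map x → a·x + b (mod n) with gcd(a, n) = 1, standing in for AES's keyed permutation on 128-bit blocks. Because the map is invertible, its outputs must cover the whole domain:

```python
n = 256         # toy block size (AES's is 2^128)
a, b = 37, 101  # gcd(37, 256) == 1, so x -> a*x + b (mod n) is invertible

outputs = {(a * x + b) % n for x in range(n)}
assert outputs == set(range(n))  # reversible => every value y is reached
```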


Sure but I don't immediately see how that equates to a proof of its ability to produce every combination of, say, 52 consecutive numbers... Just that it will produce every possible individual number. In PRNG-parlance it shows 1-dimensional equidistribution but not 52-dimensional (or generally (k, w)-dimensional) equidistribution. Or is there another pigeonhole step I'm missing?



