isn't quite ashamed enough to present

jr conlin's ink stained banana

:: Could Someone Check My Math?

Dear Lazyweb,

i am not a math whiz. i get math, but not to the level that it's effortless music. So i'd appreciate ye far smarter folk double checking my thinking here.

The Problem
For the Notifications stuff i'm working on, i need a token that is Really Hard to Guess. That means Lots of Entropy stuffed into a token.

Because this is going through SMTP, i have a hard limit of 64 characters. While also not part of the RFC (or anything else, experience with various MTAs have shown that you're limited to a through z and 0 through 9. Using a handy Entropy Bit Calculator, that shows that:

log2(36)*64 ~= 331 Bits of Entropy.

That means you have a 1 in 2331 (or 4,374,501,449,566,023,848,745,004,454,235,242,730,706,338,861,700,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000) chance of guessing it. (In case you're curious, that's 4 duotrigintillion)

That's great, and all, but does lead one to wind up getting a token that is 64 characters of random crap, like:
wgmrbj3d9pcv1op6opprw7hvjgzimzonakq3m02egtjh8zs2bzyuxnowfx5e9fxu

So i was asked to research how good/bad an idea it might be to make the token a little less random.

The Solution

Specifically, i was asked if i could create a token based on words. That means you'd get a token like:
faucet.variorum.hambling.baiters.harmed.stoa.haysel.glandular.61
i arbitrarily decided to separate the words with periods and back-fill with numbers.

That leads to some interesting math.

i pulled the English Open Word List and collected all the words into a single file (dropping the words that have non-ascii characters). That produced a catalog of 128,766 candidate words. A crappy token generator script later, i found that the mean token consisted of 8 words and was about 61 characters long. By my guess, each token would consist of
[a-z]{2} + // shortest word length was 2 characters
[a-z\.]{59} + // the bulk of the token would be character or a period
[a-z0-9\.]{3} // the remainder would be character, number or period

This would lead me to think that the base entropy would be something like:
(log2(26)*2) + (log2(27)*59) + (log2(37)*3)
or about 306 bits of entropy. (1 in 130 novemvigintrillion odds)

The real problem

But i'm not sure that's true.

That's supposing that the core of the characters being used are taken randomly from that pool of 27 available characters. They're not. These are using words in the English language which means that there are certain character distributions which diminish the pool of potential entropy. Granted, that's masked somewhat by the varying word lengths, but that still means that there aren't a hell of a lot of words that have J, Q, X or Z in them, and that the default letter set that Wheel of Fortune picks for you sucks.

If one takes only the letters that have a greater than 10% chance in 1000 of appearing, then you wind up with about 15 meaning you've got:

log2(26)*2 + log2(27)*59 + log2(37)*3 = 30 bits of entropy, or 1 in 1,073,741,824

That number is no where near good enough. It's even pronounceable.

So, really, the question comes down to: "How much entropy am i losing by using real words?"

Thanks, Lazyweb!

i'll just keep refreshing the page until you answer.

As pointed out in the comments, the short answer is "It's a bad idea". You wind up picking 8 items out of a ~ 128K pool and that's not enough to prevent folks from reasonably guessing.

'tis good to have friends with big brains.

Blogs of note
personal Christopher Conlin USMC memoirs of hydrogen guy rhapsodic.org Henriette's Herbal Blog
geek ultramookie

Powered by WordPress
Hosted on Dreamhost.