Saturday, November 12, 2011

Another explanation of hash functions

I think I've found a new way to explain hash functions and convey the intuition behind them.

First, think back to the classic problem from grade school: a farmer raises cows and chickens. His animals have a total of, say, 10 heads and 22 feet. How many cows and chickens does he have?

If you know algebra, you probably laugh at how easy the problem is, though it's still interesting for kids, who generally have fun with it. Now, you can either solve it the kid way, or the algebraic way. The kid way is to "guess and check": that is, guess a number of chickens and of cows, find the corresponding number of heads and feet, and check if it matches that given in the problem. The algebraic way is to let x be the number of chickens, y the number of cows, and write:

x + y = 10
2x + 4y = 22

And then solve for x and y the standard way. (In case it needs to be said, chickens have one head and two feet while cows have one head and four feet.)

What does this all have to do with hash functions, though? Glad you asked.

A cryptographic hash function for 10,000 BC

Recall the requirements for a cryptographic hash function: it must be easy to compute the output (digest) for any input (preimage), but hard to compute the input for any output (other than by trying every preimage). In other words, a one-way function (or trapdoor one-way function without the trapdoor).

So, here's a hash function for a mathematically backward world: First, it takes two integers as input. You let the first be number of chickens, and the second be the number of cows. Then, output the total number of heads and feet (perhaps with a simple separator, like "10:22" for the above example).

It works as a hash function for the people of 10,000 BC precisely because of their (relatively) poor mathematical understanding. Since they don't know to express it as a system of linear equations, or otherwise derive a simple general solution, someone trying to "crack" a given hash digest (output) would have no choice but to guess and check a bunch of chicken/cow possibilities. This method is referred to as "brute force" in cryptography, and as long as a hash function hasn't yet been "broken" (by a better understanding of the math theory involved), it's the only option available.

Though this situation may seem contrived to us now, the same dynamic is at play in modern, military grade cryptographic hash functions: so long as people lack sufficient mathematical understanding, it is impossible to invert a hash digest except by brute force. The only difference between now and then is that the math and computations are harder.

In the previous post, I mentioned how hash functions can be used to protect stored passwords while still allowing password-based authentication. Let's go over how this would work.

An encrypted password system for 10,000 BC

Let's say Mike the Merchant wants to set up a service of warehousing people's valuables. He'll deal with many customers and doesn't want to count on remembering their faces when they come to reclaim their stuff. So, he'll give each customer a unique password they must use to get in. This will be a stronger authentication system than just trying to remember every customer.

Like modern website owners, Mike also wants to keep his password records from being stolen or misused (say, to steal his customers' stuff). If he simply kept the passwords in a book, someone who gained access to it could copy them and then come one by one to illicitly claim the goods in the warehouse. (Though of course, Mike can always use common sense security measures, like noticing something is fishy when the same customer one day claims to know all the passwords!)

So, rather than store the passwords, he converts them into two integers (if the password isn't already in that form) and stores the digest from putting them through the hash function described above. So, if Harry the Hoarder's password were 100:350, Mike would store Harry's password as 450:1600. (That is, interpret the number before the colon in the password as chickens and the latter as cows, and then store the number of heads [100 + 350 = 450] and number of feet [2*100 + 4*350 = 1600], separated by a colon.)

Then, when someone comes in claiming to be Harry, Mike asks for the password. Next, instead of comparing Harry's password (100:350) to an entry in his book, he first computes the hash digest (450:1600), and compares that number with his entry for Harry. So, he still has a working system to authenticate people by password.

What protection does this offer Mike if someone gains access to his password book and copies the entries? Well, remember, like with modern systems, all they get are the digests that result from putting the passwords through the hash function, not the passwords themselves. And knowing a digest won't convince Mike that you're the account holder: remember, he's checking for a pair of numbers that hash to your entry, not the entry itself.

Could the attacker infer the passwords from the digests listed in Mike's book? Yes, but it would take infeasibly long to do so -- that's the role of the hash function. Going from cows/chickens to heads/feet is easy, but going the other way is hard (for a mathematically backward society, at least). To make any use out of Mike's hashed-password book, an attacker would have to guess a huge number of passwords and see if their digests match any in the book. As long as the password space is big enough, and the society remains mathematically backward enough, it's just not feasible for an attacker to guess and check enough passwords to find a match in the book.

And like in the previous section, this is the same position we are in with respect to hashed password storage today: with a good enough mathematical breakthrough, it might become feasible to quickly invert a hashed password, but as it stands now, the hash functions used are enough to render such databases useless to attackers (though obviously some attacks still get through, such as when a website continues to use a long-broken hash function). The only difference is the complexity of the math.

Of course, Mike still has to make sure his customers don't give out their passwords or store them insecurely ... a problem we still grapple with, twelve thousand years later.

7 comments:

Anonymous said...

100+350=450 not 400. Other than that, great explanation, even for this old brain.

Rahul said...

Good analogy.

The one feature it doesn't capture is the economy of storage (and possibility of subsequent clashes).

Could two different pairs of cows and roosters lead to the same heads and legs?

A sha or RSA can generate a (reasonably / computationally) unique fingerprint in 64 bits of millions of line of text.

Can we extend your analogy?

Silas Barta said...

Good points, Rahul. I had considered extended the analogy, but wanted to keep it simple. If I had continued, here is what I would have said:

To make this more like modern hashes, the output would have to be fixed length, independent of the input length. The cow/chicken hash function as described allows infinite size, and varies with input size.

To turn it into a fixed-output-length hash, one simple way would be to add one more step: divide both values by some large number n and post the remainder. Or, in math jargon, compute the output modulo n. (That would also allow collisions -- different pairs of cows/chickes to produce the same hash digest.)

What counts as a "large n" would depend on what resources attackers are likely to have, how many passwords you will want to store, and how hard you can permit the process to be when validating someone's password.

Furthermore, you don't want attackers to be able to pre-compute big lists of preimage-digest pairs (aka rainbow tables), so you'd want to add salting. That is, generate a unique random number (or two random numbers) that you store wish each hash, and which somehow mixed in with the preimage (i.e. it somehow alters the input number of cows/chickens) before hashing.

Does that answer your question?

Rahul said...

Silas,

sure what you say makes sense. It just doesn't stay as "cute" and appealing as your original analogy.

Was wondering if there was a natural extension to the story. But I can't think of a better extension either.

webdesigne said...

great work guyz

eSignature said...

A special thanks for this informative post. I definitely learned new stuff here I wasn't aware of !

Anonymous said...

These problems use to make me feel dizzy all the time and I am still not able to solve them. Once you are able to understand the concept it is easy for solving these types of problems.