Don‘t Make a Hash of it

Home
Career
Developer blog
Don‘t Make a Hash of it

2 min read — by Arno Schödl

When hashing data comprised of several components, such as a struct or tuple or range, the solution offered by the C++ standard library is calling std::hash on the individual components and somehow combining the results, for example by XORing with a salt.

But performance-wise, this is less than ideal. Hash functions typically keep internal state larger than the hash output while digesting input, and to compute the hash value, run some finalization operations. For example, HighwayHash has an internal state of 1024 bits, but after finalization, outputs 64 to 256 bits. std::hash calculates a final hash value, so has to run these finalization operations on each call. A streaming hash is better: you pass a stream of all the data you want to hash to the hash function, and only run the finalization at the end.

Clearly, to minimize collisions, for any given type T to hash, different values of T must produce different streams passed to the streaming hash. Any such stream is also a valid serialization of a value of type T because it can be used to reconstruct the value: given a stream corresponding to a value of T, you enumerate all possible values of T, produce all their streams, and find the one that matches the given stream. So any hash stream is a valid serialization stream, and vice versa.

To illustrate how ignoring the equivalence of hashing and serialization leads to bad hashes, suppose we have a struct of two vectors. If you hash each vector by hashing all their values in sequence, but not hashing the vector sizes, you would get a hash collision if the concatenation of the two vectors has the same values. And of course, such a stream would also not be an unambiguous serialization of the two vectors.

Additionally hashing the sizes would solve this problem. For vectors, that‘s ok, but for forward ranges, we do not know the size in O(1) and would have to traverse the range twice, once to find the size and again to get the values. That‘s inefficient.

It would be better to hash the size at the end. For a forward range, we could then count the number of items while hashing them, requiring only a single traversal. But does this produce unambiguous hashes? Only if it is a valid serialization, which at first glance it is not: sticking to the two-vector example above, reading from the beginning of the stream, we do not know where in the stream the serialization of the first vector ends and of the second vector starts.

But here is a nifty trick: we can assume the serialization stream is the reversed stream. Then we would read the size of the second vector first and thus know when to switch to reading the first vector. But this means that, to allow freely composing hashes, all types must produce their hash streams such that the reversed stream is an unambiguous serialization! For example, std::optional first hashes its value if engaged, and then, unconditionally, the bool if the optional is engaged. See our libary's tc::hash_append for implementations for many data types.

Do you have feedback? Send us a message at devblog@think-cell.com

We are hiring

Enough time to make sure that every detail of your solution is perfect
Become part of an international team of brilliant minds
No scheduled meetings

Join as a C++ developer