Hash functions are used for a variety of purposes, which we will discuss [[Applications of hashes|later]], but probably the main use is to verify that the content of a message is genuine and unaltered. To do this reliably, a hash must be *cryptographically strong*, otherwise it can be spoofed by an attacker. Strong hashes must satisfy addition criteria.

If a hash algorithm H has the following characteristics, it is said to be a *one way hash*:

*For a message m, the hash h = H(m) can be calculated within a reasonably short time.*

You must be able to calculate the hash of a message reasonably quickly otherwise the algorithm is of little practical use. This almost goes without saying, but it does set a limit on how computationally demanding a hash algorithm is allowed to be. A lot depends on the context. Is the algorithm running on a powerful workstation, or the tiny processor in a smart card? Is the hash being used to check gigabytes of data in an anti-virus system, or is it being used once a day to check a users logon password?

*For a hash value h, it is computationally infeasible to find a message m such that H(m) = h.*

This is the basic test of security for a hash algorithm - given a particular hash value, is it possible to go away and find a message which has that hash value. In the case of a simple checksum, it is trivial to construct a message with any given checksum (a message which consists of X bytes, each with the value 1, will have a checksum of X).

Any hash algorithm which does not meet this criteria should not be considered secure for any application.

*For a message m, it is computationally infeasible to find a second message m1 such that H(m) = H(m1)*

This refers the case where an attacker knows the original message and the hash (in any case the hash can be calculated from the message). This additional information might help them to create a second message with the same hash value, for example if the hash algorithm has weaknesses.

With a weak algorithm, it might be possible to change one part of the message, and then make an "opposite" change somewhere else in the message to maintain the same hash value. This would allow an attacker to change an important part of the message, and hide the change by also changing an unimportant part of the message to keep the hash the same.

Some hash algorithms also have an additional property, collision resistance, which means:

*It is hard to find two messages m and m1 such that H(m) = H(m1)*

In this case, the attacker is simply trying to find any 2 messages which share the same hash code. This is potentially easier than trying to find a message with a particular hash - it is also less useful in many cases, but there are still dangerous attacks which are made possible if the hash is not collision resistant. The importance of this will become significant later when we discuss birthday attacks.

We tend to make use of hash codes as if there was a unique, one to one correspondence between the message and the hash. If 2 messages have the same hash code, then their content must be identical.

Of course this isn't strictly true. In fact it isn't true at all, but as we shall see it is an extremely reasonable assumption.

A 128 bit hash has 2^{128} possible distinct values (approximately 10^{38}) - this is a very large number, but it is still finite.

The number of possible messages is far, far greater than this. To give a concrete example, the total number of possible messages of 1 GByte or less is around 10^{5,000,000,000} - that is a one followed by five thousand million zeros!

In other words, for every possible 128 bit hash code, there are a huge number of possible files which share the same hash code. In fact, there are almost an infinite number, for any practical purposes.

So how can we consider a hash code to be unique to a message? If I create a message and find its hash code, knowing perfectly well that there are an almost infinite number of possible alternatives, how can we be confident that nobody else can find a message with a matching hash?

The point is, of course, that for every possible matching message there are 10^{38} messages which don't match. If you set yourself the task of finding a match within the next 10 years you would need to be searching around 10^{28} messages a second, which is utterly infeasible given current technology.

Copyright (c) Axlesoft Ltd 2020