## A proof via the pigeon-hole principle

Suppose Bob tells you that his program, Ultrazip, can compress
*any* file to at most some fraction *p* of its original
size.

For example, if Bob says he can give you guaranteed compression of
80% (*p = 0.8*) and you give Ultrazip a 100 byte file, you will always get back a
file of 80 bytes.
(Applying it again would give you back a file of 64 bytes.)

Let's examine Ultrazip on all files of length *n* bits.

Essentially, Bob's program is a function, *P*, from the set of
files of length *n* to the set of files of length *pn*.

There are 2^{n} files of length *n*.

We'll call this set of files *A*.

There are 2^{pn} files of length *pn*.

We'll
call this second set of files *B*.

Since the size of the set *A* is greater than the size of the set
*B*, the program *P* must map several files in the set
*A* to the same file in the set *B*.

Thus, the program *P* cannot represent a one-to-one function, and
hence, it cannot have an inverse.

Consequently, no reliable *decompression* program can exist for
files compressed with *P*, so Bob's program either does not
exist, or it is useless.

This style of reasoning is an invocation of the pigeon-hole principle.

The pigeon-hole principle observes that if you have *n* pigeons
but fewer than *n* holes, then at least two pigeons must share a hole.

### An example of the case *n* = 3

There are 8 possible files of bit length 3, because 2^{3} = 8:

000 001 010 011 100 101 110 111

Apply a version of the Ultrazip program that gives 66% compression (*p = 2/3*) to these files. Now, all 8 files have length 2.

However, there are only 4 possible
files of bit length 2, because 2^{2} = 4:

00 01 10 11

No matter how you try to map the 3-bit set of files to the 2-bit set
of files, *at least one* 2-bit file will represent more than one 3-bit
file.

Consequently, information is lost, and the "compression" is irreversible.
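This is easy to check mechanically. Here is a short Python sketch; the truncating `compress` below is just one hypothetical stand-in for Ultrazip, but any mapping from 8 files to 4 files fares the same:

```python
from itertools import product
from collections import Counter

# All 3-bit files.
three_bit = ["".join(bits) for bits in product("01", repeat=3)]  # 8 files

# A stand-in compressor: keep only the first 2 bits of each file.
compressed = {f: f[:2] for f in three_bit}

# How many 3-bit files map onto each 2-bit file?
counts = Counter(compressed.values())
print(counts)  # each of the four 2-bit files stands for two 3-bit files
```

Given only a 2-bit output, there is no way to tell which of its two preimages was the original.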

## The default counter-refutation: the *b*-bit lower limit

Many miraculous compression programs claim to escape the mathematics by declining to compress files below a certain size limit.

Suppose that *b* bits is that lower limit.

(1024 is a popular choice for this lower limit for some reason.)

Defeating this claim is easy.

Just try to compress all files of length *b* + 1.

There are 2^{b+1} of these files, but at most
2^{b+1} - 1 files of length *b* or less (even counting the empty file) to map them onto.

By the same pigeon-hole argument as before, at least two of those files, which started out different, must share the same compressed form, and hence at least one of them cannot be decompressed back correctly.
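The counting is quick to verify. Here is a sketch with *b* = 10 (the arithmetic is identical for 1024); note that it generously lets the compressor output *any* file of *b* bits or fewer:

```python
b = 10  # the claimed lower limit; 1024 works the same way

inputs = 2 ** (b + 1)                        # files of exactly b + 1 bits
outputs = sum(2 ** i for i in range(b + 1))  # all files of b bits or fewer,
                                             # including the empty file

print(inputs, outputs)  # 2048 2047
assert inputs > outputs  # pigeon-hole: some pair of inputs must collide
```

Even with every shorter file available as an output, the inputs outnumber them.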

## How does compression work in practice?

Of course, compression seems to work in practice.

First of all, not all compression is lossless; image compression, for example, deliberately discards information.

Secondly,
if the information in question has structure to it, *i.e.*, some files
are more likely to occur than others, an *average*
(but not guaranteed) compression can be achieved.

Take the 3-bit example again. Suppose that in practice, the only documents that Ultrazip will ever have to compress are 000 and 101.

Then Ultrazip can just yield 0 for 000 and 1 for 101--an astounding "33%" compression.

However, if you were to try to compress 001, Ultrazip would have to choose a value of at least two bits, and you would not get a 33% compression rate.
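A minimal sketch of such a structure-exploiting codec (hypothetical; not how any real compressor is built): the common files get short prefix-free codes, and every other file pays a 2-bit penalty so that decoding stays unambiguous:

```python
# Short codes for the files we expect to see almost all the time.
ENCODE = {"000": "0", "101": "10"}

def compress(f):
    # Rare files are escaped with the prefix "11" and stored verbatim.
    return ENCODE.get(f, "11" + f)

def decompress(c):
    if c.startswith("11"):
        return c[2:]
    return {"0": "000", "10": "101"}[c]

for f in ["000", "101", "001"]:
    c = compress(f)
    assert decompress(c) == f
    print(f, "->", c, f"({len(c)} bits)")
```

The expected files shrink to 1 or 2 bits, but the escape case shows the unavoidable trade: 001 balloons from 3 bits to 5.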

Almost every kind of information that humans deal with has a predictable structure to it.

Compression programs take advantage of that structure, and if it's not there, compression programs can't make the file any smaller on average.

In fact, for any real compression program, there will be at least one
file whose "compressed" version is *larger* than its uncompressed
version.

You can test this proposition empirically: try generating random files and running them through your favorite compression utility.
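For example, here is one way to run that experiment with Python's standard `zlib` library. Random bytes have no structure for the compressor to exploit, so the format's own overhead makes the output larger than the input:

```python
import os
import zlib

# 100 KB of random bytes: no structure to exploit.
data = os.urandom(100_000)
compressed = zlib.compress(data, level=9)

print(len(data), len(compressed))
assert len(compressed) > len(data)  # the "compressed" file grew
```

Any real compression utility (gzip, bzip2, zip) behaves the same way on random input.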

## Objections

Even after reading this article, some folks still think that infinite compression is possible.

Every scheme, of course, has a hole.

The *proof* above guarantees that a hole in the scheme exists.

I'll put rebuttals to some of the proposals I've received here.

### What about hashing?

Robin Spark wrote to me with a proposal:

> In response to your article on compression, just wondering if you have come across this idea of compression:
>
> - Compute the SHA hash of a file of any length. This is the compressed version of the file, stored along with the size of the file.
>
> Then reverse engineer the file by systematically building every possible combination of 0's and 1's for files of the original length.
>
> Calculate the SHA for the file, and when the correct file has been found, obtain the number of iterations that were needed. The number of iterations is transmitted as part of the compressed file.
>
> With quantum computers *only* 20 years away, this could become a reality.

So, why doesn't this work?

First, hashes are not unique. For any given hash function, there will be an infinite number of collisions. They may be hard to find, but they must exist.

Consider a hash function with *n*-bit hashes.

Now, consider files of length *m* bits for *m > n*.

There are 2^{m} files of length *m*, but they must share 2^{n} hashes.

Collisions are inevitable, which is why hash functions are also known as one-way functions: the input cannot be recovered from the output.

Second, if the candidate files are enumerated in order, the number of iterations required to reach the right file is exactly the file's contents read as a binary number.

In other words, storing the number of iterations plus the hash would be *larger* than the original file.
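A tiny enumeration (a Python sketch using 4-bit files for brevity) makes this concrete: the index of a file in the enumeration *is* the file, read as a binary number, so storing the iteration count costs as many bits as the file itself.

```python
from itertools import product

m = 4  # toy file length in bits

# Enumerate every m-bit file in order: 0000, 0001, ..., 1111.
files = ["".join(bits) for bits in product("01", repeat=m)]

for f in ["0000", "0101", "1111"]:
    iterations = files.index(f)
    # The iteration count equals the file interpreted as binary.
    assert iterations == int(f, 2)
    print(f, "->", iterations)
```

Nothing is saved: the "count" is just the file under another name.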

Third, while a quantum circuit could compute the hashes of all candidate files in superposition, those results would be entangled with their inputs.

Measuring the circuit would not simply reveal which inputs hashed to the supplied value.

Robin offered a new proposal:

> Then, instead of storing the number of iterations, could you store the first and last megabyte of the file?
>
> When reconstructing the file, you check the SHA and the start and end of the file.
>
> Of course you need to ensure that the SHA hash being calculated was unique under these circumstances. If SHA couldn't cut it, maybe another algorithm, or combination of algorithms, could?
>
> I.e., you could calculate the SHA of the file, and then calculate the SHA of the reversed file--providing 2 hashes that could be used together to identify the correct file.

This does not (indeed, must not) work either.

The principal reason is the same: a hash of a finite number of bits cannot guarantee uniqueness across all files longer than that number of bits.

Suppose you have files of length *m* bits.

Store *k* bits from this file (anywhere in the file) with the hash.

How many possible files of length *m* share these same *k* bits?

2^{m-k} files.

Once again, however, you have only 2^{n} hashes available, which means collisions are still inevitable.

Some files of length *m* must compress to the same value, even with *k* bits stored.

Therefore, the compression is not guaranteed to be reversible.
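The counting can again be checked by brute force. In this sketch, `toy_hash` is a deliberately simple stand-in for SHA; any function into *n* bits is subject to the same count:

```python
from itertools import product
from collections import defaultdict

m, k, n = 8, 3, 2  # 8-bit files, k stored bits, n-bit toy hash
prefix = "101"     # the k bits stored alongside the hash

def toy_hash(f):
    # Stand-in hash: the file's value modulo 2**n.
    return int(f, 2) % (2 ** n)

# All 2**(m-k) files that share the stored bits.
candidates = [prefix + "".join(t) for t in product("01", repeat=m - k)]

buckets = defaultdict(list)
for f in candidates:
    buckets[toy_hash(f)].append(f)

# 32 candidates, only 4 hash values: collisions are forced.
worst = max(buckets.values(), key=len)
print(len(candidates), len(buckets), len(worst))  # 32 4 8
assert len(worst) > 1
```

Different files that agree on both the stored bits and the hash remain indistinguishable, no matter how good the hash is.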

Even if you use multiple hash functions, one with *i*-bit hashes and the other with *j*-bit hashes,
they have no more effect than a single hash function with *i+j*-bit hashes.