Impossibly good compression


Every so often, a company claims to have invented a "perfect" compression algorithm--an algorithm that can always reduce the size of a file.

If such a magic algorithm actually existed, then it could be applied repeatedly to any file to make it smaller and smaller.

Download times would drop to nothing.

Perhaps it's the allure of such fantastic benefits that keeps a steady stream of suckers investing in these doomed ideas.

(Companies claiming to have invented unbreakable encryption and perpetual motion machines show up at about the same rate.)

It's easy to show that an algorithm that can always compress a file leads to a logical contradiction: that is, such an algorithm cannot possibly exist.

I am happy to testify as an expert witness in court cases where there is a claim of infinite or guaranteed compression.

A proof via the pigeon-hole principle

Suppose Bob tells you that his program, Ultrazip, can compress any file to at most some fraction p of its original size.

For example, if Bob says he can give you guaranteed compression to 80% of the original size (p = 0.8) and you give Ultrazip a 100-byte file, you will always get back a file of 80 bytes. (Applying it again would give you back a file of 64 bytes.)

Let's examine Ultrazip on all files of length n bits.

Essentially, Bob's program is a function, P, from the set of files of length n to the set of files of length pn.

There are 2^n files of length n.

We'll call this set of files A.

There are 2^(pn) files of length pn.

We'll call this second set of files B.

Since the size of the set A is greater than the size of the set B, the program P must map several files in the set A to the same file in the set B.

Thus, the program P cannot represent a one-to-one function, and hence, it cannot have an inverse.

Consequently, no reliable decompression program can exist for files compressed with P, so Bob's program either does not exist, or it is useless.

This style of reasoning is an invocation of the pigeon-hole principle.

The pigeon-hole principle observes that if you have n pigeons, but fewer than n holes, then at least two pigeons will be sharing a hole.

An example of the case n = 3

There are 8 possible files of bit length 3, because 2^3 = 8:

000
001
010
011
100
101
110
111

Apply a version of Ultrazip that gives 66% compression to these files. Now, all 8 files have length 2.

However, there are only 4 possible files of bit length 2, because 2^2 = 4:

00
01
10
11

No matter how you try to map the 3-bit set of files to the 2-bit set of files, at least one 2-bit file will represent more than one 3-bit file.

Consequently, information is lost, and the "compression" is irreversible.
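
If you would rather check this exhaustively than take the counting on faith, here is a short Python sketch (mine, not part of the original argument) that tries every possible mapping from 3-bit files to 2-bit files and confirms that none of them is reversible:

    from itertools import product

    # All 3-bit "files" and all 2-bit "files".
    inputs  = [''.join(bits) for bits in product('01', repeat=3)]   # 8 files
    outputs = [''.join(bits) for bits in product('01', repeat=2)]   # 4 files

    # Enumerate every possible "compressor": every function from the 8 inputs
    # to the 4 outputs (4**8 = 65,536 candidates).
    collision_free = 0
    for assignment in product(outputs, repeat=len(inputs)):
        if len(set(assignment)) == len(inputs):   # one-to-one, i.e. reversible
            collision_free += 1

    print(collision_free)   # 0: every candidate compressor sends two files to one output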

The default counter-refutation: the b-bit lower limit

A lot of miraculous compression programs claim to beat the mathematics by refusing to compress files below a certain size limit.

Suppose that b bits is that lower limit.

(1024 is a popular choice for this lower limit for some reason.)

Defeating this claim is easy.

Just try to compress all files of length b + 1.

There are 2^(b+1) of these files, but only 2^b files to map them onto.

By the same pigeon-hole argument as before, at least two of those original files, which started out different, will have the same compressed form, and hence they cannot both be decompressed correctly.
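
The counting is easy to check mechanically. Here is a sketch (mine, with b = 10 chosen arbitrarily) that even grants the compressor every output of length b or shorter, a more generous count than the 2^b above, and still comes up short:

    b = 10

    files_to_compress = 2 ** (b + 1)                       # every file of length b + 1
    possible_outputs  = sum(2 ** i for i in range(b + 1))  # every file of length 0 through b

    print(files_to_compress)   # 2048
    print(possible_outputs)    # 2047 -- one short, so two inputs must share an output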

How does compression work in practice?

Of course, compression seems to work in practice.

First of all, not all compression is lossless; image compression, for example, is usually lossy.

Secondly, if the information in question has structure to it, i.e., some files are more likely to occur than others, then average (but not guaranteed) compression is possible.

Take the 3-bit example again. Suppose that in practice, the only documents that Ultrazip will ever have to compress are 000 and 101.

Then Ultrazip can just yield 0 for 000 and 1 for 101--an astounding "33%" compression.

However, if you were to try to compress 001, Ultrazip would have to choose at least a two-bit value, and you would not get a 33% compression rate.

Almost every kind of information that humans deal with has a predictable structure to it.

Compression programs take advantage of that structure, and if it's not there, they can't make the file any smaller on average.
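
Here is a minimal sketch of that idea in Python (my own toy code, not Ultrazip): give the two common files short codewords and let the rare ones grow, so files shrink on average but not always.

    # A toy prefix code for 3-bit files, assuming (hypothetically) that 000 and 101
    # are by far the most common inputs.
    encode = {'000': '0', '101': '10'}
    # Every other 3-bit file costs 2 + 3 = 5 bits: a '11' marker plus the raw bits.
    for i in range(8):
        f = format(i, '03b')
        encode.setdefault(f, '11' + f)

    decode = {code: f for f, code in encode.items()}
    assert len(decode) == len(encode)   # all codewords distinct (and, by construction, prefix-free)

    # Expected length under a hypothetical skewed distribution: 000 and 101 are
    # 45% likely each, and the remaining six files split the last 10%.
    probs = {f: (0.45 if f in ('000', '101') else 0.10 / 6) for f in encode}
    avg_bits = sum(p * len(encode[f]) for f, p in probs.items())
    print(avg_bits)   # about 1.85 bits on average, versus 3 bits uncompressed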

In fact, for any real compression program, there will be at least one file whose "compressed" version is larger than its uncompressed version.

You can test this proposition empirically: try generating random files and running them through your favorite compression utility.
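
For instance, a quick way to run that experiment is a few lines of Python with the standard zlib module (a sketch; any general-purpose compressor shows the same effect):

    import os, zlib

    # Compress a handful of random files and compare sizes.
    for trial in range(5):
        data = os.urandom(100_000)            # 100 KB of random bytes
        compressed = zlib.compress(data, 9)   # maximum compression effort
        print(len(data), '->', len(compressed))

    # Typical result: the "compressed" files come out slightly larger,
    # because random data has no structure to exploit.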


Objections

Even after reading this article, some folks still think that infinite compression is possible.

Every scheme, of course, has a hole.

The proof above guarantees that a hole in the scheme exists.

I'll put rebuttals to some of the proposals I've received here.

What about hashing?

Robin Spark wrote to me with a proposal:

In response to your article on compression, just wondering if you have come across this idea of compression:
- compute the SHA hash of a file of any length. This is the compressed version of the file, stored along with the size of the file.
Then reverse engineer the file by systematically building every possible combination of 0's and 1's for files of the original length.
Calculate the SHA for the file, and when the correct file has been found, obtain the number of iterations that were needed. The number of iterations is transmitted as part of the compressed file.
With quantum computers *only* 20 years away, this could become a reality.

So, why doesn't this work?

First, hashes are not unique. For any given hash function, there will be an infinite number of collisions. They may be hard to find, but they must exist.

Consider a hash function with n-bit hashes.

Now, consider files of length m bits for m > n.

There are 2^m files of length m, but they must share 2^n hashes.

Collisions are inevitable, which is why hash functions are also known as irreversible functions.
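
You do not have to take the existence of collisions on faith. Here is a sketch (mine, using Python's hashlib) that truncates SHA-256 to 16 bits, where the same pigeon-hole counting forces a collision almost immediately:

    import hashlib

    # Truncating SHA-256 to 16 bits leaves only 2**16 = 65,536 possible hashes,
    # so distinct inputs are forced to collide long before we run out of inputs.
    seen = {}
    for i in range(100_000):
        msg = str(i).encode()
        h = hashlib.sha256(msg).digest()[:2]   # keep only 16 bits of the hash
        if h in seen:
            print('collision:', seen[h], 'and', msg, 'share hash', h.hex())
            break
        seen[h] = msg

A full-width SHA-256 collision is the same phenomenon; it is just astronomically harder to exhibit.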

Second, the number of iterations required to reach the right file is exactly equal to the file's contents, read as a binary number.

In other words, storing the number of iterations plus the hash would be larger than the original file.
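
To see why, here is a small sketch (my own illustration, with a deliberately tiny file): enumerating candidates in order means the iteration count is simply the file's value as a binary number, which takes as many bits to write down as the file itself.

    import os

    m = 16                                    # keep m small so the loop finishes quickly
    original = os.urandom(m // 8)             # the "file" we want to reconstruct

    # Enumerate every m-bit file in order until we find the original.
    iterations = 0
    for candidate in range(2 ** m):
        if candidate.to_bytes(m // 8, 'big') == original:
            break
        iterations += 1

    # The iteration count is just the file's value read as a binary number,
    # so storing it takes (up to) as many bits as the file itself.
    print(iterations == int.from_bytes(original, 'big'))   # True
    print(iterations.bit_length(), 'bits needed, out of', m)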

Third, while a quantum circuit could compute the hashes of all candidate files simultaneously, those hashes would be entangled with one another.

Measuring the result would not reveal which inputs hashed to the supplied value.

Robin offered a new proposal:

Then, instead of storing the number of iterations, could you store the first and last Megabyte of the file?
When reconstructing the file, you check the SHA and the start and end of the file.
Of course you need to ensure that the SHA hash being calculated was unique under these circumstances. If SHA couldn't cut it, maybe another algorithm, or combination of algorithms could?
I.e. You could calculate the SHA of the file, and then calculate the SHA of the reversed file - providing 2 hashes that could be used together to identify the correct file.

This does not (indeed, must not) work either.

The principal reason is the same: any hash into a finite number of bits cannot guarantee uniqueness for all files greater than that number of bits.

Suppose you have files of length m bits.

Store k bits from this file (anywhere in the file) with the hash.

How many possible files of length m share these same k bits?

2^(m-k) files.

Once again, however, you have only 2^n hashes available, which means collisions are still inevitable.

Some files of length m must compress to the same value, even with k bits stored.

Therefore, the compression cannot be guaranteed to be reversible.

Even if you use two hash functions, one with i-bit hashes and the other with j-bit hashes, they have no more effect than a single hash function with (i + j)-bit hashes.
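
For concreteness, here is a sketch of the counting with made-up sizes (a 1 GB file, the first and last megabyte stored, and two SHA-256 hashes); the candidate files still dwarf the available hash values:

    m = 8 * 10**9          # a 1 GB file, measured in bits (sizes chosen for illustration)
    k = 2 * 8 * 2**20      # the stored first and last megabyte, in bits
    n = 256 + 256          # two SHA-256 hashes combined: an (i + j)-bit hash

    # There are 2**(m - k) files agreeing on those k bits, but only 2**n hash values.
    # Compare exponents rather than the astronomically large numbers themselves.
    print(m - k > n)                                   # True: collisions are unavoidable
    print((m - k) - n, 'doublings more candidate files than hash values')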