Impossibly good compression

Every so often, a company or a rogue entrepreneur claims to have invented a "perfect" compression algorithm--an algorithm that can supposedly always reduce the size of a file. If such a magic algorithm actually existed, then it could be applied repeatedly to any file to make it smaller and smaller. You would be able to download a DVD over a dial-up connection in a second! Perhaps it's the allure of such fantastic benefits that keeps a steady stream of suckers investing in these doomed ideas. (Companies claiming to have invented unbreakable encryption and perpetual motion machines show up at about the same rate.) It's fairly easy to show that an algorithm that can always compress a file leads to a logical contradiction: that is, such an algorithm cannot possibly exist.

A proof via the pigeon-hole principle

Suppose Bob tells you that his program, Ultrazip, can compress any file to at most some fraction p of its original size. For example, if Bob says he can give you guaranteed compression of 80% (p = 0.8) and you give Ultrazip a 100 byte file, you will always get back a file of 80 bytes. (Applying it again would give you back a file of 64 bytes.)

Let's examine Ultrazip on all files of length n bits. Essentially, Bob's program is a function, P, from the set of files of length n to the set of files of length pn. There are 2n files of length n. We'll call this set of files A. There are 2pn files of length pn. We'll call this second set of files B. Since the size of the set A is greater than the size of the set B, the program P must map several files in the set A to the same file in the set B.

This style of reasoning is called the pigeon-hole principle. The pigeon-hole principle observes that if you have n pigeons, but less than n holes, then at least two pigeons will be sharing a hole.

Thus, the program P cannot represent a one-to-one function, and hence, it cannot have an inverse. Consequently, no reliable decompression program can exist for files compressed with P, so Bob's program either does not exist, or it is useless.

An example of the case n = 3

There are 8 possible files of bit length 3, because 23 = 8:

000
001
010
011
100
101
110
111

Apply the Ultrazip program to these files that gives 66% compression. Now, all 8 files have length 2. However, there are only 4 possible files of bit length 2, because 22 = 4:

00
01
10
11

No matter how you try to map the 3-bit set of files to the 2-bit set of files, at least one 2-bit file will represent more than one 3-bit file.

Consequently, information is lost, and the "compression" is irreversible.

The default counter-refutation: The b bits lower limit

A lot of miraculous compression programs claim that they beat the mathematical theory by not being able to compress files below a certain limit. Suppose that b bits is that lower limit. (1024 is a popular choice for this lower limit for some reason.) Defeating this claim is easy. Just try to compress all files of length b + 1. There are 2b+1 of these files, but only 2b files to map them onto. By the same pigeon-hole argument as before, at least two of those original files, which started out differently, will have the same compressed form, and hence, they'll be undecompressible.

How does compression work in practice?

Of course, compression seems to work in practice. First of all, not all compression is lossless, such as image compression. Secondly, if the information in question has structure to it, i.e., some files are more likely to be compressed than others, an average (but not guaranteed) compression can be computed. Take the 3-bit example again. Suppose that in practice, the only documents that Ultrazip will ever have to compress are 000 and 101. Then Ultrazip can just yield 0 for 000 and 1 for 101--an astounding "33%" compression. However, if you were to try to compress, 001, Ultrazip would have to choose at least a two-bit value, and you would not get a 33% compression rate.

Most every kind of information that humans deal with has a predictable structure to it. Compression programs take advantage of that structure, and if it's not there, compression programs can't make the file any smaller on average. In fact, for any real compression program, there will be at least one file whose "compressed" version is larger than its uncompressed version.

You can test this proposition empirically: try generating random files and running them through your favorite compression utility.