Because real-world files usually are quite redundant, compression can often reduce the file sizes considerably. This in turn reduces the needed storage size and transfer channel capacity. Especially in systems where memory is at premium compression can make the difference between impossible and implementable. Commodore 64 and its relatives are good examples of this kind of a system.
The most used 5.25-inch disk drive for Commodore 64 only holds 170 kB of data, which is only about 2.5 times the total random access memory of the machine. With compression, many more programs can fit on a disk. This is especially true for programs containing flashy graphics or sampled sound. Compression also reduces the loading times from the notoriously slow 1541 drive, whether you use the original slow serial bus routines or some kind of a disk turbo loader routine.
Dozens of compression programs are available for Commodore 64. I leave the work to chronicle the history of the C64 compression programs to others and concentrate on a general overview of the different compression algorithms. Later we’ll take a closer look on how the compression algorithms actually work and in the next article I will introduce my own creation: pucrunch.
Pucrunch is a compression program written in ANSI-C which generates files that automatically decompress and execute themselves when run on a C64 (or C128 in C64-mode, VIC20, C16/+4). A cross-compressor, if you will, allowing you to do the real work on any machine, like a cross-assembler.
Our target environment (Commodore 64 and VIC20) restricts us somewhat when designing the ‘ideal’ compression system. We would like it to be able to decompress as big a program as possible. Therefore the decompression code must be located in low memory and be as short as possible and must use very small amounts of extra memory.
Another requirement is that the decompression should be relatively fast, which means that the arithmetic used should be mostly 8- or 9-bit which is much faster than e.g. 16-bit arithmetic. Processor- and memory-intensive algorithms are pretty much out of the question. A part of the decompressor efficiency depends on the format of the compressed data. Byte-aligned codes can be accessed very quickly; non-byte-aligned codes are much slower to handle, but provide better compression.
This is not meant to be the end-all document for data compression. My intention is to only scratch the surface and give you a crude overview. Also, I’m mainly talking about lossless compression here, although some lossy compression ideas are briefly mentioned. A lot of compression talk is available in the world wide web, although it may not be possible to understand everything on the first reading. To build the knowledge, you have to read many documents and understand something from each one so that when you return to a document, you can understand more than the previous time. It’s a lot like watching Babylon 5. 🙂
Some words of warning: I try to give something interesting to read to both advanced and not so advanced readers. It is perfectly all right for you to skip all uninteresting details. I start with a Huffman and LZ77 example so you can get the basic idea before flooding you with equations, complications, and trivia.
Huffman and LZ77 Example
Let’s say I had some simple language like “Chippish” containing only the letters CDISV. How would a string like
compress using a) Huffman encoding, and b) LZ77? How do compression concepts such as information entropy enter into this?
A direct binary code would map the different symbols to consequtive bit patterns, such as:
Because there are five symbols, we need 3 bits to represent all of the possibilities, but we also don’t use all the possibilities. Only 5 values are used out of the maximum 8 that can be represented in 3 bits. With this code the original message takes 48 bits:
011 010 001 100 010 000 010 010 010 011 010 001 010 001 100 010
For Huffman and for entropy calculation (entropy is explained in the next chapter) we first need to calculate the symbol frequencies from the message. The probability …