Chapter 10. Switching Gears

Until now, this book has focused mainly on specific data compression algorithms and how they generally work. While that’s all highly informative, unless you’re trying to write your own breakthrough data compressor, it’s primarily useful as a foundation for understanding and compressing your data. So we’d like to switch gears and talk about the pragmatic side of data compression, and how it relates to you, the projects you develop, and the world at large.

There are two types of compression out there right now: media-specific and general-purpose. Let’s look at each of them.

Media-Specific Compression

Media-specific compressors are designed specifically for media data such as images, audio, video, and the like. Most likely, these types of files and compressors make up the majority of content your applications send, receive, manipulate, store, and display to users. The old saying, “A picture is worth a thousand words,” is quite literally true when it comes to data compression: a 1024 × 1024 RGB image is 3 MB of data. If you assume ASCII-encoded letters, you could display 3,145,728 letters in that same space. To put that into context, the famous book The Hobbit is made up of 95,022 words. If you assume an average word size of 5 letters, that’s roughly 475,110 characters. You could fit that book more than 6 times into a single 1024 × 1024 image.
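If you want to check the math yourself, here’s the back-of-the-envelope arithmetic in Python (the word count and the five-letter average are the assumptions from the paragraph above):

    # 1024 x 1024 pixels, 3 bytes (R, G, B) per pixel
    image_bytes = 1024 * 1024 * 3       # 3,145,728 bytes

    hobbit_words = 95_022               # word count cited above
    hobbit_chars = hobbit_words * 5     # ~475,110 ASCII characters

    print(image_bytes / hobbit_chars)   # ~6.6 copies of the book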

This is why most media compressors employ lossy compression algorithms. Lossy compression algorithms are data transforms that reduce the quality of the media in an attempt to make the content more compressible. For example, a 1024 × 1024 image, using 8 bits each for the red, green, and blue channels, comes to 24 bits per pixel, and hence 3 MB. However, if you used only 4 bits per channel, you’d end up with 12 bits per pixel, bringing the total footprint to 1.5 MB, while also reducing the color quality of the image.1
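Here’s a minimal sketch of that kind of quantization in Python, assuming the image is stored as a flat list of 8-bit channel values; dropping the low 4 bits of each channel is the lossy step, and it can’t be undone:

    def quantize_channels(channels, bits=4):
        # Lossy step: keep only the top `bits` bits of each 8-bit channel value.
        shift = 8 - bits
        return [value >> shift for value in channels]

    def dequantize_channels(channels, bits=4):
        # Scale back up for display; the discarded low bits are gone for good.
        shift = 8 - bits
        return [value << shift for value in channels]

    pixel = [200, 33, 17]  # one RGB pixel, 8 bits per channel
    print(dequantize_channels(quantize_channels(pixel)))  # [192, 32, 16]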

There is an unending horde of lossy data transforms out there, each one specialized for a specific media type (what works on images won’t work nearly as well on audio) and content type (grayscale images can use a different compressor than full-color images). Apart from the lossy step itself, media-specific compressors pretty much follow what we have discussed so far. After the content is transformed into a more compressible state, you can apply all the standard transforms, such as LZ, BWT, RLE, delta compression, and even Huffman/arithmetic/ANS. The trick is (again) finding the right transforms for the right type of data to produce the best results.
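As a toy illustration of such chaining, here’s a hypothetical two-stage pipeline in Python: delta encoding to expose runs of similar values, followed by a simple run-length encoder (the pixel row is made up for the example):

    import itertools

    def delta_encode(values):
        # Replace each value with its difference from the previous one.
        return [values[0]] + [b - a for a, b in zip(values, values[1:])]

    def run_length_encode(values):
        # Collapse runs of repeated values into (value, count) pairs.
        return [(v, len(list(g))) for v, g in itertools.groupby(values)]

    row = [10, 10, 10, 11, 12, 12, 12, 12]  # a made-up row of pixel values
    print(run_length_encode(delta_encode(row)))
    # [(10, 1), (0, 2), (1, 2), (0, 3)]

In practice, a statistical coder such as Huffman or arithmetic coding would typically run as a final stage over output like this.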

We haven’t spent much time talking about lossy transforms in this book. That’s intentional. There are so many of these transforms, and each one is so content-specific, that you’d really need a book per media type.2 But don’t worry, Chapter 12 covers the most important aspects of optimizing your image compression, without going too deep into the details.

General-Purpose Compression

General-purpose compressors, on the other hand, are built for everything else. These are algorithms like DEFLATE, GZIP, BZIP2, LZMA, and PAQ, which combine various lossless transforms to produce savings for nonmedia files like text, source code, serialized data, and other binary content that won’t tolerate lossy data compression. There’s a healthy amount of research in this area. Stopping by the Large Text Compression Benchmark shows a gaggle of general-purpose compressors that have all been tasked with compressing huge text files, to measure how they stack up against one another. And new algorithms continue to be developed. Google’s stabs at improving the GZIP algorithm produced a family of compressors called Snappy, Zopfli, Gipfeli, and Brotli,3 with each one focusing on better compression, lower memory requirements, or faster decompression.
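Conveniently, Python’s standard library ships three of these families, so you can get a feel for the trade-offs yourself. This sketch uses a small, repetitive stand-in corpus; real benchmarks like the one above use gigabyte-scale text:

    import bz2
    import lzma
    import zlib

    # A stand-in corpus: repetitive text compresses well under all three.
    data = b"the quick brown fox jumps over the lazy dog " * 10_000

    for name, compress in [("DEFLATE (zlib)", zlib.compress),
                           ("BZIP2", bz2.compress),
                           ("LZMA", lzma.compress)]:
        print(f"{name}: {len(data):,} -> {len(compress(data)):,} bytes")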

Most of the Internet content you download each day has been compressed with one of these algorithms. The standard HTTP stack allows for data packets to be encoded with the GZIP, BZIP2, and now Brotli compressors (as long as the server and client support it), which means webpages, JavaScript files, tweets, and store listings are most likely showing up on your device after being decompressed.
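That negotiation happens through a pair of HTTP headers: the client advertises the codings it can decode, and the server picks one (“br” is the registered token for Brotli). A typical exchange looks something like this:

    GET /index.html HTTP/1.1
    Host: example.com
    Accept-Encoding: gzip, br

    HTTP/1.1 200 OK
    Content-Type: text/html
    Content-Encoding: br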

It’s worth pointing out that many in the compression community, including the authors of this book, believe that these algorithms are caught in a race to a point of diminishing returns. Looking at the most recent successful encoders (Brotli, LZHAM, LZFSE, ZSTD), they all show a similar trend: minor variations on a theme. New tricks and modifications to existing transforms are applied to existing algorithm types in order to get small savings in compression, and they require ever more resources to produce those marginal savings. Looking at various benchmarks, we’re not seeing breakthroughs of 30%–50% savings; rather, lots of sweat and effort is being expended to get 2%–10% improvements over existing algorithms.

Compression in Practice

We hope that, armed with all the awesome knowledge in this book, applying it to the development of your applications will be pretty straightforward. But there are a few things we specifically want to cover that can help you address the lowest-hanging fruit in your application development. The next few chapters will focus on how to evaluate various types of data compression options, and then give you some tips on images and serialized data, followed by some bigger-picture thinking about the importance of data compression over the next decade.

Fun stuff!

1 We are basically reducing the number of possible colors from about 16 million to 4,096. The human eye can distinguish more than a million colors. For more on this, see the Wikipedia entry “Trichromacy”.

2 Not convinced? Go read up on how the JPG format works, or brush up on the new WebP format.

3 Basically, a Swiss bakery shop of algorithms.