Chapter 11. Evaluating Compression

Before bounding off into the weeds and jamming compression into every part of your fancy application, it’s important to consider all of the trade-offs and use cases involved. Not every algorithm is suited for every use case, and in some instances, a different implementation of the same compression format might better match your needs.

So, when it comes to data compression, what matters?

Compression Usage Scenarios

Let’s begin this discussion by setting the stage on where data is compressed, stored, and decompressed. This is critical for understanding where the data is coming from and where it’s going, because of the important interplay between encoder and decoder, which we’ll talk about more in a bit. First, let’s look at four common scenarios.

Compressed Offline, Decompressed On-Client

In this first scenario, the data is compressed somewhere unrelated to the client, and then distributed to the client, where it’s decompressed for use.

This scenario is most common for things like packaged applications or video games, and the resources often include copious amounts of images, videos, and music. Another use case is artists creating and sharing their work. In both cases, the original art is created by using high-resolution, high-fidelity tools, and then exported and compressed for distribution.

Compression aims to optimize for the smallest possible media files.

The trade-off is one of quality.

Compressed On-Client, Decompressed In-Cloud

Most modern social media applications generate a lot of content on the client and then push it up to the cloud for processing and distribution to other users. In these situations, some mild compression is usually done on the client to reduce the overhead of the outbound communication; for example, taking a packet of social data, serializing it with a binary serialization format, and then gzip-compressing it before sending it up to the server.
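To make that concrete, here is a minimal Python sketch of the idea: a hypothetical status-update record is packed into a fixed binary layout with the standard struct module and then gzip-compressed at a mild level before upload. The field names and layout are invented for illustration, not taken from any real protocol.

    import gzip
    import struct

    def pack_status_update(user_id: int, timestamp: int, latitude: float,
                           longitude: float, message: str) -> bytes:
        """Pack one hypothetical status update into a compact binary record."""
        text = message.encode("utf-8")
        # Fixed-width header (id, timestamp, lat/long, text length), then the text.
        header = struct.pack("<QQddH", user_id, timestamp, latitude, longitude, len(text))
        return header + text

    # Serialize a batch of updates, then gzip the whole batch before upload.
    records = b"".join(
        pack_status_update(42, 1690000000 + i, 37.7749, -122.4194, "ate breakfast")
        for i in range(100)
    )
    payload = gzip.compress(records, compresslevel=6)  # mild, battery-friendly level
    print(len(records), "->", len(payload), "bytes")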

The primary goal here is to reduce cost for the user. Although many in North America (still) have the benefit of being on an unlimited data plan,1 much of the rest of the world doesn’t enjoy this luxury, so most clients are on pay-for-transfer plans. This means that every bit they transfer to the server is paid for by them.

The trade-off is that on mobile, you spend battery resources compressing this data.

Compressed In-Cloud, Decompressed On-Client

Data that is compressed in the cloud falls into two primary buckets, which have entirely different characteristics.

Dynamic data that is generated by the cloud resource

On the flip side of users sending data to the cloud is data that originates in the cloud and is downloaded by the user.

For example, when a client requests the results of some database operation, or the server sends dynamic layout data, the client is waiting for content to be generated. The time it takes for the server to generate and compress that data adds directly to the latency the client experiences.

What’s essential here is balancing size and compression time. It is worth pointing out that in some high-latency environments, users might be willing to wait the extra time in return for a smaller payload.

Compression aims to minimize what’s being sent over the network.

The trade-off is one of time.

Large data that’s passed off to the cloud for efficient computing

In this scenario, the emphasis is usually on minimizing the number of bits for the medium at hand. For example, imagine having two gigabytes of PNG files that need to be converted to WebP images at 10 different resolutions, or 1,200 hours of video that must be converted to H.264 before being displayed.

Remember that every bit sent from the cloud must be paid for by the cloud owner, and in effect, each bit that the client consumes from the cloud also has to be paid for by the client. Using cloud-compute resources is ideal in situations in which you want to minimize those bits in the most compute-effective way possible.

The goal is to efficiently compress large amounts of data into the fewest possible bits.

The trade-off is one of cost versus efficiency (that is, the price of compute resources).

Compressed On-Client, Decompressed On-Client

Finally, there are many client applications that need to communicate with one another. They might send peer-to-peer network packets, or photos, or GPS information.

In these cases, the client generates the data, compresses it, and then sends it to the other client to decompress.

What’s difficult about these situations is that the client machine, often a mobile device, typically doesn’t have the massive amount of resources required to optimally transform and compress data. However, these devices usually have specialized graphics hardware that can be used to compress formats like JPG and H.264, so images and video end up as a win. For other types of data, this means that the quality of the compression will be lower (due to less time available to optimize), and that decompression can be slower as well (due to less power on client devices).

As such, algorithms that communicate client-to-client often settle on fixed-bit compression, such as manually packing serialized structures rather than compressing them.
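As a rough illustration of that fixed-bit approach, the following Python sketch packs a hypothetical per-frame position packet into a fixed-width layout with the struct module instead of running a general-purpose compressor on it. The packet fields and quantization choices are assumptions made for the example.

    import struct

    # A hypothetical per-frame P2P packet: player id, x/y/z position, heading.
    # Instead of compressing every frame, the fields are packed into a
    # fixed-width binary layout that both clients agree on.
    PACKET_FORMAT = "<HhhhB"   # uint16 id, three int16 positions, uint8 heading

    def pack_packet(player_id, x, y, z, heading):
        # Positions quantized to int16; heading quantized to 256 steps.
        return struct.pack(PACKET_FORMAT, player_id, x, y, z, heading)

    def unpack_packet(data):
        return struct.unpack(PACKET_FORMAT, data)

    packet = pack_packet(7, 1200, -340, 55, 128)
    print(len(packet), unpack_packet(packet))   # 9 bytes per update

Both sides must agree on the format string ahead of time, but each update is a fixed 9 bytes with no per-packet compression cost on either device.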

There isn’t a trade-off as such; instead, there’s a balancing act between the capabilities of the device, the time it takes to compress and decompress, and the immediacy of the need for the data.

Compression Need

As we’ve already hinted while exploring the different algorithms, it’s exceptionally important to understand that not all compression algorithms and formats apply to all data types. For example, applying Huffman to image data won’t produce the level of savings that applying a lossy image2 compression algorithm will.

As a developer, matching the right algorithm to the right data type is critical for maximizing the compression results you want, with the trade-offs you’ll need to make. And there is no silver bullet. It comes down to this:

  • Know your data—not just what type of data, but also its internal structure, and in particular, how it’s used.

  • Know your algorithm options so that you can choose from the right family of compressors.3

  • Most important, know what you need for the given circumstance, because you might find surprising savings there.

What does this mean in practice?

For instance, thumbnail images may be able to tolerate a lower quality level than, say, full-screen versions. As such, they could be compressed with a lossy JPEG encoder, whereas the higher-quality version should be encoded by using a lossless WebP codec.
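For instance, using the Pillow imaging library in Python, the two variants might be produced along these lines; the file names and the quality setting are placeholders you would tune for your own assets.

    from PIL import Image  # Pillow; file names below are placeholders

    original = Image.open("photo_fullres.png")

    # Thumbnail: shown small, so an aggressive lossy JPEG is usually acceptable.
    thumb = original.convert("RGB")      # JPEG has no alpha channel
    thumb.thumbnail((256, 256))
    thumb.save("photo_thumb.jpg", format="JPEG", quality=60)

    # Full-screen version: preserve fidelity with lossless WebP.
    original.save("photo_full.webp", format="WEBP", lossless=True)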

Compression Ratio

Compression ratio, that is, the amount by which the content is compressed, is often the most important factor when evaluating compression options. The primary goal of compression is to produce the most compact form of the data, because the smallest number of bits on the wire will always win.

Except...there are always exceptions. You might be willing to settle for less compression savings when performance or memory take priority. Here is a perfect example: ZPAQ will generally produce the smallest compressed files for 1 GB of text data. But it also generally requires around 2 GB of memory and 3 hours to compress that data on a desktop machine, and similar resources for decompression. So yes, ZPAQ is great when it comes to file size, but it’s not very applicable to compressing data on a mobile device.
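Measuring compression ratio on a representative sample of your own data is straightforward. Here is a minimal Python sketch using the standard zlib, bz2, and lzma modules; corpus.txt stands in for whatever data you actually ship.

    import bz2
    import lzma
    import zlib

    data = open("corpus.txt", "rb").read()   # any representative sample of your data

    for name, compress in (("zlib", zlib.compress),
                           ("bz2", bz2.compress),
                           ("lzma", lzma.compress)):
        out = compress(data)
        print(f"{name:5s} {len(out):>10d} bytes  ratio {len(data) / len(out):.2f}:1")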

Now, for services that can do compression offline, or in the cloud, compression ratio is one of the most important considerations. You do have the extra resources and time for compressing your data to its smallest form, and there is cold, hard cash involved in sending around the bits.

Users Get the Short End

It’s worth pointing out that users always get the short end of the stick in this equation. Most of the world isn’t on “unlimited data” plans, and the cost for a user to bring down 1 GB of data and the cost for a company to serve 1 GB of data are monumentally different.

If you want to build an app that keeps users happy, try to shoulder the cost of transfer on behalf of your users.

Compression Performance

Compression performance is all about how long it takes to get data into a compressed form. It is critically important in any latency-sensitive environment, whether the client is responsible for compression or the server is compressing data for which the client is waiting.

There are generally two evaluation metrics to care about in this regard: CPU speed and memory. The CPU speed of the encoding system is important because it determines how fast the data can be compressed. And the amount of available memory matters whenever it is limited, as is typically the case on mobile devices.

For example, the LZMA algorithm achieves really impressive compression results, but at the cost of large memory footprints. This makes the algorithm less than attractive to use on mobile devices, on which there might be only 256 MB of RAM available.
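With Python’s lzma module, for example, you can trade ratio for decoder memory by capping the dictionary size, which is roughly what the decoder must hold in RAM. The sample file and the 1 MB cap below are arbitrary choices for illustration.

    import lzma

    data = open("corpus.txt", "rb").read()   # placeholder sample

    # Preset 9 implies a 64 MB dictionary; the decoder needs roughly that much RAM.
    big_dict = lzma.compress(data, preset=9)

    # Capping the dictionary keeps decoder memory small at some cost in ratio,
    # which may be the better trade on a memory-constrained mobile device.
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": 1 << 20}]  # 1 MB
    small_dict = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)

    print(len(big_dict), len(small_dict))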

Most client technologies (mobile devices, anyway) have built-in support for hardware codecs (coder-decoders), at least for some lossy data types. Formats like JPG and H.264 are easily offloaded to hardware, so client-side compression here is easier than on servers, which typically don’t have that specialized hardware.

On the lossless side, we’ve even seen a few GZIP chips out there. Because these are dedicated hardware components, they tend to be lightning fast compared to their software counterparts, and you should take advantage of them any chance you get. Remember that client-side compression has much more limited resources, and if you can shift your strategy to optimize for this performance, it usually results in a net win. This is especially true for peer-to-peer (P2P) networking, where serialized packet and position data updated every frame adds up to a load of data; utilizing GZIP hardware there can result in large wins.

Decompression Performance

For any performance-sensitive environment, decompression speed is the metric that rules all things. In modern application development, decompression is usually done on a client device that is quite underpowered compared to its server-side counterparts. The compression algorithms that produce the smallest sizes often come with the penalty of the longest decompression times, which can make them unusable when the data is sent to mobile devices.

The trade-off, then, is that sometimes you must choose a compression algorithm based on its decompression performance rather than just its compressed size. For example, ZPAQ, a very efficient compressor, uses a neural net as part of its codec and thus requires an enormous amount of runtime resources to decompress. Such requirements make it off-limits for mobile devices with smaller CPUs and battery constraints.
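When decompression speed is the deciding factor, measure it directly on your target data and device. The following Python sketch times decompression for a few standard-library codecs; corpus.txt is again a placeholder for your own payloads.

    import bz2
    import lzma
    import time
    import zlib

    data = open("corpus.txt", "rb").read()   # placeholder for your real payloads

    candidates = {
        "zlib": (zlib.compress(data, 9), zlib.decompress),
        "bz2":  (bz2.compress(data, 9), bz2.decompress),
        "lzma": (lzma.compress(data, preset=9), lzma.decompress),
    }

    for name, (blob, decompress) in candidates.items():
        start = time.perf_counter()
        decompress(blob)
        elapsed = time.perf_counter() - start
        print(f"{name:5s} {len(blob):>10d} bytes  decode {elapsed * 1000:.1f} ms")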

The WebP image format is another perfect example of this. The first few versions of WebP boasted better-quality images at smaller data sizes than JPG, but decoding took almost twice as long. Because of this, many companies hesitated to adopt the format. Performance improved in later versions, and mass adoption eventually followed.

Hardware decoders, which are common in laptops and mobile devices, are picking up some of the slack. Hardware JPG, OGG, and H.264 chips are boosting decode performance for their specific algorithms, making them a preferred choice under some circumstances.

Decode performance is actually one of the dominant reasons why GZIP is one of the most common archival compression algorithms on the planet right now. GZIP produces good general compression sizes at really fast decompression speeds, making it applicable to all sorts of embedded and nonembedded devices. Over the 20+ years since its creation, GZIP implementations have been continually improved to take decompression performance to new levels.

Ability to Decode-Stream

Data streaming is often an overlooked aspect of decompression. We generally think of compression algorithms as working with a “whole package,” meaning that all the data needs to be in memory before any decoding can occur.

But that’s actually far from the truth. Think of listening to an entire opera or watching Kenneth Branagh’s Hamlet in one session: taking in all of it at once can be too much—better to process parts of it at a time.

Luckily, some general compression algorithms, such as GZIP and BZIP2, along with most media compressors, like H.264, also work in a streaming mode. Data can be sent to the client in chunks that are decoded as they arrive (i.e., block encoding). For many client-side applications, this ability is highly sought after. Imagine a user scrolling through the beginning of a social media stream, and having to download the last 10 years of “what I ate for breakfast today” posts before being able to decode the “I finally planked Stonehenge” announcement.
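As a small illustration of streaming decode, Python’s zlib.decompressobj can consume a gzip stream chunk by chunk, emitting output as each piece arrives rather than waiting for the whole payload. The chunk size and sample text below are arbitrary.

    import gzip
    import zlib

    def stream_decode(chunks):
        """Decode gzip-framed data incrementally, yielding output as chunks arrive."""
        decoder = zlib.decompressobj(wbits=31)   # wbits=31: expect gzip framing
        for chunk in chunks:                     # e.g., successive reads from a socket
            yield decoder.decompress(chunk)
        yield decoder.flush()

    # Simulated network delivery: gzip once, then feed it back in 4 KB pieces.
    compressed = gzip.compress(b"what I ate for breakfast today\n" * 5_000)
    pieces = (compressed[i:i + 4096] for i in range(0, len(compressed), 4096))
    total = sum(len(part) for part in stream_decode(pieces))
    print(total, "bytes decoded incrementally")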

Comparing Compressors

There are so many compression formats and algorithms out there, it’s sometimes good to get them in a head-to-head battle to see which ones win on a given type of data.

Fortunately, you don’t have to do this yourself. For example, the Large Text Compression Benchmark regularly pits algorithms head-to-head for 1 GB files of text data. The Squash Compression Benchmark tests various types of XML, text, images, and other data formats. And Squeeze Chart compares all sorts of text, audio, and bitmap content.

The main point of all this is that different compressors with different settings will influence how much compression savings your app can take advantage of. So, in all cases, test a handful of options based on your restrictions and goals, and pick the right one for you.

1 This is changing while we are writing this book.

2 Algorithms that lose information during the compression process. See Chapter 14 for some words on lossy compression.

3 More specifically, algorithms that work for blocks of text might not compress numerical data well, and vice versa.