Next to image data, serialized content is the second most common data format you’ll be sending around in your networked applications. And even though the lowest-hanging data compression fruit will clearly come from image data, it’s equally important to take a hard look at serialized content.
What do we mean by “serialized”? Serialization is the process of taking a high-level data object and converting it into a string of bytes (the inverse is deserialization). This transform can be applied to a plethora of data types, but it most commonly describes converting an in-memory structure or class into a file, or into a binary large object (BLOB), to send over a network.
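As a minimal illustration, here’s what that round trip looks like in Python with the standard json module (the sample object is invented):

import json

# An in-memory structure: a plain dictionary.
status_update = {"user_id": 42, "text": "Hello, world", "likes": 0}

# Serialize: object -> bytes suitable for a file or a network socket.
blob = json.dumps(status_update).encode("utf-8")

# Deserialize: bytes -> an equivalent in-memory object.
restored = json.loads(blob.decode("utf-8"))
assert restored == status_update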
This particular use case dominates the mountain of data transfers we see from modern mobile and web applications. Consider your favorite social media app. When you load it for the first time, a flurry of serialized data passes between the client and the server in order to show you the right information on the screen. And it continues as you receive updates, news, and messages. When you post a status update of your own, that input has to go into memory, be serialized, and be uploaded to the server, which deserializes it, adds it to its database, and then serializes it again in order to send the update to all of your friends.
Although images take up the bulk of your data compression footprint by size, serialized content makes up for it in volume.
This means that performance is critical for serialization speed, deserialization speed, and the resulting file sizes that need to be sent around to millions of users. To provide some assistance here, we’re going to look at the common serialized file formats XML and JSON, and discuss some techniques you can apply to make them smaller for your users.
It’s important to have a clear understanding of how your serialized content is being used, because this can have a large impact on decisions you make vis-à-vis compressing that data. Here are the most common use cases.
Dynamically built server data is the most common type of serialized data in modern mobile applications. A client typically queries a server, perhaps asking for the results of a database operation; the server computes the results, serializes them, and sends them back to the client for deserialization. In this process, the serialized data is typically compressed further by the HTTP protocol stack (for example, using GZIP), which helps reduce the overall transfer size. The decompression time this adds on the client is well worth it, given the size savings.
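For a rough sense of what that GZIP step buys you, here’s a small Python sketch (the payload is invented) that compresses a serialized response the way the HTTP stack would before sending it over the wire:

import gzip
import json

# A made-up server response: 1,000 small records serialized to JSON.
records = [{"id": i, "score": i % 7, "tag": "news"} for i in range(1000)]
payload = json.dumps(records).encode("utf-8")

# What the HTTP stack does transparently with Content-Encoding: gzip.
compressed = gzip.compress(payload)

print(len(payload), len(compressed))  # repetitive JSON compresses very well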
Although dynamically built data is common, applications typically use static serialized content as well; for example, sending the client configuration files for the latest build. The author can update these files on the server on a semiregular basis, and that’s usually done offline. As such, the server simply views these files as static and passes them off to the client upon request. Again, these files tend to be further compressed by the HTTP stack.
In many situations, the client will send information to some server, in which case the creation of this serialized information occurs on the client. This means that the overhead of serialization and data compression resides entirely on the client device. For laptops or desktops, this might not be an issue, but for mobile phones, tablets, and wearable devices, it can spell big trouble over time. In addition, because those devices tend to be lower powered, there’s typically less of a desire to spend client resources on hyper-compressing data for upload. This creates a unique balancing act that developers will need to work out for their specific applications.
Finally, there is data that resides on and is used locally by the client; for example, layout information that’s authored once and then loaded many times without further changes. This information is quite easy to compress, typically during the build-time of an application, while extra machine power is available. The only thing the client needs to do is keep the data resident (on persistent storage) and load its content into memory on demand.
The two biggest serialization formats used today are JSON and XML. This is mostly due to their adoption by the web platform over the past 20 years. Although easy to use, and hugely popular, these formats present some very specific compression issues.
One of the draws of JSON and XML is that they are (more or less) human readable. That is, if you opened the serialized file in your text editor, you’d be able to read the entire thing, as demonstrated in this random JSON snippet:
{
    "base": {
        "reboot": { ...omitted for brevity... },
        "updateBaseConfiguration": { ...omitted for brevity... }
    },
    "robot": {
        "jump": {
            "parameters": {
                "height": {
                    "type": "integer",
                    "minimum": 0,
                    "maximum": 100
                },
                "_jumpType": {
                    "type": "string",
                    "enum": [ "_withAirFlip", "_withSpin", "_withKick" ]
                }
            }
        },
        "speak": {
            "parameters": {
                "phrase": {
                    "type": "string",
                    "enum": [ "beamMeUpScotty", "iDontDigOnSwine", "iPityDaFool",
                              "dangerWillRobinson" ]
                },
                "volume": {
                    "type": "integer",
                    "minimum": 0,
                    "maximum": 10
                }
            }
        }
    }
}
As you can see, this is done by representing the entire file as a set of string values, cobbled together with tokens that define how everything is related.
The benefit is an amazingly flexible format (almost any data structure can find a way to be properly serialized to these formats), but the downside is a massive amount of overhead in order to include all that human-readable information.
Looking at the preceding JSON snippet, a large number of spaces, line breaks, and string quotes are included, simply to make this file more human readable. As a result, the encoded file is larger, in bits, than it needs to be. The problem becomes worse with numerical data. For example, if your serialized JSON file contains the string “3.141592653589793”, it would be 17 bytes long (or even longer, depending on your character encoding). This is completely insane, considering that the actual floating-point number used to represent this number is only 8 bytes (or 64 bits) long. The human-readable version is more than twice the size of the binary one.
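You can check this gap yourself; here’s a quick Python comparison of the text form of pi against its 64-bit binary encoding:

import struct

text_form = "3.141592653589793"
binary_form = struct.pack("d", 3.141592653589793)  # one IEEE 754 double

print(len(text_form.encode("utf-8")))  # 17 bytes as UTF-8 text
print(len(binary_form))                # 8 bytes in binary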
It’s important to note that decode times can often be problematic for these text formats. The reasons for this are manifold:
String-based input must be converted to memory objects using heavy-handed operations (for example, converting ASCII symbols to integer numbers is not cheap).
Holding data in temporary memory during load time isn’t always efficient.
Backward compatibility to older formats can slow encoding and decoding.
The takeaway is that formats like XML and JSON, by default, skew toward longer load times in order to properly deserialize on the client. In fact, there’s an array of XML and JSON encoders out there that are entirely focused on reducing load times for specifically organized file types.
With all of this in mind, there are a few tricks you can employ to help reduce the size of JSON and XML data as it’s being sent to your users.
Easily, the biggest bang for the buck is kicking JSON and XML to the curb, and finding a binary serialization format to go with instead. Binary formats lack the human-readable nature of JSON and XML, but they ensure that the data is encoded in a compact and efficient binary form. The results are smaller files and faster load times.
Even though binary serialization formats are in abundance, some of our favorites lie in the middle ground between small wire size and fast decode time. If you’re willing to define your own schema, Protocol Buffers (protobuf), FlatBuffers, and Cap’n Proto should be the first formats you evaluate for these benefits.
But suppose that you are not ready to abandon the XML or JSON ship, or your boss won’t let you get off the text-based serialization wagon. There are still ways to serialize your JSON data more efficiently and make it more compressible. Formats such as BSON and MessagePack keep the JSON data model but encode it in a binary form, which gets you smaller files without having to rewrite much of your code.
The real joy of these binary formats is that they produce smaller output than their human-readable counterparts, and in some cases, they can be compressed further by general-purpose encoders such as GZIP.
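As a rough sketch of the difference (this assumes the third-party msgpack package is installed, and the sample data is invented), compare the same object serialized as JSON text and as MessagePack, before and after GZIP:

import gzip
import json

import msgpack  # third-party: pip install msgpack

people = [{"name": "Joanna", "country": "USA", "age": 31}] * 500

as_json = json.dumps(people).encode("utf-8")
as_msgpack = msgpack.packb(people)

# The binary form starts out smaller...
print(len(as_json), len(as_msgpack))
# ...and both can still be squeezed further by GZIP.
print(len(gzip.compress(as_json)), len(gzip.compress(as_msgpack)))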
Here’s an interesting point. When you’re serializing your data, most of the time you’re doing so to mirror the in-memory object form of the content. In the next code snippet, consider the structure at the top, and how it’s serialized to JSON just below it.
struct {
    int id;
    char* name;
    int gender;
    int age;
    char* address;
    int employeeID;
}

{
    "id": 25,
    "name": "Hooty McOwlface",
    "gender": 27,
    "age": 88,
    "address": "1600 Amphitheatre Pkwy, Mountain View, CA 94043",
    "employeeID": 3025
},
The ordering of attributes in the JSON file tends to follow the in-memory representation of the corresponding structure. Although this is fine for ease of programmer maintenance, it doesn’t produce the best compression results after you get an entire list of structures.
First, consider that a JSON object (picking on JSON for a minute) is made up of key-value pairs, where the key portion is repeated for each instance of the structure in the file, adding bloat. In the following list of people and their countries, you need to repeat the “name” and “country” keywords for every single person:
...
{
    "name": "Joanna",
    "country": "USA"
},
{
    "name": "Alex",
    "country": "AUS"
},
{
    "name": "Colt",
    "country": "USA"
}
...
For large JSON files that list many elements in this form, the overhead of each occurrence of “name” and “country” contributes a great deal to the final byte size.
Second, recall that GZIP and its brethren are all based on the LZ algorithm for their primary transform step, meaning that they are most powerful when they can find repeated data patterns within their search window.
Imagine an entire file of such employee data, and realize that there are gaps between values that are potential duplicates. For example, one “age” value might sit farther from the next “age” value in the serialized file than the encoder’s search window (32 KB for GZIP’s DEFLATE) can reach.
You can address both the repetition of keys and the distance between similar values with a simple reordering of the list content. You can transpose the previous array structure¹ such that all the values for a given key are held in a single array, close together, as demonstrated in the following example:²
{
    "name": ["Joanna", "Alex", "Colt"],
    "country": ["USA", "AUS", "USA"]
}
This reduces bloat and makes it easier for the LZ algorithm to find matches.
In programming terms, converting from an array-of-structs to a struct-of-arrays can be a critically important transform for large serialized content. So, if you’re dealing with big JSON or XML files, seriously consider this type of transform.
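Here’s a quick sketch of that transform in Python (the field names echo the earlier example), along with a GZIP size comparison of the two layouts:

import gzip
import json

# Array-of-structs: one object per person, keys repeated for every record.
aos = [
    {"name": "Joanna", "country": "USA"},
    {"name": "Alex", "country": "AUS"},
    {"name": "Colt", "country": "USA"},
] * 1000

# Struct-of-arrays: each key appears once, and similar values sit side by side.
soa = {key: [record[key] for record in aos] for key in aos[0]}

aos_bytes = json.dumps(aos).encode("utf-8")
soa_bytes = json.dumps(soa).encode("utf-8")
print(len(gzip.compress(aos_bytes)), len(gzip.compress(soa_bytes)))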
We can extend the concept of transposing structures a bit further. Do you really need to fetch fully structured data from the server? Or could you instead request each data type separately (and assemble them in the client, if necessary)?
There is a tendency for backend applications to provide a general-purpose API for all of their clients. Although this is a reasonable strategy for backend systems, it’s not good for the client, because the application ends up transferring and processing a lot of data on a small device when some calculations could be made more efficiently on the server farms.
If your application displays a feed of mixed content, ensure that the client can fetch that information in a single request and that the returned data is suitable for caching in pieces. You generally want your client to be able to identify entities so that it can store them persistently, and also avoid duplicates of the same objects in memory.
While doing this type of data fetching, many APIs return hierarchical data where all relations are denormalized. Although this is the preferred approach for most web clients, it is not good for mobile clients for which persisting data and serving it from local storage is important.
Instead of returning hierarchical data, it is better to return normalized data.
Take a look at the following bad example. The same user_id and user_name are duplicated in many places. The client will need to decompose this one big object, extract the nested user objects, get rid of duplicates, and store what’s left in the local database or memory cache.
{
    "messages" : [{
        "from" : {
            "user_id" : 1,
            "user_name" : "claude",
            ....
        },
        "text" : "hello hello",
        "date" : "123"
    },
    {
        "from" : {
            "user_id" : 1,
            "user_name" : "claude",
            ....
        },
        "text" : "how are you",
        "date" : "124"
    },
    {
        "from" : {
            "user_id" : 1,
            "user_name" : "claude",
            ....
        },
        "text" : "you there",
        "date" : "125"
    },
    {
        "from" : {
            "user_id" : 1,
            "user_name" : "claude",
            ....
        },
        "text" : "hello hello",
        "date" : "126"
    }]
}
Now look at this better example:
{
    "users" : {
        "1" : {
            "user_id" : 1,
            "user_name" : "claude",
            ....
        }
    },
    "messages" : [{
        "from" : 1,
        "text" : "hello hello",
        "date" : "123"
    },
    {
        "from" : 1,
        "text" : "how are you",
        "date" : "124"
    },
    {
        "from" : 1,
        "text" : "you there",
        "date" : "125"
    },
    {
        "from" : 1,
        "text" : "hello hello",
        "date" : "126"
    }]
}
This is much easier for the client because each object is passed only once. The returned “users” hash in the response can easily be used to update the database and in-memory cache.
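And if you can’t change the API, the client can perform this normalization itself after deserializing. Here’s a rough sketch, using the field names from the example, that extracts the nested user objects into a lookup table:

def normalize(response):
    # Split denormalized messages into a users table plus flat messages.
    users = {}
    messages = []
    for message in response["messages"]:
        user = message["from"]
        users[user["user_id"]] = user      # deduplicate users by id
        messages.append({
            "from": user["user_id"],       # keep only the reference
            "text": message["text"],
            "date": message["date"],
        })
    return {"users": users, "messages": messages}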
But we can do even better and completely flatten our hierarchy. Check out this final solution. It’s all the same information, without duplication, and it’s compact and straightforward to process.
{
    "users" : {
        "1" : {
            "user_id" : 1,
            "user_name" : "claude",
            ....
        }
    },
    "messages" : {
        "from" : [1, 1, 1, 1],
        "text" : ["hello hello", "how are you", "you there", "hello hello"],
        "date" : ["123", "124", "125", "126"]
    }
}
The more information the client has about the data it is displaying, the more efficient it can be. The application can decide which data to cache or prune and, for example, how to invalidate the layout when new data arrives. A mobile client is a lot more sophisticated than a simple HTML renderer, and you give it due respect by handing it the best possible structured data.
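For completeness, here’s a sketch of how a client might rebuild individual message objects from that flattened, column-oriented form when it needs them:

def messages_from_columns(payload):
    # Reassemble per-message dicts from the column-oriented "messages" block.
    columns = payload["messages"]
    fields = list(columns)                     # e.g., ["from", "text", "date"]
    count = len(columns[fields[0]])
    return [{field: columns[field][i] for field in fields}
            for i in range(count)]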
Typically, serialized formats such as JSON and XML become “junk drawers” for multiple types of data. You can combine integers, strings, floats, even image and sound data, all encoded right into the silly little serialized format.
However, separating out these large data types into their own compressed chunks will yield better compression than leaving them inline in the file. Think about it: if you have a JSON file with 2,600 inverted indexes, GZIP isn’t going to help you much. But separating out the indexes and delta-compressing them first can yield significant improvements.
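Here’s a sketch of that idea in Python (the index values are invented): delta-encode a sorted index before handing it to GZIP, so the encoder sees small, repetitive gaps instead of large, unique numbers:

import gzip
import json

# A made-up inverted index: sorted document IDs.
doc_ids = list(range(10, 500000, 37))

# Delta encoding: keep the first value, then store only the gaps.
deltas = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

raw = json.dumps(doc_ids).encode("utf-8")
delta_encoded = json.dumps(deltas).encode("utf-8")
print(len(gzip.compress(raw)), len(gzip.compress(delta_encoded)))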
The same goes for images. There was a scary trend for a while to base64-encode PNG files (that is, to represent the binary data in an ASCII string format) inside of CSS files for responsive web design. The use case made sense: it costs more “load time” to make an extra network transfer for the thumbnail than to carry the bloated image content inside the CSS file. We don’t condone this practice for mobile applications, except in rare cases.
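The cost of that trick is easy to measure; base64 inflates binary data by roughly one third before any CSS or HTTP framing is even counted:

import base64
import os

png_bytes = os.urandom(30_000)          # stand-in for a small PNG's binary data
as_base64 = base64.b64encode(png_bytes)

print(len(png_bytes), len(as_base64))   # base64 output is ~4/3 the original size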
When you are busy figuring out how to create that moment of utter delight for your users, thinking about data compression is probably not at the forefront of your mind. We would like to argue that it should be, at least for a few moments every day. As with every other bit of app infrastructure, building it into your development process ultimately takes less work for better results. That will translate pretty directly into happier users and, perhaps, a sweeter bottom line.
Whether you end up using built-in compressors, no compression at all, or a customized pipeline for each type of data, the important thing is that you make your choices consciously, and based on as much data as you can get your hands on.
Building a strong pipeline for image compression and data serialization can help support your application through its lifetime. Starting with the right mentality for data compression in your development helps keep things slim and thin for your users as you carry on. So do this in the beginning, rather than at the end...OK?
¹ Which, technically, is called an “array of structs” or, rather, “a list of data objects.”
² It’s worth pointing out that this is not a concept unique to serialized content. If you’ve ever had to deal with runtime performance relating to a CPU’s L2 cache residency, the solution is the same.