The process of combining an image with a background to create the appearance of partial or full transparency.
An additional channel (in addition to RGB) that contains a value between 0 and 1 that designates the transparency for each pixel.
The transparency value for a pixel as transmitted in the Alpha channel.
An algorithm for encoding data, wherein frequently used characters are stored with fewer bits and not-so-frequently occurring characters are stored with more bits, resulting in fewer bits used in total. Rather than assigning codewords to symbols in a 1:1 fashion, this algorithm transforms the entire input stream from a set of symbols to one (excessively long) numeric value, whose log2 representation is closer to the true value of entropy for the stream.
See Chapter 5.
A modern variant of statistical compression, which has shown early promise in achieving close to entropy compression with performance comparable to Huffman coding.
See Chapter 5.
A way of representing numbers using only the digits 0 and 1, where each position represents a successive power of 2. For example, the decimal number 5 would be 101, or 2⁰ + 2².
See Chapter 2.
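As a small illustrative sketch (not from the text), Python's built-in conversions make the positional-power idea concrete:

    # Decimal 5 in binary: 101, i.e., 2**2 + 2**0
    print(bin(5))         # '0b101'
    print(int("101", 2))  # 5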
A communications channel model used for analysis in information theory. The idea is that a bit is never wrong. It either exists and is correct or it is “erased.”
An algorithm for finding a target value in a sorted array.
bitwise = operates on each bit independently
exclusive OR (XOR) = a logical operation that outputs TRUE only when its inputs differ (one is TRUE, the other FALSE).
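A brief illustrative sketch (not from the text) of a bitwise XOR in Python; the result has a 1 only in the positions where the inputs differ:

    a = 0b1100
    b = 0b1010
    print(bin(a ^ b))  # '0b110': only the two middle bit positions differ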
Any error-correcting code that encodes data in blocks.
The act of subdividing a set of data into smaller “blocks” for the purpose of better compression.
The name for a performant application of the Burrows–Wheeler transform, which splits the data stream into blocks and applies BWT to each block rather than to the entire stream.
A file format for bitmaps (simple raster images encoded with RGB channels).
A binary version of the JSON serialization format.
Example of a reversible transform that rearranges a character string into runs of similar characters.
See Chapter 8.
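For illustration only, a naive sketch of the Burrows–Wheeler transform in Python; real implementations use suffix arrays rather than sorting every rotation, and the sentinel character here is an assumption:

    def bwt(s, sentinel="\0"):
        s += sentinel  # unique end-of-string marker so the transform is reversible
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rotation[-1] for rotation in rotations)

    print(repr(bwt("banana")))  # like characters cluster into runs ('nn', 'aa')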
Free and open source file compression program that uses the Burrows–Wheeler transform to compress single files.
Instruction set made up of compact numeric codes designed for efficient execution by a software interpreter.
A binary serialization format.
The means by which information is transmitted.
An American mathematician, known as the “father of information theory.” He is also the reason why this book exists. So, read his Wikipedia article to find out what a mess he got us into.
See encoder.
Short for coder-decoder. A device or computer program capable of encoding or decoding a digital data stream or signal.
Science of finding explicit methods, called codes, for increasing the efficiency and reducing the error rate of data communication over noisy channels to near the channel capacity. These codes can be roughly subdivided into data compression (source coding) and error-correction (channel coding) techniques. A third class of information theory codes are cryptographic algorithms.
The purposeful activity of information exchange between two or more participants in order to convey or receive the intended meanings through a shared system of signs and semiotic rules. The fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point.
The process of using multiple information signals about a piece of data to infer the best type of compression algorithm to apply to it.
Compressor that determines the output symbol based on the probability of the input symbol in the current context.
See Chapter 8.
A type of error-correction code used to increase the reliability of data transfers.
Algorithms to encode information with the purpose of making transmission more secure: keeping the content secret.
Representing the information in a data stream with fewer bits than the original stream. For example, writing “Rolling on the floor laughing” as “ROFL” compresses the original from 29 to 4 characters, which is a savings of 86%. (This also hints at the fact that having context can increase compression rate.)
A block of data that’s intended to be consumed in small groups at a time rather than having access to every part of the stream at once. A real-life example is listening to music on a radio.
Part of compression software that is responsible for translating the compressed stream back into uncompressed data.
A popular compression algorithm that utilizes LZ and statistical encoding to achieve compression.
The process of transforming your data stream based upon the most common symbol groupings.
See Chapter 7.
For the purpose of this book, the number of bits needed to represent every value in the data.
Part of compression software that is responsible for translating the source information into compressed data form.
See information entropy.
A code attached to a chunk of information that can be as simple as confirmation that the information is correct, or complex enough to “fix” errors.
Attaching a code to a chunk of information that makes it possible to detect and correct errors in transmitted data. Error detection and correction codes increase the reliability of information and make it more resilient to noise. Error correction is orthogonal to compression.
A practical implementation of asymmetric numeral systems, which focuses primarily on improved performance.
An efficient open source cross-platform serialization library originally created at Google for game development and other performance-critical applications.
In the context of compression, assigning bits to a group of symbols instead of to individual symbols. For example, in text, you could encode the 100 most common words. Finding out which strings/substrings to group is its own challenge.
An image compression format known for its Alpha transparency support and animated image usage.
Unpatented compression utility widely used on the Internet.
Video coding format that is currently one of the most commonly used formats for the recording, compression, and distribution of video content. H.264 is perhaps best known as being one of the video encoding standards for Blu-ray discs and video streaming. The format is patented. H.264 is typically used for lossy data compression in the strict mathematical sense, although the amount of loss can sometimes be imperceptible. It is also possible to create truly lossless data compression using it—e.g., to have localized lossless-coded regions within lossy-coded pictures, or to support rare use cases for which the entire encoding is lossless.
A diagram consisting of rectangles whose area is proportional to the frequency of a variable, and whose width is equal to the class interval.
The suite of protocols that make up the HyperText Transfer Protocol (HTTP).
Prefix-based lossless data compression encoding.
See Chapter 5.
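As a rough sketch (not the book's implementation), one way to build Huffman codes in Python is with a frequency-ordered heap; the function name huffman_codes is made up for illustration:

    import heapq
    from collections import Counter

    def huffman_codes(data):
        # Heap entries are (frequency, tie_breaker, tree); a tree is either a
        # symbol or a (left, right) pair of subtrees.
        heap = [(freq, i, sym) for i, (sym, freq) in enumerate(Counter(data).items())]
        heapq.heapify(heap)
        next_id = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # the two least frequent subtrees...
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next_id, (left, right)))  # ...get merged
            next_id += 1
        codes = {}
        def walk(tree, prefix=""):
            if isinstance(tree, tuple):
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:
                codes[tree] = prefix or "0"      # degenerate single-symbol input
        walk(heap[0][2])
        return codes

    print(huffman_codes("abracadabra"))  # the frequent 'a' gets the shortest code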
For our purposes, the content of a message. Information can be encoded into various forms for transmission and interpretation (for example, information can be encoded into a sequence of symbols or transmitted via a sequence of signals).
Information resolves uncertainty. The uncertainty of an event is measured by its probability of occurrence and is inversely proportional to it: the more uncertain an event, the more information is required to resolve the uncertainty of that event.
The actual information contained in a datastream (versus noise). See also information entropy.
The average number of bits needed to store or communicate one symbol in a message.
See Chapter 3.
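A minimal sketch (assuming the standard Shannon formula H = -sum(p_i * log2(p_i)) over the probability of each distinct symbol) of measuring the entropy of a message in Python:

    from collections import Counter
    from math import log2

    def entropy(message):
        counts = Counter(message)
        total = len(message)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    print(entropy("AAAB"))  # ~0.81 bits per symbol; skewed data needs fewer bits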
The branch of mathematics, electrical engineering, and computer science involving quantification of information. Information theory studies the transmission, processing, utilization, and extraction of information.
Short for International Telecommunication Union.
Lossy data compression format widely used for digital images.
JavaScript Object Notation (JSON) is a data-interchange format that in addition to being easy for machines to parse and generate, is easy for humans to read and write.
Representing data as a collection of pairs—for example, [word, definition] or [row, value].
A formula used to estimate underlying probabilities when there are few observations, or for events that have not been observed to occur at all in (finite) sample data.
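As an illustration of the add-one variant (the function name is hypothetical), the estimator gives unseen symbols a small nonzero probability:

    def laplace_estimate(count, total, alphabet_size):
        # Pretend every symbol was seen once more than it actually was.
        return (count + 1) / (total + alphabet_size)

    # A symbol never observed in 10 samples over a 4-symbol alphabet:
    print(laplace_estimate(0, 10, 4))  # ~0.07 instead of 0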
A family of lossless algorithms for tokenizing data streams.
See Chapter 7.
The bit in a binary number that has the smallest numerical value. For example, in the binary number 1000, the LSB is the right-most digit, with a value of 0.
A generalization of the way the alphabetical order of words is based on the alphabetical order of their component letters.
Clustering groups of the same symbol near each other. See also BWT.
The main idea behind list decoding is that the decoding algorithm, instead of outputting a single possible message, outputs a list of possibilities, one of which is correct. This allows for handling a greater number of errors than that allowed by unique decoding.
A relationship or connection between two things based on co-occurrence or pattern of change, where if one of them changes, the other one changes linearly. For example: if the temperature increases, ice cream sales also increase. Most important: correlation does not imply causation.
A data stream that only holds literal (unencoded) values for symbols.
See Chapter 6.
An output token that indicates that the next symbol should be read/written from/to the literal stream.
See Chapter 6.
Applying compression techniques where no information is lost during compression, and the original information can be fully and exactly reconstructed.
In electronics, a process by which an abstract form of desired circuit behavior is turned into a design implementation in terms of logic gates.
Refers to compression techniques wherein some information is lost during compression, and the original information cannot be fully and exactly reconstructed. The goal of lossy compression is to find a balance between maximum compression and sufficient fidelity to the original data.
Archiver based on the LZ77 algorithm.
Archiver based on the LZ77 algorithm.
A family of algorithms that achieves compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the uncompressed data stream. Other implementations include LZFSE, LZHAM, and LZTurbo.
We totally made this term up, and what we mean by it is skewing that is different at different locations of the data stream.
Named after Andrey Markov, a random process that undergoes transitions from one state to another on a state space. It must possess a property that is usually characterized as “memorylessness”: the probability distribution of the next state depends only on the current state and not on the sequence of events that preceded it.
The mathematical study of origami.
The bit in a binary number that has the highest numerical value. In the binary number 1000, the MSB on the left-most side of the number has a value of 8.
A data transformation algorithm that is locally adaptive.
See Chapter 8.
Small and fast binary serialization format.
What’s inside Fibonacci encodings.
Algorithms that weave together multiple symbols and statistical tables or models in order to identify the least number of bits needed to encode the next symbol.
See Chapter 9.
A set where multiple occurrences of the same element are considered.
A measure of information that is in common between two random variables.
A contiguous sequence of n items from a given sequence of text or speech.
Anything that corrupts a signal or information between the source and the recipient.
A code is nonsingular if each source symbol is mapped to a different non-empty bit string.
Adjusting values measured on different scales to a notionally common scale.
In mathematics, the notion of permutation relates to the act of rearranging, or permuting, all the members of a set into some sequence or order.
An algorithm based on Markov chains with many variants, including PPM*, PPMD, and PPMZ.
See Chapter 9.
A code is a prefix code if no target bit string in the mapping is a prefix of the target bit string of a different source symbol in the same mapping. This means that symbols can be decoded instantaneously after their entire codeword is received. Prefix codes are always nonsingular and uniquely decodable.
This property prescribes that after a code has been assigned to a symbol, no other code can start with that bit pattern. A required property for variable-length codes.
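A small sketch (the code table and function name are made up for illustration) showing why prefix codes can be decoded the instant each codeword completes:

    code = {"0": "A", "10": "B", "11": "C"}  # no codeword is a prefix of another

    def prefix_decode(bits):
        decoded, buffer = [], ""
        for bit in bits:
            buffer += bit
            if buffer in code:            # a full codeword has been seen
                decoded.append(code[buffer])
                buffer = ""
        return "".join(decoded)

    print(prefix_decode("010110"))  # 'ABCA'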
A statistical function that describes all the possible values and likelihoods that a random variable can take within a given range. This range lies between the minimum and maximum statistically possible values, but where a possible value is likely to fall on the probability distribution depends on a number of factors, including the distribution's mean, standard deviation, and skewness.
Automatically generating a program that satisfies a set of requirements.
Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data.
The procedure of constraining something from a continuous set of values (such as the real numbers) to a relatively small discrete set (such as the integers).
The act of counting and measuring that maps human sense observations and experiences into members of some set of numbers. For example, expressing the level of noise at a concert in decibels.
Short for quantum bit. The basic unit of information in a quantum computer. Also used to designate a very, very small amount of something.
The average entropy per symbol.
An algorithm that does essentially the same thing as arithmetic coding but is free of patents.
Words or data that could be omitted without loss of meaning or function; repetition or superfluity of information. For example, in “there were ten (10) ducks in the pond,” “(10)” is redundant.
In information theory, the number of bits used to transmit a message minus the number of bits of actual information in the message.
RLE takes advantage of the adjacent clustering of symbols that occur in succession. It replaces a “run” of symbols with a tuple containing the symbol and the number of times it is repeated.
See Chapter 8.
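A minimal sketch of the scheme described above, pairing each symbol with its run length (function names are illustrative):

    from itertools import groupby

    def rle_encode(data):
        return [(symbol, len(list(run))) for symbol, run in groupby(data)]

    def rle_decode(pairs):
        return "".join(symbol * count for symbol, count in pairs)

    pairs = rle_encode("AAAABBBCCD")
    print(pairs)              # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
    print(rle_decode(pairs))  # 'AAAABBBCCD'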
The process of converting objects or data structures into strings of bits that can be stored or transmitted. It is implied that the original object can be reconstructed using deserialization.
Technique for constructing a prefix code based on a set of symbols and their probabilities (estimated or measured). It is suboptimal in the sense that it does not achieve the lowest possible expected codeword length like Huffman coding; however, unlike Huffman coding, it does guarantee that all codeword lengths are within one bit of their theoretical ideal.
Objects that encode message data and transmit the information, via a channel, to one or more receivers. Any process that generates successive messages. Also called the “sender.”
Compression that determines the output symbol based on the probability of the input symbol.
See Chapter 5.
In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. Or in plain language, a measure for how much more probable some symbols are than others. For example, the English alphabet is skewed in that the letter “e” is a lot more common than the letter “q”. For compression, skewing is a good thing, and some data transforms manipulate the set of symbols to increase skewing.
A variant of ANS (asymmetric numeral systems) described in Jarek Duda’s paper “Asymmetric Numeral Systems: Entropy Coding Combining Speed of Huffman Coding with Compression Rate of Arithmetic Coding”.
See Chapter 5.
Assigning symbols to the contents of a stream. For example, in lexical analysis, tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. In the context of compression, it’s about finding the best way of assigning a dictionary of “words” to a stream.
In the context of compression, making changes to a data stream (without changing the information content) to make it more compressible. For example, in [123456,123457,123458], the delta from number N to number N + 1 might require fewer bits than N + 1, such as [123456,1,1]. Finding the right transformation for a given datastream is in itself a big challenge.
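A sketch of the delta idea from the example above (function names are illustrative): store the first value literally and every later value as a difference from its predecessor:

    def delta_encode(values):
        return [values[0]] + [b - a for a, b in zip(values, values[1:])]

    def delta_decode(deltas):
        out = [deltas[0]]
        for d in deltas[1:]:
            out.append(out[-1] + d)
        return out

    print(delta_encode([123456, 123457, 123458]))  # [123456, 1, 1]
    print(delta_decode([123456, 1, 1]))            # [123456, 123457, 123458]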
In computer science, a trie, also called “digital tree” and sometimes “radix tree” or “prefix tree” (as they can be searched by prefixes), is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings.
An entropy encoding that represents a natural number, n, with n ones followed by a zero (if natural numbers are understood as non-negative integers) or with n−1 ones followed by a zero (if natural numbers are understood as strictly positive integers). For example, 5 is represented as 111110 or 11110. The ones and zeros are interchangeable.
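A one-line sketch of the non-negative-integer convention described above:

    def unary_encode(n):
        return "1" * n + "0"   # n ones followed by a terminating zero

    print(unary_encode(5))  # '111110'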
Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems. The latest version of Unicode contains a repertoire of more than 120,000 characters covering 129 modern and historic scripts, as well as multiple symbol sets.
A code is uniquely decodable if its extension is nonsingular—which means that the target symbols are uniquely identifiable.
A way of creating variable-length codes for positive integers by mapping each integer to a unique binary encoding. In general, the smallest integers are assigned the fewest bits.
Codes that use encodings that vary in length. Usually the shortest encodings are applied to the most common symbols in the data set.
See Chapter 4.
An electronic circuit or software that compresses or decompresses digital video, thus converting raw (uncompressed) digital video to a compressed format or vice versa. In the context of video compression, “codec” is a concatenation of “encoder” and “decoder”; a device that can only compress is typically called an encoder, and one that can only decompress is known as a decoder.
XML, which stands for Extensible Markup Language, defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
An archive file format that supports lossless data compression. It is not a compression algorithm.