Glossary of Compression Words

7zip

A file archiver with a high compression ratio.

Alpha blending

The process of combining an image with a background to create the appearance of partial or full transparency.

Alpha channel

An additional channel (in addition to RGB) that contains a value between 0 and 1 that designates the transparency for each pixel.

Alpha transparency

The transparency value for a pixel as transmitted in the Alpha channel.

Arithmetic coding

An algorithm for encoding data, wherein frequently used characters are stored with fewer bits and not-so-frequently occurring characters are stored with more bits, resulting in fewer bits used in total. Rather than assigning codewords to symbols in a 1:1 fashion, this algorithm transforms the entire input stream from a set of symbols to one (excessively long) numeric value, whose log₂ representation is closer to the true value of entropy for the stream.

See Chapter 5.
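
A minimal sketch of the interval-narrowing idea, using exact fractions for clarity (the three-symbol alphabet and its probabilities are made up for illustration; real coders work with fixed-precision integers and renormalization):

```python
from fractions import Fraction

def cumulative_ranges(probs):
    """Map each symbol to its slice of [0, 1), e.g. 'a' -> [0, 1/2)."""
    ranges, cum = {}, Fraction(0)
    for sym, p in probs.items():
        ranges[sym] = (cum, cum + p)
        cum += p
    return ranges

def arith_encode(message, probs):
    """Narrow [low, high) once per symbol; any number in the
    final interval identifies the entire message."""
    ranges = cumulative_ranges(probs)
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        span = high - low
        lo, hi = ranges[sym]
        low, high = low + span * lo, low + span * hi
    return low, high

def arith_decode(value, probs, length):
    """Undo the narrowing: find the symbol whose slice holds the
    value, emit it, and rescale the value back to [0, 1)."""
    ranges = cumulative_ranges(probs)
    out = []
    for _ in range(length):
        for sym, (lo, hi) in ranges.items():
            if lo <= value < hi:
                out.append(sym)
                value = (value - lo) / (hi - lo)
                break
    return "".join(out)

probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
low, high = arith_encode("abac", probs)
assert arith_decode(low, probs, 4) == "abac"
```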

Asymmetric numeral systems (ANS)

A modern variant of statistical compression, which has shown early promise in achieving close to entropy compression with performance comparable to Huffman coding.

See Chapter 5.

Binary or base 2 number system

A way of representing numbers using only the digits 0 and 1, where each position represents a power of 2. For example, the decimal number 5 would be 101, or 2⁰ + 2².

See Chapter 2.
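
A quick sanity check of the example, in Python:

```python
# Decimal 5 in binary: each digit position carries a power of 2.
assert bin(5) == "0b101"        # Python's built-in conversion
assert 5 == 2**2 + 2**0         # 101 -> one 4, no 2, one 1
assert int("101", 2) == 5       # and back again
```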

Binary erasure channel

A communications channel model used for analysis in information theory. The idea is that a bit is never wrong. It either exists and is correct or it is “erased.”

Binary search

An algorithm for finding a target value in a sorted array.
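
A minimal sketch in Python (halving the search range each step; the array being sorted is essential):

```python
def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            lo = mid + 1    # target can only be in the upper half
        else:
            hi = mid - 1    # target can only be in the lower half
    return -1               # not found

assert binary_search([1, 3, 5, 7, 9], 7) == 3
assert binary_search([1, 3, 5, 7, 9], 4) == -1
```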

Bitwise exclusive OR (XOR)

bitwise = operates on each bit independently

exclusive OR (XOR) = a logical operation that outputs TRUE only when the inputs differ (one is TRUE, the other is FALSE).
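
Putting the two together in Python:

```python
# Bitwise XOR: each result bit is 1 only where the input bits differ.
assert 0b1100 ^ 0b1010 == 0b0110
# XOR is its own inverse, which is why it shows up in delta and parity tricks:
x = 42
assert (x ^ 0b1010) ^ 0b1010 == x
```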

Block codes

Any error-correcting code that encodes data in blocks.

Blocking

The act of subdividing a set of data into smaller “blocks” for the purpose of better compression.

Block sorting compression

The name for a performant application of the Burrows–Wheeler transform, which splits the data stream into blocks and applies BWT to each block rather than to the entire stream.

BMP file

A file format for bitmaps (simple raster-encoded images with RGB channels).

BSON

A binary version of the JSON serialization format.

Burrows–Wheeler transform (BWT), block-sorting compression

Example of a reversible transform that rearranges a character string into runs of similar characters.

See Chapter 8.
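
A naive sketch of both directions in Python (the `$` sentinel is the usual trick for marking the end of the string; real implementations use much faster suffix-sorting methods instead of building every rotation):

```python
def bwt(s):
    """Sort all rotations of s + sentinel; the last column is the transform."""
    s += "$"  # unique end-of-string marker, sorts before letters
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def ibwt(last):
    """Invert by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith("$"))[:-1]

# Runs of similar characters appear in the output: 'annb$aa'.
assert bwt("banana") == "annb$aa"
assert ibwt("annb$aa") == "banana"
```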

BZIP/BZIP2

Free and open source file compression program that uses the Burrows–Wheeler transform to compress single files.

Bytecode

Instruction set made up of compact numeric codes designed for efficient execution by a software interpreter.

Cap’n Proto

A binary serialization format.

Channel

The means by which information is transmitted.

Claude Shannon

An American mathematician, known as the “father of information theory.” He is also the reason why this book exists. So, read his Wikipedia article to find out what a mess he got us into.

Coder

See encoder.

Codec

Short for coder-decoder. A device or computer program capable of encoding or decoding a digital data stream or signal.

Coding theory

Science of finding explicit methods, called codes, for increasing the efficiency and reducing the error rate of data communication over noisy channels to near the channel capacity. These codes can be roughly subdivided into data compression (source coding) and error-correction (channel coding) techniques. A third class of information theory codes is cryptographic algorithms.

Communication

The purposeful activity of information exchange between two or more participants in order to convey or receive the intended meanings through a shared system of signs and semiotic rules. The fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point.

Context modeling

The process of using multiple information signals about a piece of data to infer what the best type of compression algorithm is to apply to it.

Contextual compressor

Compressor that determines the output symbol based on the probability of the input symbol in the current context.

See Chapter 8.

Convolutional code

A type of error-correction code used to increase the reliability of data transfers.

Cryptographic algorithms

Algorithms to encode information with the purpose of making transmission more secure: keeping the content secret.

Data compression

Representing the information in a data stream with fewer bits than the original stream. For example, writing “Rolling on the floor laughing” as “ROFL” compresses the original from 29 to 4 characters, which is a savings of 86%. (This also hints at the fact that having context can increase compression rate.)

Data stream

A block of data that’s intended to be consumed in small groups at a time rather than having access to every part of the stream at once. A real-life example is listening to music on a radio.

Decoder

Part of compression software that is responsible for translating the compressed stream back into uncompressed data.

DEFLATE

A popular compression algorithm that utilizes LZ and statistical encoding to achieve compression.

Dictionary encoding

The process of transforming your data stream based upon most common symbol groupings.

See Chapter 7.

Dynamic range of data set

For the purpose of this book, the number of bits needed to represent every value in the data.

Encoder

Part of compression software that is responsible for translating the source information into compressed data form.

Entropy

See information entropy.

Error-correcting code

A code attached to a chunk of information that can be as simple as confirmation that the information is correct, or complex enough to “fix” errors.

Error correction

Attaching a code to a chunk of information that makes it possible to detect and correct errors in transmitted data. Error detection and correction codes increase the reliability of information and make it more resilient to noise. Error correction is orthogonal to compression.

Finite state entropy (FSE)

A practical implementation of asymmetric numeral systems that focuses primarily on improved performance.

Flatbuffers

An efficient open source cross-platform serialization library originally created at Google for game development and other performance-critical applications.

Grouping

In the context of compression, assigning bits to a group of symbols instead of to individual symbols. For example, in text, you could encode the 100 most common words. Finding out which strings/substrings to group is its own challenge.

GIF

An image compression format known for its transparency support (a single palette entry can be marked fully transparent) and animated image usage.

gzip (GNU zip)

Unpatented compression utility widely used on the Internet.

H.264

Video coding format that is currently one of the most commonly used formats for the recording, compression, and distribution of video content. H.264 is perhaps best known as being one of the video encoding standards for Blu-ray discs and video streaming. The format is patented. H.264 is typically used for lossy data compression in the strict mathematical sense, although the amount of loss can sometimes be imperceptible. It is also possible to create truly lossless data compression using it—e.g., to have localized lossless-coded regions within lossy-coded pictures, or to support rare use cases for which the entire encoding is lossless.

Histogram

A diagram consisting of rectangles whose area is proportional to the frequency of a variable, and whose width is equal to the class interval.

HTTP protocol stack

The suite of protocols that make up the Hypertext Transfer Protocol (HTTP).

Huffman coding

Prefix-based lossless data compression encoding.

See Chapter 5.
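
A minimal sketch of the tree-building idea using Python’s heap module (real implementations emit canonical codes; this one just merges the two least frequent entries until one remains):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix code: repeatedly merge the two least frequent
    nodes, prefixing their codewords with 0 and 1 respectively."""
    # Each heap entry: [frequency, tiebreaker, {symbol: codeword}]
    heap = [[freq, i, {sym: ""}]
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, [f1 + f2, i, merged])
        i += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
# Frequent symbols get shorter codes: 'a' (5x) beats 'd' (1x).
assert len(codes["a"]) < len(codes["d"])
```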

Information

For our purposes, the content of a message. Information can be encoded into various forms for transmission and interpretation (for example, information can be encoded into a sequence of symbols or transmitted via a sequence of signals).

Information resolves uncertainty. The uncertainty of an event is measured by its probability of occurrence: the less probable an event, the more uncertain it is, and the more information is required to resolve that uncertainty.

Information content

The actual information contained in a datastream (versus noise). See also information entropy.

Information entropy

The average number of bits needed to store or communicate one symbol in a message.

See Chapter 3.
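
Computed directly from the definition, in Python:

```python
import math
from collections import Counter

def entropy(message):
    """Shannon entropy in bits per symbol: -sum(p * log2(p))."""
    counts = Counter(message)
    n = len(message)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

assert entropy("aaaa") == 0.0   # one symbol: no uncertainty at all
assert entropy("aabb") == 1.0   # two equally likely symbols: 1 bit each
```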

Information theory

The branch of mathematics, electrical engineering, and computer science involving quantification of information. Information theory studies the transmission, processing, utilization, and extraction of information.

ITU

Short for International Telecommunication Union.

JPG/JPEG

Lossy data compression format widely used for digital images.

JSON format

JavaScript Object Notation (JSON) is a data-interchange format that in addition to being easy for machines to parse and generate, is easy for humans to read and write.

Key-value pairs

Representing data as a collection of pairs—for example, [word, definition] or [row, value].

Laplace estimator

A formula used to estimate underlying probabilities when there are few observations, or for events that have not been observed to occur at all in (finite) sample data.

Lempel–Ziv and Lempel–Ziv–Welch algorithms

A family of lossless algorithms for tokenizing data streams.

See Chapter 7.

Least significant bit (LSB)

The bit in a binary number that has the smallest numerical value.

For example, in the binary number 1000, the LSB is the right-most digit with the value 0.

Lexicographic order

A generalization of the way the alphabetical order of words is based on the alphabetical order of their component letters.

Lexicographic permutation

Clustering groups of the same symbol near each other. See also BWT.

List decoding

The main idea behind list decoding is that the decoding algorithm, instead of outputting a single possible message, outputs a list of possibilities, one of which is correct. This allows for handling a greater number of errors than that allowed by unique decoding.

Linear correlation

A relationship or connection between two things based on co-occurrence or pattern of change, where if one of them changes, the other one changes linearly. For example: if the temperature increases, ice cream sales also increase. Most important: correlation does not imply causation.

Literal stream

A data stream that only holds literal (unencoded) values for symbols.

See Chapter 6.

Literal token

An output token that indicates that the next symbol should be read/written from/to the literal stream.

See Chapter 6.

Lossless data compression

Applying compression techniques where no information is lost during compression, and the original information can be fully and exactly reconstructed.

Logic synthesis

In electronics, a process by which an abstract form of desired circuit behavior is turned into a design implementation in terms of logic gates.

Lossy data compression

Refers to compression techniques wherein some information is lost during compression, and the original information cannot be fully and exactly reconstructed. The goal of lossy compression is to find a balance between maximum compression and sufficient fidelity to the original data.

LZA

Archiver based on the LZ77 algorithm.

LZMA

The Lempel–Ziv–Markov chain algorithm, a dictionary compression scheme based on LZ77; best known as the default compression method of the 7zip archiver.

LZ77, LZ78

A family of algorithms that achieves compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the uncompressed data stream. Other implementations include LZFSE, LZHAM, and LZTurbo.
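
A deliberately naive sketch of the LZ77 idea (real implementations use hash chains or suffix structures instead of this brute-force match search; the `(offset, length, next_char)` token layout is one common textbook formulation):

```python
def lz77_compress(data, window=255):
    """Emit (offset, length, next_char) triples; offset 0 means 'no match'."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        start = max(0, i - window)
        for off in range(1, i - start + 1):
            length = 0
            # Extend the match; stop one symbol early so a literal
            # next_char always remains for the token.
            while (i + length < len(data) - 1 and
                   data[i + length - off] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(tokens):
    data = []
    for off, length, ch in tokens:
        for _ in range(length):
            data.append(data[-off])  # copy from `off` symbols back
        data.append(ch)
    return "".join(data)

tokens = lz77_compress("abcabcabcx")
assert lz77_decompress(tokens) == "abcabcabcx"
```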

Locality-dependent skewing

We totally made this term up, and what we mean by it is skewing that is different at different locations of the data stream.

Markov chain

Named after Andrey Markov, a random process that undergoes transitions from one state to another on a state space. It must possess a property that is usually characterized as “memorylessness”: the probability distribution of the next state depends only on the current state and not on the sequence of events that preceded it.

Mathematical origami

The mathematical study of origami.

Most significant bit (MSB)

The bit in a binary number that has the highest numerical value. For example, in the binary number 1000, the MSB is the left-most digit, which has the value 8.

Move-to-front (MTF)

A data transformation algorithm that is locally adaptive.

See Chapter 8.
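
A minimal sketch in Python (the alphabet and its initial ordering are assumptions here; real uses derive them from the data):

```python
def mtf_encode(message, alphabet):
    """Emit each symbol's current list position, then move it to the front."""
    table = list(alphabet)
    out = []
    for sym in message:
        idx = table.index(sym)
        out.append(idx)
        table.insert(0, table.pop(idx))
    return out

def mtf_decode(indices, alphabet):
    table = list(alphabet)
    out = []
    for idx in indices:
        sym = table[idx]
        out.append(sym)
        table.insert(0, table.pop(idx))
    return "".join(out)

# Runs of the same symbol become runs of zeros -- handy before RLE.
assert mtf_encode("aaabbb", "abc") == [0, 0, 0, 1, 0, 0]
assert mtf_decode([0, 0, 0, 1, 0, 0], "abc") == "aaabbb"
```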

MSGPACK

Small and fast binary serialization format.

Mumbo jumbo

What’s inside Fibonacci encodings.

Multicontext coders

Algorithms that weave together multiple symbols and statistical tables or models in order to identify the least number of bits needed to encode the next symbol.

See Chapter 9.

Multiset

A set where multiple occurrences of the same element are considered.

Mutual information

A measure of information that is in common between two random variables.

n-grams

A contiguous sequence of n items from a given sequence of text or speech.

Noise

Anything that corrupts a signal or information between the source and the recipient.

Nonsingular codes

A code is nonsingular if each source symbol is mapped to a different non-empty bit string.

Normalization

Adjusting values measured on different scales to a notionally common scale.

Permutation

In mathematics, the notion of permutation relates to the act of rearranging, or permuting, all the members of a set into some sequence or order.

Prediction by Partial Matching (PPM)

An algorithm based on Markov chains with many variants, including PPM*, PPMD, and PPMZ.

See Chapter 9.

Prefix code

A code is a prefix code if no target bit string in the mapping is a prefix of the target bit string of a different source symbol in the same mapping. This means that symbols can be decoded instantaneously after their entire codeword is received. Prefix codes are always nonsingular and uniquely decodable.
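
A brute-force check of the property, in Python:

```python
def is_prefix_code(codewords):
    """True if no codeword is a prefix of another codeword."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True

assert is_prefix_code(["0", "10", "110", "111"])
assert not is_prefix_code(["0", "01"])  # "0" is a prefix of "01"
```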

Prefix property

This property prescribes that after a code has been assigned to a symbol, no other code can start with that bit pattern. A required property for variable-length codes.

Probability distribution

A statistical function that describes all the possible values and likelihoods that a random variable can take within a given range. This range will be between the minimum and maximum statistically possible values, but where the possible value is likely to be plotted on the probability distribution depends on a number of factors, including the distribution’s mean, standard deviation, and skewness.

Program synthesis

Automatically generating a program that satisfies a set of requirements.

Protocol buffers, protobuffs

Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data.

Quantization

The procedure of constraining something from a continuous set of values (such as the real numbers) to a relatively small discrete set (such as the integers).

Quantification

The act of counting and measuring that maps human sense observations and experiences into members of some set of numbers. For example, expressing the level of noise at a concert in decibels.

Qbit, qubit

Short for quantum bit. The basic unit of information in a quantum computer. Also used to designate a very, very small amount of something.

Rate

The average entropy per symbol.

Range coding

An algorithm that does basically the same as arithmetic coding but is free of patents.

Redundancy

Words or data that could be omitted without loss of meaning or function; repetition or superfluity of information. For example, in “there were ten (10) ducks in the pond,” “(10)” is redundant.

In information theory, the number of bits used to transmit a message minus the number of bits of actual information in the message.

Run-length encoding (RLE)

RLE takes advantage of the adjacent clustering of symbols that occur in succession. It replaces a “run” of symbols with a tuple containing the symbol and the number of times it is repeated.

See Chapter 8.
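
A minimal sketch in Python (real RLE formats pack the tuples into bytes; this one keeps them as Python tuples):

```python
def rle_encode(message):
    """Replace each run of a symbol with a (symbol, count) tuple."""
    out = []
    for sym in message:
        if out and out[-1][0] == sym:
            out[-1] = (sym, out[-1][1] + 1)  # extend the current run
        else:
            out.append((sym, 1))             # start a new run
    return out

def rle_decode(tuples):
    return "".join(sym * count for sym, count in tuples)

assert rle_encode("aaabbc") == [("a", 3), ("b", 2), ("c", 1)]
assert rle_decode([("a", 3), ("b", 2), ("c", 1)]) == "aaabbc"
```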

Serialization

The process of converting objects or data structures into strings of bits that can be stored or transmitted. It is implied that the original object can be reconstructed using deserialization.

Shannon–Fano coding

Technique for constructing a prefix code based on a set of symbols and their probabilities (estimated or measured). It is suboptimal in the sense that it does not achieve the lowest possible expected codeword length like Huffman coding; however, unlike Huffman coding, it does guarantee that all codeword lengths are within one bit of their theoretical ideal.

Sources

Objects that encode message data and transmit the information, via a channel, to one or more receivers. Any process that generates successive messages. Also called the “sender.”

Statistical compression

Compression that determines the output symbol based on the probability of the input symbol.

See Chapter 5.

Statistical skewing

In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. Or in plain language, a measure for how much more probable some symbols are than others. For example, the English alphabet is skewed in that the letter “e” is a lot more common than the letter “q”. For compression, skewing is a good thing, and some data transforms manipulate the set of symbols to increase skewing.

tANS

The table-based variant of asymmetric numeral systems (ANS), described in Jarek Duda’s paper “Asymmetric Numeral Systems: Entropy Coding Combining Speed of Huffman Coding with Compression Rate of Arithmetic Coding”.

See Chapter 5.

Tokenizing a stream

Assigning symbols to the contents of a stream. For example, in lexical analysis, tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. In the context of compression, it’s about finding the best way of assigning a dictionary of “words” to a stream.

Transformation of data

In the context of compression, making changes to a data stream (without changing the information content) to make it more compressible. For example, in [123456,123457,123458], storing the delta from each number to the next might require fewer bits than storing the numbers themselves, as in [123456,1,1]. Finding the right transformation for a given datastream is in itself a big challenge.
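
The delta example from above, sketched in Python:

```python
def delta_encode(values):
    """Keep the first value; replace each later value with its delta."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = deltas[:1]
    for d in deltas[1:]:
        out.append(out[-1] + d)  # running sum restores the originals
    return out

assert delta_encode([123456, 123457, 123458]) == [123456, 1, 1]
assert delta_decode([123456, 1, 1]) == [123456, 123457, 123458]
```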

Trie

In computer science, a trie, also called “digital tree” and sometimes “radix tree” or “prefix tree” (as they can be searched by prefixes), is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings.
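
A minimal sketch in Python, using nested dicts as nodes (the `'$'` end-of-word marker is an arbitrary convention for this sketch):

```python
def trie_insert(root, word):
    """Walk/create one dict node per character; '$' marks end of word."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node["$"] = True

def trie_contains(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node  # reached a node, but is it a whole word?

trie = {}
for w in ["tea", "ten", "to"]:
    trie_insert(trie, w)
assert trie_contains(trie, "tea")
assert not trie_contains(trie, "te")  # a prefix, but not an inserted word
```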

Unary encoding

An entropy encoding that represents a natural number, n, with n ones followed by a zero (if natural number is understood as non-negative integer) or with n−1 ones followed by a zero (if natural number is understood as strictly positive integer). For example, 5 is represented as 111110 or 11110. The ones and zeros are interchangeable.
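
Directly from the definition, in Python (using the non-negative-integer convention: n ones followed by a zero):

```python
def unary_encode(n):
    """Encode non-negative n as n ones followed by a terminating zero."""
    return "1" * n + "0"

def unary_decode(bits):
    return bits.index("0")  # count the ones before the terminator

assert unary_encode(5) == "111110"
assert unary_decode("111110") == 5
```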

Unicode

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems. The latest version of Unicode contains a repertoire of more than 120,000 characters covering 129 modern and historic scripts, as well as multiple symbol sets.

Uniquely decodable codes

A code is uniquely decodable if its extension is nonsingular—which means that the target symbols are uniquely identifiable.

Universal code

A way of creating variable-length codes for positive integers by mapping each integer to a unique binary encoding. In general, the smallest integers are assigned the fewest bits.

Variable-length codes or VLC

Codes that use encodings that vary in length. Usually the shortest encodings are applied to the most common symbols in the data set.

See Chapter 4.

Video codec

An electronic circuit or software that compresses or decompresses digital video, thus converting raw (uncompressed) digital video to a compressed format or vice versa. In the context of video compression, “codec” is a concatenation of “encoder” and “decoder”; a device that can only compress is typically called an encoder, and one that can only decompress is known as a decoder.

XML

XML, which stands for Extensible Markup Language, defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

XOR

See bitwise exclusive OR operation.

ZIP file format

An archive file format that supports lossless data compression. It is not a compression algorithm.