lz77ppm

A simple implementation of the LZ77 compression algorithm in C.

Academic project, Progettazione e Produzione Multimediale (PPM), Laurea Magistrale in Ingegneria Informatica, Università di Pisa.

This project is written in C, following the C99 standard, using GCC, make and doxygen.

Overview

This project implements an LZ77 compressor, consisting of a static library (which contains all the core functionalities) and an executable program (which acts as a command-line interface to the compression algorithm).

The compressor follows the standard LZ77 algorithm. A look-ahead buffer points to a certain position on the input data — to the beginning at the start. The algorithm looks for the longest match between the first bytes of the look-ahead buffer and any block of the same length inside a sliding window (which is moved over the data itself). If a match is found, the substring is replaced with a pointer to the start of the match (i.e. the offset from the beginning of the window) and the length of the match; otherwise, a symbol is copied to the output. Then, the window is shifted by the number of consumed bytes, i.e. the length of the match or the size of a symbol.

The library

The core compression routines are implemented as a small C library. They can be used to compress and decompress data from/to streams which are backed by a memory buffer or by a socket/file descriptor. That is, one can compress data stored in a buffer and send the output directly through a socket to a receiver, which in turn can then decompresses the socket stream directly to, e.g., a file.

The main requirements of the library can be summarized as follows:

it shall be able to (de)compress in-memory data;
it shall be able to (de)compress data from/to a file or socket;
it shall handle files up to 4 GiB (and possibly greater);
it shall run on Ubuntu and FreeBSD and should be as much portable as possible.

The command line interface

A command line program is also provided, which can be used to compress and decompress files up to (and actually greater than) 4 GiB. Furthermore, it can be instructed to receive the input from stdin and send the output to stdout, enabling it to be chained with other commands using pipes.

The main requirements of the CLI program can be summarized as follows:

it shall handle files up to 4 GiB (and possibly greater);
it shall support input and output from/to a pipe;
it shall run on Ubuntu and FreeBSD and should be as much portable as possible.

The memory footprint of the executable file (with the library linked statically) is around 160 KiB (on a 64-bit system).

Directory structure

liblz77ppm/: contains all source files of the library;
liblz77ppm-test/: contains a few tests for the library;
lz77ppm/: contains source files of the command line interface to the library;
doc/: will contain documentation produced by Doxygen and other documentation files:
bin/: will contain all executables;
lib/: will contain the static library liblz77ppm.a;
Doxyfile: configuration file for Doxygen;
Makefile: used by make to build everything.

Compiling

Target	Group	Output
`library`	`all`	`lib/liblz77ppm.a`
`cli`	`all`	`bin/lz77ppm`
`test-library`	`test`	`bin/liblz77ppm-test`
`documentation`	–	`doc/html/...`
`clean`	–	Remove object and dependency files
`distclean`	–	Remove all generated files

Required dependencies:

m (C math library)

For faster execution, make sure to build (on branch master) without assertions and with optimization flags enabled:

#!bash
CFLAGS='-O3 -DNDEBUG' make distclean cli

The whole project has been built and tested on the following operating systems and architectures:

Ubuntu 14.04, x86_64

#!bash
$ uname -srpo
Linux 3.13.0-32-generic x86_64 GNU/Linux
$ python -c "import sys; print(sys.byteorder)"
little
$ bash --version
GNU bash, version 4.3.11(1)-release (x86_64-pc-linux-gnu)
$ make --version
GNU Make 3.81
$ gcc --version
gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
$ ar --version
GNU ar (GNU Binutils for Ubuntu) 2.24

FreeBSD 10.0, x86_64 (VirtualBox)

#!bash
$ uname -srpo
FreeBSD 10.0-RELEASE amd64
$ python -c "import sys; print(sys.byteorder)"
little
$ bash --version
GNU bash, version 4.3.24(0)-release (amd64-portbld-freebsd10.0)
$ gmake --version
GNU Make 3.82
$ cc --version
FreeBSD clang version 3.3 (tags/RELEASE_33/final 183502) 20130610
Target: x86_64-unknown-freebsd10.0
Thread model: posix
$ ar --version
BSD ar 1.1.0 - libarchive 3.1.2

Debian Wheezy, PowerPC (QEMU)

#!bash
$ uname -srpo
Linux 3.2.0-4-powerpc unknown GNU/Linux
$ python -c "import sys; print(sys.byteorder)"
big
$ bash --version
GNU bash, version 4.2.37(1)-release (powerpc-unknown-linux-gnu)
$ make --version
GNU Make 3.81
$ gcc --version
gcc (Debian 4.6.3-14) 4.6.3
$ ar --version
GNU ar (GNU Binutils for Debian) 2.22

Running

After building the project, all programs are placed in the bin/ folder. It also contains many example files and a couple of shell scripts that can be used to perform the compression and decompression on a given file and check that they execute successfully:

#!bash
bin/liblz77ppm-test
bin/lz77ppm
bin/test-file.sh
bin/test-pipe.sh

Implementation details

Without claiming to be exaustive, in this section are discussed a few key aspects regarding the implementation of the compressor.

The input and output buffers

In order to avoid unnecessary copies, the sliding window and the look-ahead buffer are implemented as pointers to the input data buffer. The input data can be obtained either from a file descriptor or from a given memory buffer. In the former case, the algorithm allocates a fixed-size buffer, which is filled as needed with data read from the file descriptor. In the latter case, instead, no buffer is allocated, since the algorithm directly works on the original one.

Also the output can be backed by a memory buffer or by a file descriptor. In the former case, the user can provide a preallocated buffer of a certain size, or let the algorithm allocate it as needed. If the output does not fit into the size of the buffer, the algorithm needs to reallocate it, but this is not always possible with a user-allocated buffer (the user himself specifies so to the algorithm).

The sliding window and the look-ahead buffer

As already stated, in order to avoid unnecessary copies, the sliding window and the look-ahead buffer are implemented as pointers to the input data buffer. Their size can change from file to file — when compressing a file, it can be specified by the user — enabling a better compression based on the properties of the data. Actually, the user specifies the maximum size of the sliding window and look-ahead buffer, since their current size can vary during the execution of the algorithm depending on the iteration.

For instance, at the beginning the window is empty and its size is zero, since no data has been already processed. Notice that this differs from many other implementations of the LZ77 compressor, in which the window is initially filled by a given "dictionary". This would enable a slightly better compression for the first bytes, but would make impossibile to work on a given user buffer without further copies.

The figure below shows how the sliding window and look-ahead buffer change during the execution of the algorithm.

Wmax = 4, Lmax = 2

Encoding the match length with Huffman

Since matches in the sliding window are more likely to have short lengths, a Huffman compressor can be employed to compress them. However, a generic static Huffman is not feasible, since it would require two passes (the first pass for counting the number of occurrencies) and therefore the need to store the produced output or to process again the whole input. Similarly, an adaptive version of Huffman would be too overwhelming. The chosen implementation is a precomputed table.

Explicit codes are created for values smaller than a given maximum (8), assuming greater probabilities (and so shorter codes) for smaller values, as shown in the following table.

Value	Code	N. of bits
0	000 000	6
2	11	2
3	10	2
4	01	2
5	001	3
6	000 1	4
7	000 01	5
8+	000 001	6

Notice that a value of 1 is never produced, since that match would be encoded as a symbol token. Instead, the length of 0 is used to mark the end of the stream, and hence it has a long code.

If the length of the match is greater than or equal to 8, the code of 8 is first produced (acts as a prefix). Then the difference between the actual length and 8 is emitted on a fixed number of bits, which is properly sized to accomodate values up to Lmax − 8. The number of bits used for this difference can change from file to file, since the size of the look-ahead buffer Lmax is chosen by the user.

Actually, the implemented version is slightly different, because not only the maximum difference Lmax − 8 can vary, but also the minimum possible encoded value (min_value). Since the size of the sliding window and of the look-ahead buffer is specified by the user, the minimum encoded value for the length is not always 2, but depends on the relation between the size of a phrase token and a sequence of symbol tokens. As the sliding window and the look-ahead buffer scale up, the widths of their respective fields in a phrase token increase too, causing a sequence of (for instance) 2 symbols to be better encoded using two symbol tokens instead of a single phrase token. Therefore, a value of 2 would never be used, and its code can be reused for values of 3, allowing a little improvement in the efficiency of the compressor.

Value	Code	N. of bits
0	000 000	6
`min_value`	11	2
`min_value`+1	10	2
`min_value`+2	01	2
`min_value`+3	001	3
`min_value`+4	000 1	4
`min_value`+5	000 01	5
`min_value`+6 ÷ Lmax	000 001	6

Mantaining a tree to speed-up search in the window

Looking up in the sliding window using a simple linear search is an expensive operation, since it costs O(W * L). Therefore, a binary search tree is constructed on top of the sliding window, reducing the operation to a logarithmic complexity.

In principle, the tree contains all the words of the window, with lengths ranging from 1 to Lmax. However, since any word can be considered as a prefix of another maximum-length word, the size of the tree can be scaled down by keeping only words with length L. The dictionary has therefore a maximum size determined solely by the size of the sliding window W and can be allocated just once as a single array.

Each node in the tree contains a word (of length L) and the indices to the parent and child nodes. Actually, instead of storing the whole word, a simple offset to the beginning of the word inside the sliding window is sufficient. Furthermore, the information about the offset can be associated implicitly to each element of the array.

#!cpp
struct tree_node {
    uint16_t parent;
    uint16_t smaller_child;
    uint16_t larger_child;
};

When a search in the tree is performed, at each step the algorithm needs to obtain the word associated to the current node, and besides, any new word from the window has to be added to the tree in a specific position of the array of nodes. Suppose that the word (of length L) starting at position k in the sliding window is implicitly associated, with a certain function, to the i-th element of the array. Such function is not simply i = k, because the window actually slides over the input data and hence k increases far beyond the maximum allowed i (i.e., W − 1).

If b is the offset of the window inside the input data buffer, the word at position k in the window is added to the array of nodes at position i = (b + k) mod W. In other words, the array of nodes is used like a circular array.

On the other hand, when the algorithm steps through the tree during a search and needs to obtain the word associated to the current node, the offset is calculated as k = (i − (b mod W) + W) mod W, where W is added to ensure that the dividend of the outer modulo operation is always positive (so that, regardless of the C standard, the result is always positive).

Data	Tree

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
bin		bin
doc		doc
liblz77ppm-test/src		liblz77ppm-test/src
liblz77ppm		liblz77ppm
lz77ppm/src		lz77ppm/src
.gitignore		.gitignore
Doxyfile		Doxyfile
Makefile		Makefile
README.md		README.md
UNLICENSE		UNLICENSE

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

lz77ppm

Overview

The library

The command line interface

Directory structure

Compiling

Running

Implementation details

The input and output buffers

The sliding window and the look-ahead buffer

Encoding the match length with Huffman

Mantaining a tree to speed-up search in the window

About

Releases

Packages

Languages

License

antoniomacri/lz77ppm

Folders and files

Latest commit

History

Repository files navigation

lz77ppm

Overview

The library

The command line interface

Directory structure

Compiling

Running

Implementation details

The input and output buffers

The sliding window and the look-ahead buffer

Encoding the match length with Huffman

Mantaining a tree to speed-up search in the window

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages