SMASH: an efficient compression algorithm for microcontrollers

One of the things that is driven from the top in SEGGER is that we can always do better. Not satisfied with standard schemes, we wanted to optimize emCompress, SEGGER’s compression library, for:

Very fast decompression
High compression ratio (uncompressed size divided by compressed size)
Small decompressor
Limited state in RAM when decompressing

With some experimentation, the end result is a coding technique we named SMASH, the Small Microcontroller Advanced Super-High compression scheme. SMASH compresses as efficiently as DEFLATE, uses only a few words of decompressor state, and has an ultra-small, ultra-fast, ultra-simple decompressor.

Why do this?

In discussing the linker and the future direction of our emCompress product, we examined the possibility of improving emCompress and, using that improvement, fold it into the SEGGER Linker. The company ethos is that we develop and use our own software, and therefore it is natural to use emCompress components as a way of compressing data sections held in flash that must be copied to RAM on startup. When we improve emCompress, we also improve the SEGGER Linker, which is a win all round!

LZSS, the ubiquitous algorithm

Standard compression schemes, such as DEFLATE, LZMA, LZJU90, and LZ4, use an LZSS-style kernel to parse the input string into a stream of literals and matches, a match being a back reference to a substring that can be subsituted at the cursor position. Effective compression relies on being able to encode the literals and references in fewer bits than the string being replaced.

Once literals and matches are identified, the encoder takes each and emits them to the bitstream. Typical coding is a leading flag bit, identifying a literal or a match, followed by an 8-bit literal or a (distance, length) pair referencing previous content behind the cursor.

The main differences between all these algorithms is the way that the literal and (distance, length) pairs are encoded and constraints placed on the distance and length of the match because of this encoding.

Encoding with standard algorithms…explained

LZSS-based encoders try to reduce the coding of literal, distance, and length using an additional compression scheme:

Plain binary encoding does nothing to reduce literals, distance, and length, relying entirely on the string parser to detect redundancies.
DEFLATE uses two separate dynamic Huffman codes, one to encode literals and lengths and the other to encode distances.
LZJU90 uses two bit-oriented prefix encodings for distance and length and leaves literals uncompressed.
LZ4 uses a byte-oriented prefix encoding for distance and length and leaves literals uncompressed.
LZMA uses a single range coder to encode literals, distances, and lengths, dynamically adapting its probability model as it encodes.

Let’s examine these for suitability.

Plain binary coding

Fixed-width binary coding of the distance and length does not produce an outstanding result, but it is fast and requires no additional state memory when decompressing. We can discount plain binary encoding as it does not meet the “good compression” requirement.

DEFLATE

DEFLATE offers good compression, and decompression runs in a few kilobytes of RAM. Unfortunately it’s quite complex to decompress using two Huffman codes that are themselves encoded with a canonical Huffman code. It’s a clever solution to compres match distance and length, and it even compresses literals, which LZ4 and LZJU90 do not.

DEFLATE is specified with a minimum match length of three bytes. It stands to reason that you can find many more matches with a length of two than with a length of three. By experimentation, using a nonstandard minimum match length of two rather than three results in worse compression, so the default is the best choice.

We can discount DEFLATE as a candidate as it is complex to decode (around 800 lines of C code) and even 2K of memory is more than you would want to dedicate on a small microcontroller at startup or during application decompression.

LZMA

LZMA offers excellent compression, without question. It has a minimum match length of two and encodes everything very efficiently using a range coder and with some items encoded in a bit tree. Being able to compress efficiently requires very large tables, from 6 kilobytes to 6 megabytes. And unfortunately, the decoder needs to maintain an identical model to the encoder, so decoding also needs from 6 kilobytes to 6 megabytes.

Not only is it RAM-hungry, the decoder is also very complex and is unsuited to assembly coding: the LZMA decoder in emCompress is approximately 1,000 lines of code excluding common emCompress library code.

The complexity of the decoder alone is enough for immediate rejection. This just isn’t going to fly on a small microcontroller as it’s too complex too RAM-hungry.

What it does do is demonstrate that a minimal match of two bytes with efficient encoding does a much better job of compression than a minimal match of three bytes.

LZJU90

This scheme uses prefix coding of length and distance leaving literals uncompressed. The prefix coding schemes for distance and length are different, tuned to the restricted set of match lengths (3 to 256) and distances (1 to 32255).

The restricted match lengths and distance, imposed by the prefix encoding, limit the effectiveness of the format. For instance, consider two identical 128K files concatentaed and then compressed: it is impossible for LZJU90 to encode a 128K match distance and 128K match length to describe that the second 128K of the 256K combined file is identical to the first 128K. We would like to allow this capability…so unbounded match lengths and distances are a win.

Although LZJU90 is simple and therefor small, and requires no complex decoder state, the compression is simply not as efficient as DEFLATE, which is one of our goals. We must, therefore, reject LZJU90 as a candidate–but it does provide a basis for a new scheme.

LZ4

LZ4 is optimized for speed and sacrifices its ability to compress well: its minimum match length is 4 and so it has no possibility of producing well-compressed images. It does not require additional complex decode state and has a simple decoder. But for us, LZ4’s byte-prefix encoding scheme can be rejected because it does not meet the “excellent compression” requirement.

Surely there must be a modern encoder that uses the LZSS scheme but achieves really good compression with a near-zero decode footprint?

SMASH, a new encoding technique

From the above we can characterize what is required, and what must be rejected, when constructing an effective compression scheme:

It’s much more common to find short matches than long matches, and LZMA demonstrates excellent compression with a minimum match length of two bytes. Therefore, we require a minimum match length of two bytes, not three or four, to find as many potential matches as possible, and to encode short matches efficiently.
No complex range coder or Huffman codes as these introduce a large RAM state into the decoder along with additional decoder complexity and therefore a larger ROM footprint.
It must be possible to encode unbounded match lengths and distances to copy both far-flung matches and very long matches.
The prefix coding schemes of LZJU90 offer fairly good compression of lengths and distances, but the limited prefix coding is not suitable for two-byte minimum lengths and unbounded length and distance values.
Compressing literals should not be necessary for random binary data, such as machine code, and we assume that each byte of literal data has an equal probability of occuring. This is an assumption, but a valid one.

We decided to use a prefix encoding, inspired by LZJU90, in our SMASH algorithm. Match lengths of 2 to 4 bytes, which are very common, are specificially coded with high density (just two bits), with longer lengths using a modified prefix encoding where the “continue/end” flag is infix rather than prefix and is always present (unlike LZJU90). The distance encoding has a similar style of prefix coding optimized for these values.

Using a set of standard benchmark inputs, and some drawn from typical linked object code in the SEGGER Linker, we settled on a scheme that delivers excellent results. Here’s a sample from the SEGGER Linker, when compressing approximately 50K of code and data that must run in RAM:

LZMA:    30,704 bytes, 60.4% of original
DEFLATE: 32,592 bytes, 64.1% of original
SMASH:   32,878 bytes, 64.7% of original
LZJU90:  34,793 bytes, 68.4% of original
LZ4:     38,175 bytes, 75.1% of original

Althougb this is a single sample, it is representative of the effectiveness of each of the schemes. Of course, some compression algorithms will do very well on degenrate data whilst others will fare less well. However, SMASH achieves an overall compression comparable to DEFLATE whilst at the same time being much simpler and requiring a much smaller decoder state. SMASH and DEFLATE trade places depending upon input data, but are usually within 1% of each other on identical data sets, with 0.5% being common.

Whilst SMASH is not as efficient as LZMA, it is an order of magnitude less complex, measured by source code lines, and does not require decode state measured in multi-kilobytes or even megabytes for efficient compression.

When time permits, and with the final release of SMASH made, a follow-up post will detail the relative decompressor performance and object code sizes of the various algorithms.

What’s next for SMASH?

There is a possibility to improve the compression effectiveness of SMASH further by tuning the string parser that finds matches, although it is still fast and efficient using some coding heuristics. Without significant effort, it is not possible to find an optimal encoding efficiently as the bit length of the (distance, length) pair is value dependent, not constant.

There are a few other tweaks to both the string parser, the compressor, and the encoded format that we haven’t described here, nothing revolutionary: but small efficiencies add up to a significant result.

Conclusion

SMASH compresses well, decodes quickly, requires only a few words of decoder state held in machine registers, and is small. It has compression that is better than LZJU90 and LZ4 and is comparable with DEFLATE. The decoder is far simpler than these formats delivering an ultra-small ROM footprint.

SMASH is ideal for minimal-footprint decompression on the target and will be added to both the SEGGER linker and startup code in Embedded Studio for decompression of constants and code copied to RAM during startup and to emCompress, our general purpose compression software for embedded systems and desktop computers.

SEGGER Blog

What's Cooking at SEGGER?