Introducing Iguana: extremely fast SIMD-optimized decompression

by Piotr Wyderski | May 30, 2023

Extreme performance takes no prisoners, often to the point of reinventing the wheel. As many wheels have already been invented, one might wonder what makes ours unique. Let’s save the thousand-word answer for later and start with a single picture.

[Figure: benchmark results]

Yup, there’s no mistake. Thanks to a format designed specifically to take full advantage of modern SIMD extensions, decompression runs at 13 gigabytes per second on a single AVX-512-capable core (11th Gen Intel(R) Core(TM) i9-11950H @ 2.60GHz), or at 6 GB/s with the optional entropy compression enabled. That is almost twice the performance of LZ4, yet at a compression ratio close to 4:1. With entropy compression, Iguana decompresses 244% faster than zstd at only slightly less favourable compression ratios; without it, Iguana is 6 times faster than zstd.

While Sneller primarily employs Iguana to compress logging data sets, the algorithm performs well on the widely accepted Silesia corpus benchmark too:

[Figure: Silesia corpus benchmark]

This huge performance boost is possible thanks to a novel encoding of the input streams and a careful implementation.

Background

Our objective at Sneller is to query large quantities of data incredibly fast. Alas, as no data item can be processed before it is ingested, efficient data delivery is of paramount importance. Two factors can be identified as the ultimate bottlenecks:

  1. The bandwidth of the physical networking hardware.
  2. The pace at which the application can consume the delivered input.

The cloud-native nature of Sneller makes the physical bandwidth largely a constraint we cannot address. We are restricted to whatever the prevalent cloud infrastructure providers can offer within our budget. Even though many instances come equipped with impressive single or even double 100-gigabit links – a bandwidth that belonged to the realm of science fiction not long ago – even the double link delivers merely 25 gigabytes per second. This value is already below the saturation point of the current Sneller engine, and we plan to keep raising that bar even higher.

Compression to the rescue

Since increasing the physical bandwidth is not viable, the next best thing is to raise the effective throughput. Simply put: compress the data. The format Sneller uses under the hood is derived from Amazon Ion, which is a binary form of JSON. It is convenient to the point that we can process it directly at the AVX-512 assembly level, and it is much more compact than textual JSON as well. The compactness is, however, restricted to a clever representation of individual items: no redundancy typically present across data items is extracted. Feeding ION input into a universal compression pipeline could make the dog hunt again. And it is indeed the case – our survey of compression algorithms confirms that ION files do compress well.

Compression is a mixed blessing, though. While the effective bandwidth increases as intended, so does the computational cost of compression and decompression. Swapping the rock for a hard place helps only a little overall: the network is no longer a problem, but decompression leaves little computing power for the actual data processing. The overhead varies from one algorithm to another, but it is clearly visible in all cases. The unquestionable king of the hill nowadays is zstd, but even with this sophisticated and cleverly optimized algorithm, the decompression overhead is in the ballpark of 48%, as perf traces witness.

One option would be to optimize the best zstd implementation available, and Sneller followed this route. A substantial speedup was observed, but the algorithm’s complexity makes it hard to push the optimization effort beyond a certain point. While the algorithm can achieve excellent compression ratios, that feature remains locked away from us behind an impenetrable wall of decompression cost.

Lizard

Designing a competitive compression algorithm from scratch is not trivial, so surveying promising candidates seemed like a good start. The most promising option was lizard – efficient compression with very fast decompression. Its performance figures were impressive, yet still short of what we wanted at Sneller. On the bright side, its implementation was more straightforward than that of zstd, and some helpful documentation of its internals was also available. The high-level structure is as follows:

  1. The top-level part is an LZ77 derivative with a neatly designed constellation of tokens.
  2. At the bottom, there is an optional entropy encoder based on the Finite State Entropy algorithm.

Our vectorization effort stopped as suddenly as it started. No matter how clever the algorithm was, the number of dependencies it inherited from its predecessors made it inherently scalar. Modern processors can execute instructions out of order and – in lucky cases – in parallel, but the sustainable speedup is still pretty low. Matching the capabilities of the ever-hungry downstream data-crunching pipeline cried out for a radically different approach. If anything could help us at all, it was the AVX-512 SIMD extension.

Scalar and SIMD do not mate

However, scalar dependencies and SIMD processing don’t mate. Specifically, four types of problems make vectorization of Lizard next to impossible:

  1. A straightforward implementation results in many unpredictable branches, incurring a significant branch misprediction penalty.
  2. Literals are mixed with variable-length uints, making it impossible to infer the semantics of an arbitrary bitstream portion without interpreting all the preceding tokens. Only the current state of the Lizard state machine can resolve the ambiguities. That is not an issue for a scalar implementation of the decoder, but it suffices to make a vectorized one fight a losing battle.
  3. A similar flaw haunts the variable-length unsigned integer encoding. Should a byte be interpreted as part of a longer run of payload bytes or as a marker of the value that follows? Again, only the state machine can provide a decisive answer.
  4. The FSE algorithm is a member of the tANS (tabled Asymmetric Numeral Systems) family of entropy coders. The t, standing for tabled, indicates one cause of performance problems. Not only does the algorithm need to traverse large state tables, but the stream is also consumed at the granularity of individual bits instead of the bytes the processor can handle so much more efficiently. The scenario is about as bad as it gets for a modern SIMD-capable processor like the AVX-512-enabled x64 chips.

Introducing Iguana: vectorized decompression

The hardware-assisted masked execution of AVX-512 instructions alleviates the first problem. While not ideal, mostly branchless execution is possible with a meticulous implementation. Only the loop-like branches remain, but these can be predicted accurately most of the time and hence incur little to no execution-time penalty.
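
For intuition, here is a scalar Go sketch of the select idiom that AVX-512 masking performs in hardware, one predicate bit per vector lane (the function is ours, purely illustrative):

// maskedSelect models a masked blend: both sides are computed and the
// mask picks the result, so there is no data-dependent branch for the
// CPU to mispredict. AVX-512 instructions such as VPBLENDMB apply the
// same idea to whole vectors at once, with the predicate in a k-register.
func maskedSelect(mask, a, b uint64) uint64 {
    return (a & mask) | (b &^ mask) // mask is all-ones or all-zeros
}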

The second problem is unfixable without altering the Lizard stream format. But since interoperability was not a design requirement, the Iguana compression format was born. With that constraint removed, the fix becomes trivial: we merely need to split Literals_Stream into two separate streams. One contains the actual literals and the other solely the varuints, so the semantic dependence on the decoder state is removed.
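
Conceptually, the decoder now consumes independent buffers along these lines (a sketch with hypothetical names, not the actual Iguana identifiers):

// iguanaStreams illustrates the split: each stream can be parsed on
// its own, without replaying the tokens that reference it.
type iguanaStreams struct {
    tokens   []byte // LZ77 tokens
    literals []byte // literal bytes only, copied verbatim to the output
    varuints []byte // lengths and offsets in the varuint encoding discussed below
}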

The third problem survives the splitting, however. The encoding used by Lizard and many other LZ77 derivatives is remarkably dense under typical input data statistics:

  • Values not exceeding 253 are encoded as a single byte carrying that value.
  • Values exceeding 253 and not exceeding 65535 are encoded as a byte 254 followed by two little-endian bytes carrying the value.
  • Values exceeding 65535 are encoded as a byte 255, followed by three little-endian bytes carrying the value.

Alas, it is not self-synchronizing: it is trivially decodable by scalar code, but no vectorized parser of reasonable complexity can extract multiple values in one go. There is no way around changing the encoding, which constitutes another incompatibility with Lizard.
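
A scalar decoder for the scheme above takes only a few lines of Go (a sketch of the idea, not code lifted from Lizard). The data dependency is plain to see: the meaning of src[1] is known only after src[0] has been inspected, so the lanes of a vectorized parser cannot decide locally where values begin.

// decodeVaruint256 reads one value from src and reports how many
// bytes were consumed. A 254 or 255 byte inside the payload looks
// exactly like a marker, so only sequential decoding resolves it.
func decodeVaruint256(src []byte) (v uint32, n int) {
    switch src[0] {
    case 254:
        return uint32(src[1]) | uint32(src[2])<<8, 3
    case 255:
        return uint32(src[1]) | uint32(src[2])<<8 | uint32(src[3])<<16, 4
    default:
        return uint32(src[0]), 1
    }
}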

From base256 to base254

Many candidates for a self-synchronizing varuint encoding exist, but to keep the density of the original encoding, the prefix scheme is retained. The decoding ambiguity is caused by the presence of the bytes 254 and 255 in the payload, so these are simply prohibited from occurring there. This step effectively transforms the original encoding from base 256 into base 254. The encodable range drops from 16777216 to 16387064 values, but since these values encode run lengths, the entire 24-bit range is not strictly required. We need just enough capacity to compactly encode typically occurring run lengths, and runs approaching 16 megabytes are hardly ever seen in practice. Should the capacity prove insufficient, Iguana (and Lizard, at a slightly higher threshold) would simply emit two or more tokens to cover the excessively long run. The bitstream remains valid, so it is not a big deal.
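
To illustrate the idea, here is one plausible base-254 layout in Go – payload bytes become little-endian base-254 digits in the range 0..253, so 254 and 255 can only ever be markers. This is a sketch of the principle, not necessarily the exact Iguana wire format:

// encodeVaruint254 appends v using base-254 digits. Since no payload
// byte can be 254 or 255, any occurrence of those bytes unambiguously
// starts a value – exactly the property a SIMD scanner needs to find
// all value boundaries in a block with a single compare.
func encodeVaruint254(dst []byte, v uint32) []byte {
    switch {
    case v <= 253:
        return append(dst, byte(v))
    case v < 254*254:
        return append(dst, 254, byte(v%254), byte(v/254))
    default: // v < 254*254*254 == 16387064
        return append(dst, 255, byte(v%254), byte(v/254%254), byte(v/(254*254)))
    }
}

// decodeVaruint254 is the scalar inverse of encodeVaruint254.
func decodeVaruint254(src []byte) (v uint32, n int) {
    switch src[0] {
    case 254:
        return uint32(src[1]) + 254*uint32(src[2]), 3
    case 255:
        return uint32(src[1]) + 254*uint32(src[2]) + 254*254*uint32(src[3]), 4
    default:
        return uint32(src[0]), 1
    }
}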

From tables to ranges (or: tANS to rANS)

Concerning the fourth problem: saturating the memory bus with massive amounts of gather/scatter requests while competing for cache capacity with the actual data being processed is precisely what should be avoided at all costs in a high-performance scenario. Luckily, tANS has an older sibling: rANS, or range ANS in full. It is conceptually exquisite and enjoys a compact implementation, yet even its author does not praise it much, due to its heavy demand for efficient integer multiplication hardware. But not only do AVX-512-capable processors have efficient vectorized multipliers, these multipliers also operate independently of the load/store units.
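
For a flavour of why rANS suits such hardware, here is a minimal scalar sketch of a single decode step, assuming the symbol frequencies sum to 1 << scaleBits. It is illustrative only – Iguana’s actual entropy coder differs in its interleaving and renormalization details:

const scaleBits = 12 // total frequency M = 1 << scaleBits

// decodeStep consumes one symbol from the coder state x, given that
// symbol’s frequency and cumulative frequency (looked up from the slot
// by the caller). The hot path is one multiply, one shift and one add
// – no large state table, so the load/store units stay free for the
// payload. Renormalization, which refills x from the byte stream, is
// omitted for brevity.
func decodeStep(x, freq, cum uint32) uint32 {
    slot := x & ((1 << scaleBits) - 1) // x mod M selects the symbol
    return freq*(x>>scaleBits) + slot - cum
}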

Final words and future plans

Iguana is part of the open-source Sneller, so you can check it out if you are interested in the details. In addition to the optimized AVX-512 assembly implementation, it also contains a reference implementation written entirely in Go.

We are already using Iguana compression in Sneller Cloud for very fast and efficient log analysis. We plan to release it as a standalone compression library for Go in the near future.

Appendix A: Benchmarking

Two forms of benchmarking are relevant for comparative performance analysis. One is to rely on a standardized input, and currently the Silesia corpus is the weapon of choice. The zstd and lz4 command-line tools have the built-in benchmarking option ‘-b’. Since Iguana still lacks a stand-alone compression utility, Phil Hofer created the compbench program to enable performance comparisons. Using the tool is as simple as:

$ go install github.com/SnellerInc/compbench@latest
$ compbench

Note that Iguana currently depends on the AVX-512 instruction set with the VBMI2 extension. In its absence, a scalar, non-optimized reference implementation in Go is activated, and the results – while still correct – will fall far short of the figures shown above. A variant not relying on the VBMI2 extension is currently under code review, so running the decompressor on any AVX-512-capable machine, including Skylake-X, will soon be possible. The early benchmarking figures are encouraging, suggesting that no less than 90% of the VBMI2 performance can be recovered. We plan to stop there, however, as AVX-512 support is required by the Sneller product anyway.

While Iguana performs (surprisingly) well in the universal compression case, the other form of benchmarking takes the ION context into account. Sneller can use both zstd and Iguana to compress its data sets, so a comparative benchmark is available in the zion directory: go test -bench . The zion data format usually gives Iguana roughly 50% higher decompression performance and compression ratio than in the Silesia case.

Try Sneller for Free

You can try Sneller right now on your own data for free through our playground.

If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!