Columnar Compression Without Columns

by Phil Hofer | March 27, 2023

Many modern data storage systems support transparent compression of your data. Although the most obvious reason to compress your data is to reduce its footprint in storage, data compression can also increase the end-to-end performance of your system once you take into account the available CPU, networking, and storage resources. Concretely, a query engine like Sneller that uses object storage as its primary storage back-end is frequently going to have to move data across a network, and network bandwidth is often a limiting factor for overall system performance.

As an example, imagine you are downloading a zstd-compressed file from S3 object storage. If we try to max out our available bandwidth by renting a 200Gbps-capable c6in.32xlarge instance, we can download data from S3 at a maximum rate of 200 Gbps or 23.28 GB/s. But, a c6in.32xlarge machine has 128 CPU cores, and we can decompress zstd data at over 1 GB/s/core (in decompressed bytes). That means in theory we can turn the 23.28GB/s downloaded stream into 128GB/s or more of decompressed data, assuming we’ve achieved a compression ratio of at least 5.49 (128 / 23.28). This math extends to reads from local storage, too. If you can read 4GB/s of sequential compressed data from an NVMe disk, and you can decompress the data blocks on multiple CPU cores, you can likely multiply your realized read throughput by a nice constant factor.

Given the I/O advantages of compressing your data, it’s not surprising that most columnar databases (and columnar storage formats) have some kind of native support for data compression. Strictly-typed columnar databases typically apply compression to your data after striping your data into columns. Consequently, a clever query engine can determine statically which columns need to be fetched in order to evaluate a particular query, and thus the query engine only has to bear the cost of decompressing some of the data. Of course, if you run a query like SELECT * FROM my_table, you’ll have to decompress every column.

Since Sneller’s query engine is fundamentally row-oriented, and since Sneller supports arbitrary heterogeneous rows, we cannot employ exactly the same tricks as a “pure” columnar storage format. (Consider: Sneller allows users to provide entirely disjoint sets of fields in adjacent rows. One thousand rows with ten fields each that are unique to the row would imply that there are ten thousand “columns” just for those one thousand rows!) Moreover, decompression performance is really critical for our end-to-end performance, as our SQL virtual machine can process more than 4GB/s/core of raw ion data, whereas zstd decompression typically runs at only about 1GB/s/core.

In order to provide some of the performance benefits of compressed columnar storage, we use a technique we call “compression tiling.” Each block of data (a group of rows) has its top-level fields hashed into one of sixteen buckets, and each of the sixteen buckets of data is compressed separately. We call this compression format zion, which is short for “zipped ion.”

Each zion “bucket” encodes both the field label (as an ion symbol) and the field value for each assigned field in each record. Prepended to the sixteen compressed buckets of data is a compressed “shape” bitstream that describes how to traverse the buckets to reconstruct the original ion records. At query execution time, we can elide reconstruction of all the fields of each record that are not semantically important for the query, which means that we can achieve up to a 16x reduction in the amount of time we spend decompressing data. Right now we use the zstd general-purpose compression algorithm for compressing all the buckets and the “shape” bitstream, but this encoding technique is agnostic to the compression algorithm(s) used, so we may mix-and-match compression algorithms in the future.

As an optimization, the hash function that we use for assigning top-level fields to buckets has an adjustable seed that is encoded as part of the bitstream. The encoder picks a seed that provides a the lowest-variance distribution of decompressed sizes for each of the buckets before actually splitting up the rows into buckets. As a consequence, we do not typically have issues with the buckets becoming seriously unbalanced.

An Example

Let’s consider how we would encode the following ion structure:

{ my_string: "hello", my_number: 3, my_bool: false }

If this record was the first record in a stream (i.e. if the symbol table was zero-initialized), then we’d probably have the ion symbols 0x0a, 0x0b, and 0x0c assigned to my_string, my_number, and my_bool, respectively, and for simplicity let’s assume those symbols end up being hashed to buckets 5, 3, and 1. Ordinarily, binary ion encoding of the structure above (in hex) would be db8a8568656c6c6f8b21038c10. Instead, we’d write 8a8568656c6c6f into bucket 5 (the ion field 0xa plus the ion string 'hello'), then 8b2103 into bucket 3 (the ion field 0xb plus the ion integer 3), and finally 8c10 into bucket 1 (the ion field 0xc plus the ion boolean false). In the shape bitstream, we’d write 033501. The first byte 0x03 indicates that we encoded a row with three fields. The next two bytes encode the buckets as individual nibbles, lsb-first. (We always round the encoded size of an individual “shape” into an even number of bytes, so a 3-structure field is encoded as two bytes rather than one-and-a-half bytes.)

{ m y _ d s b t r 8 i a n g 8 : 5 " 6 h 8 e l 6 l 5 o " 6 , c m 6 y c _ n 6 u f m b 8 e b r : 2 1 3 , 0 3 m y 8 _ c b o 1 o 0 l : f a l s e } [ [ [ [ [ [ [ s 0 1 2 3 4 5 h ] ] ] ] ] ] a p m m 8 e y y a m ] _ 8 _ y b b n 8 _ 0 { 8 o u 5 s 3 5 c o 2 m t , l 1 b 6 r 3 1 : e 8 i 5 3 0 0 r n , f 3 : 6 g 0 a 5 : 1 1 l 3 } s 6 " e c h e 6 l c l o 6 " f

We’d repeat the process above for every new row of ion data for the block, and then compress the shape bitstream and the buckets and concatenate them to form a complete compressed zion block. We also include the ion symbol table in the shape bitstream, since both the shape bitstream and the symbol table need to be decompressed in order to unpack even a single structure field.

A careful reader may have noticed that we’re increasing the amount of decompressed data that needs to be emitted in order to describe each record, since we have to encode not only the entirety of each record’s contents distributed across the buckets, but also the shape bitstream. Fortunately, this ends up being an improvement in practice: the shape bitstream is typically highly compressible because adjacent structures typically have identical or nearly-identical “shapes.” Moreover, the individual buckets tend to be more compressible than the raw ion encoding, because there tends to be less entropy (i.e. fewer discrete ion types, fewer field names) in an individual bucket than there would be in the regular ion stream. In practice, zion-compressed ion data with zstd-compressed buckets tends to be about 10% smaller than simply wrapping the ion data with zstd compression naïvely.

Decoding a zion block is just a matter of running all the encoding steps above in reverse. After decompressing the shape bitstream and symbol table, we can map any requested fields (e.g. my_string) to symbol IDs, and we can hash those to determine which bucket(s) need to be decompressed. Then, we iterate the shape bitstream one item at a time and emit an ion structure composed of the field/value pairs encoded in each of the bucket(s) that we decompressed, taking care to omit any fields that we aren’t interested in reconstructing. If we decompress all the buckets and copy out each field/value pair unconditionally, we end up with a result that is bit-identical to the original input ion data. If we only need to produce one field, then we only have to decompress one bucket, and consequently we do (approximately) 1/16th of the decompression work we would otherwise have to do if we just encoded the rows naïvely.

Our SQL engine is quite sensitive to the performance of the “reconstruction” process for ion data from zion blocks, as it is in the critical path from compressed data to ion row data that can be fed to our SQL virtual machine. The implementation of zion.Decoder.Decode uses one of a handful of assembly routines to reconstruct the ion data based on the number of fields that need to be reconstructed. We’ve managed to make this reassembly process quite fast (many GB/s/core) in practice.

Learn More

You can read the source code for the zion format on Github.

Try Sneller for Free

You can try Sneller right now on your own data for free through our playground, or you can sign up for Sneller Cloud, which gives you access to automatic ingest from your S3 buckets and hundreds of CPU cores on which to run your queries.

If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!