by Phil Hofer | April 24, 2023
Many of Sneller’s target use-cases, like telemetry dashboards, monitoring, and log analysis, involve searching through data over specific periods of time. As records are ingested, Sneller builds a lightweight index of the timestamps that appear in them, in order to accelerate queries that only need to operate on a “window” of data within each table accessed by the query. Concretely, we often need to satisfy queries that look like this:
```sql
SELECT COUNT(*)
FROM my_table
WHERE timestamp > DATE_ADD(DAY, -30, UTCNOW()) -- everything in the last 30 days
  AND feature = 'my-feature-string'
```
Sneller can use the expression `timestamp > DATE_ADD(DAY, -30, UTCNOW())` to eliminate many candidate rows from `my_table` at query planning time, rather than having to evaluate that expression for every row in the table.
What is a “Sparse” Index?
If you were going to store time-series data in a traditional SQL database, you would likely build an index on the timestamp field(s) in your schema in order to accelerate your queries. The database would build a data structure like a B-tree or an LSM tree in order to facilitate range queries on rows in your data along the timestamp axis, and every time you inserted a new record, this data structure would have to be updated. (If timestamps were part of your SQL primary key, or if you configured your index as a covering index, then your records would live directly within the B-tree or LSM tree. Otherwise, the index would maintain indirect references to the rows in the table.)
The advantages of a traditional row-level index are that it provides precise row-level range queries in sub-linear time and the ability to iterate the data along this axis (i.e. in timestamp order) without having to perform any additional sorting. However, an index like this also has some drawbacks:
- A complete index consumes space that grows linearly with the size of your table. If you have a multi-TB table, the space overhead of one or more complete indices can be prohibitive.
- Any index you build for your table needs to be maintained as new rows are inserted. In other words, every row insert involves a B-tree or LSM insert, and access to this data structure needs to be synchronized across the threads or machines performing inserts. Typically, tables and indices have to be split into shards in order to accommodate high insert throughput.
One alternative to building complete row-level hierarchical indices is to build something that is less precise (i.e. not “row-level”); these are typically referred to as “sparse” indices. Broadly speaking, a sparse index trades precision for lower space overhead and faster inserts. Instead of building an index on rows, we can build an index on “blocks” or some other feature of the on-disk layout used by the database engine. A common example of a sparse index is the metadata that is stored in Parquet files: typically Parquet metadata includes the minimum and maximum value for each column in the file. A query planner can use that information to quickly determine if any rows in the data would overlap with a time range specified in a query. If not, then the query planner won’t include that block of data in the query plan at all.
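The min/max pruning described above boils down to a simple overlap test against per-block metadata. Here is a minimal sketch of that idea in Go, using a hypothetical `blockMeta` type (Sneller’s actual on-disk metadata format differs):

```go
package main

import (
	"fmt"
	"time"
)

// blockMeta is a hypothetical sparse-index entry: the minimum and
// maximum timestamp observed among the rows in one block.
type blockMeta struct {
	min, max time.Time
}

// overlaps reports whether any row in the block could satisfy a
// query predicate of the form "timestamp > after". If the block's
// maximum timestamp is at or before the cutoff, no row can match,
// so the planner can skip the block entirely.
func overlaps(b blockMeta, after time.Time) bool {
	return b.max.After(after)
}

// ts parses an RFC 3339 timestamp, panicking on bad input.
func ts(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	blocks := []blockMeta{
		{min: ts("2021-12-01T00:00:00Z"), max: ts("2021-12-10T00:00:00Z")},
		{min: ts("2021-12-10T00:00:00Z"), max: ts("2021-12-20T00:00:00Z")},
		{min: ts("2021-12-20T00:00:00Z"), max: ts("2021-12-31T00:00:00Z")},
	}
	after := ts("2021-12-15T00:00:00Z")
	for i, b := range blocks {
		if overlaps(b, after) {
			fmt.Printf("scan block %d\n", i)
		} else {
			fmt.Printf("prune block %d\n", i)
		}
	}
	// prints: prune block 0, scan block 1, scan block 2
}
```

Note that the test is conservative: a block that overlaps the query range may still contain no matching rows, so the row-level predicate is still evaluated for the blocks that survive pruning.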
Sneller’s Monotonic Block Time Index
All of the rows in a Sneller table are stored in S3 objects, and those objects are composed of “blocks” containing about 100MB worth of (decompressed) row data. Sneller builds all of its sparse indices, including timestamp indices, at block-level granularity. Metadata for recently-ingested blocks is stored in the index file for the table, and older metadata is stored in a hierarchical data structure ordered by the time at which the blocks were inserted into the table. The query planner consults this metadata when determining which blocks in the table are relevant to a particular query.
Sneller’s time index tries to exploit the fact that timestamped records inserted into tables tend to be “k-sorted”. In other words, the timestamp on records that were previously inserted tend to be older than the timestamps on records that you will be inserting in the future. Rather than explicitly sorting data as it arrives for insertion, Sneller’s ingestion logic tries to recognize the inherent “k-sorting” of records and emit metadata into the output data so that the query planner can exploit it when deciding which blocks of data need to be scanned. Importantly, we do not have to perform expensive per-row operations like B-tree or LSM-tree inserts as rows are inserted, and we also do not have to worry about synchronizing updates to this index across CPU cores or across multiple machines.
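One simple way to quantify how k-sorted a timestamp stream is: measure the largest backward jump between a record’s timestamp and the running maximum seen so far. This is an illustrative sketch, not Sneller’s ingestion code:

```go
package main

import "fmt"

// kSortedness returns the largest backward jump between a record's
// timestamp and the running maximum seen so far -- a rough measure of
// how far the stream deviates from fully sorted order. A return value
// of 0 means the stream is already sorted.
func kSortedness(ts []int64) int64 {
	var maxSeen, worst int64
	for i, t := range ts {
		if i == 0 || t > maxSeen {
			maxSeen = t
			continue
		}
		if d := maxSeen - t; d > worst {
			worst = d
		}
	}
	return worst
}

func main() {
	// Mostly ascending, with one record arriving slightly late:
	fmt.Println(kSortedness([]int64{1, 2, 5, 4, 6})) // prints 1
}
```

When this deviation is small relative to the span of a block, consecutive blocks still have nearly monotonic minimum and maximum timestamps, which is exactly the property the index exploits.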
Instead of storing the minimum and maximum timestamp values for each block of data within each S3 object, we store a monotonic sequence of timestamps and block intervals for minimum and maximum timestamps. For example, a file composed of 10 blocks would store between 1 and 10 “minimum timestamp” intervals and between 1 and 10 “maximum timestamp” intervals. If each block’s minimum and maximum timestamps are strictly greater than the previous block’s, we keep all 10 intervals. If a block’s minimum or maximum value is less than or equal to the previous block’s, we extend the previous block’s interval to cover the subsequent block(s). Storing the timestamp ranges this way has two benefits: first, the size of the index shrinks as it becomes less precise; and second, the monotonic ordering of the minimums and maximums means we can use differential arithmetic coding to further reduce the encoded size of the index.
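The interval-merging step can be sketched in a few lines of Go. This is an illustration of the idea for the “maximum timestamp” side only, using integer timestamps and a hypothetical `interval` type rather than Sneller’s actual encoding:

```go
package main

import "fmt"

// interval is one entry in a sketch of a monotonic block index: it
// covers blocks [start, end] and stores an upper bound (ts) on every
// timestamp in those blocks.
type interval struct {
	start, end int
	ts         int64 // upper bound, as unix seconds for brevity
}

// compress turns per-block maximum timestamps into a strictly
// increasing sequence of intervals: whenever a block's maximum does
// not exceed the previous entry, that entry simply grows to cover the
// new block. The previous bound is still valid for the new block, so
// the index stays correct -- it just gets smaller and less precise.
func compress(maxs []int64) []interval {
	var out []interval
	for i, m := range maxs {
		if len(out) > 0 && m <= out[len(out)-1].ts {
			out[len(out)-1].end = i // previous bound already covers this block
		} else {
			out = append(out, interval{start: i, end: i, ts: m})
		}
	}
	return out
}

func main() {
	// Block 2 arrived slightly out of order, so it folds into the
	// previous entry: three intervals instead of four.
	fmt.Println(compress([]int64{100, 105, 103, 110}))
}
```

Fully sorted input keeps one interval per block (maximum precision), while badly out-of-order input collapses into a few wide intervals, which is where the “size shrinks as precision drops” property comes from.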
Here’s an example of the Sneller `sdb` command-line tool automatically detecting the `created_at` timestamp field in GitHub Archive data and building a sparse index on that field:
```shell
$ # fetch some gharchive data:
$ wget https://data.gharchive.org/2015-01-01-15.json.gz
$ # see that created_at is an ascending timestamp field:
$ gunzip -c 2015-01-01-15.json.gz | jq -r .created_at | head
2015-01-01T15:00:00Z
2015-01-01T15:00:01Z
2015-01-01T15:00:01Z
2015-01-01T15:00:03Z
2015-01-01T15:00:03Z
2015-01-01T15:00:03Z
2015-01-01T15:00:03Z
2015-01-01T15:00:04Z
2015-01-01T15:00:04Z
2015-01-01T15:00:05Z
$ # pack up the data; this should build an inline index:
$ sdb pack -o gha.zion 2015-01-01-15.json.gz
$ sdb describe-file gha.zion
file://gha.zion 3.180 MiB
trailer: 1 blocks, 18874368 bytes decompressed (5.66x compression, zion+zstd)
        index created_at 1 left 1 right [2015-01-01T15:00:00Z to 2015-01-01T15:59:59Z] # <--- index was created!
total blocks:       1
total compressed:   3.180 MiB
total decompressed: 18.000 MiB (5.66x)
```
The `sdb pack` invocation automatically detected that `created_at` was an indexable timestamp field, and recorded minimum and maximum values of 2015-01-01T15:00:00Z and 2015-01-01T15:59:59Z for the block.
Our playground has GitHub archive data from 2021 already loaded. Try running the following queries and see how the reported number of bytes scanned changes as the time range is expanded:
```sql
SELECT COUNT(*)
FROM gha
WHERE created_at > `2021-12-29T00:00:00Z` -- last two days
  AND type = 'PullRequestEvent'
```

```sql
SELECT COUNT(*)
FROM gha
WHERE created_at > `2021-12-25T00:00:00Z` -- last five days
  AND type = 'PullRequestEvent'
```

```sql
SELECT COUNT(*)
FROM gha
WHERE created_at > `2021-12-15T00:00:00Z` -- last fifteen days
  AND type = 'PullRequestEvent'
```
Try Sneller for Free
You can try Sneller right now on your own data for free through our playground, or you can sign up for Sneller Cloud, which gives you access to automatic ingest from your S3 buckets and hundreds of CPU cores on which to run your queries.
If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!