Sneller: Querying terabytes of JSON per second

by Frank Wessels | May 3, 2023

In this blog post we will show that Sneller can query terabytes of JSON per second on a medium-sized compute cluster.

By leveraging AVX-512 instructions, Sneller is able to process many records in parallel on a single core, giving it a massive performance advantage over competing solutions.

Sneller supports JSON data natively without the need to define any schemas, and it supports enormous datasets spanning from terabytes (TBs) to petabytes (PBs) by leveraging object storage like S3.

GitHub archive dataset

For testing we will be using the GitHub archive dataset. For the year 2021 it contains over 1 billion records, and our test dataset includes the full payload field for each event. Note that the structure of the payload field differs per event type and is fairly complex, containing 100+ fields.

The dataset is stored on S3 in compressed form, measures about 390 GB, and is split across several hundred objects.

As you can see in the list of queries below, Sneller makes it intuitive to work with JSON and natively supports “Path Expressions” for dereferencing nested fields. For example, payload.comment.body references the body field inside the comment field inside the payload field (for “IssueCommentEvent” events).

But before we get started, let’s query the total size and number of records in the test dataset:

curl -i -G 'https://play.sneller.ai/query?database=demo&json' \
--data-urlencode 'query=SELECT COUNT(*) FROM gha'
HTTP/2 200 
date: Fri, 05 May 2023 16:50:04 GMT
...
x-sneller-max-scanned-bytes: 3193248415744
...
{ "count": 1020957952 }

We can see that the total size as reported by x-sneller-max-scanned-bytes is 3,193,248,415,744 bytes, or 2.9 TiB. We can trivially compute the average size per record by dividing it by the 1,020,957,952 records, which gives us about 3127 bytes per record.
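As a quick sanity check, the arithmetic above can be reproduced in a few lines of plain Python (the two constants are copied from the response header and body above; nothing here calls the Sneller API):

```python
# Values copied from the dry-run response above.
scanned_bytes = 3_193_248_415_744  # x-sneller-max-scanned-bytes header
records = 1_020_957_952            # COUNT(*) result

# Convert to binary terabytes (TiB) and derive the average record size.
print(f"total: {scanned_bytes / 2**40:.1f} TiB")        # total: 2.9 TiB
print(f"avg record: {scanned_bytes // records} bytes")  # avg record: 3127 bytes
```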

Sneller makes it easy to do free “dry runs” to get insights into the behavior of your queries.

Queries

The following benchmarks are performed using Sneller’s public playground, which is scaled to run on 640 vCPUs (320 physical cores with 2 threads/core). Note that these are non-trivial queries that scan the full dataset, giving a balanced overview of the performance.

Let’s run the first query “by hand” via curl and measure the execution time with time (note: you can copy-paste the command into your terminal to test this yourself):

time curl -G 'https://play.sneller.ai/query?database=demo&json' \
--data-urlencode $'query=SELECT type, COUNT(*) FROM gha GROUP BY type ORDER BY COUNT(*) DESC'
{"count": 501471103, "type": "PushEvent"}
...
{"count": 2770538, "type": "GollumEvent"}
real    0m1.498s

So this query took 1.498 seconds. The table below shows the results of some other queries:

Query                           Took (sec)   TiB/sec   GiB/sec/vCPU
Overview of all event types 1        1.498       2.0            3.1
Find Linus Torvalds’ repos 2         2.419       1.2            1.9
Date histogram of events 3           1.914       1.6            2.4

As you can see, the query performance is consistently over 1 TiB/sec.

The last column of the table shows the amount of data that was scanned per second per vCPU, which is about 2 GiB/sec/vCPU or more.
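The throughput columns can be reproduced with a back-of-the-envelope Python calculation, assuming each query scans the full 3,193,248,415,744 bytes on 640 vCPUs. The per-vCPU column reproduces exactly; the TiB/sec column comes out about 0.1 lower for two rows, presumably due to rounding in the table:

```python
# Derive throughput from scanned bytes, wall-clock time, and vCPU count.
scanned_bytes = 3_193_248_415_744
vcpus = 640

def throughput(seconds):
    tib_per_sec = scanned_bytes / 2**40 / seconds           # whole-cluster rate
    gib_per_vcpu = scanned_bytes / 2**30 / seconds / vcpus  # per-vCPU rate
    return round(tib_per_sec, 1), round(gib_per_vcpu, 1)

for label, secs in [("event types", 1.498),
                    ("Torvalds repos", 2.419),
                    ("date histogram", 1.914)]:
    print(label, throughput(secs))
```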

Affordable performance

Sneller Cloud is priced at $50/PB scanned, which means each of the queries above would cost just $0.14.
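The $0.14 figure works out as follows, assuming “PB” here is read as a binary pebibyte (2^50 bytes), which is the interpretation that reproduces the quoted cost:

```python
# Cost of one full-dataset scan at $50 per PB scanned.
scanned_bytes = 3_193_248_415_744
price_per_pb_usd = 50.0

cost = scanned_bytes / 2**50 * price_per_pb_usd
print(f"${cost:.2f} per full-dataset query")  # $0.14 per full-dataset query
```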

This compares very favorably to alternative solutions such as BigQuery and Athena, which are priced at $5 per terabyte scanned and do not offer the same level of performance.

Partitioned queries

Here are measurements for some more advanced queries. These focus on a specific event type and also touch the (large) payload field for event-specific additional information.

Query                        Took (sec)   Scanned (TiB)
Search issues using regexp 4      1.781            0.33
PRs taking over 180 days 5        4.594            1.45

Sneller still maintains an “interactive” level of performance on this large dataset, thanks to its AVX-512 assembly acceleration.
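Because these queries filter on a single event type, they scan only part of the 2.9 TiB dataset. A quick Python calculation (numbers taken from the table above) shows the fraction touched and the effective scan rate for each:

```python
# Fraction of the dataset scanned and effective scan rate per query.
total_tib = 3_193_248_415_744 / 2**40  # ~2.9 TiB

for name, seconds, scanned_tib in [("regexp search", 1.781, 0.33),
                                   ("long-running PRs", 4.594, 1.45)]:
    frac = scanned_tib / total_tib
    rate = scanned_tib / seconds
    print(f"{name}: {frac:.0%} of dataset at {rate:.2f} TiB/sec")
```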

Try for yourself

You can repeat all these experiments (or tweak the queries, e.g. look for your own repos during 2021…) by either:

  • copy-pasting the curl commands into your terminal, or
  • copy-pasting the queries below into Sneller’s online playground to try it out in your browser (yes, even on your mobile phone!)

More background on Sneller

Sneller is a high-performance SQL engine built to analyze petabyte-scale unstructured logs and other event data. Here are a couple of major differentiators between Sneller and other SQL solutions:

Explore further

There are three easy ways to get started with Sneller:

or check out our documentation.

Appendix: queries


  1. SELECT type, COUNT(*) FROM gha GROUP BY type ORDER BY COUNT(*) DESC ↩︎

  2. SELECT DISTINCT repo.name FROM gha WHERE repo.name LIKE 'torvalds/%' ↩︎

  3. SELECT dayOfWeek, type, COUNT(*) FROM gha GROUP BY EXTRACT(DOW FROM created_at) AS dayOfWeek, type ORDER BY dayOfWeek, COUNT(*) DESC ↩︎

  4. SELECT payload.comment.body FROM gha WHERE type = 'IssueCommentEvent' AND payload.comment.body ~ 'Sherlock [A-Z]\\w+' ↩︎

  5. SELECT COUNT(*) FROM gha WHERE type = 'PullRequestEvent' AND DATE_DIFF(DAY, payload.pull_request.created_at, created_at) >= 180 ↩︎

Try Sneller for Free

You can try Sneller right now on your own data for free through our playground.

If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!