Installing "The Hard Way" - Local Sneller

The Hard Way

This “The Hard Way” series isn’t the fastest way to install Sneller, but it is aimed at readers who really want to know how Sneller works. It is intended for developers and operators who want to learn the details of Sneller, starting small on just a local server or laptop and ending with a production-grade multi-node cluster spun up via Kubernetes running on top of object storage.

Introduction

It’s best to learn Sneller from the ground up. Sneller is a full-featured petabyte-scale database service, but we tried to keep it as simple as possible. This series guides you through each aspect step by step.

IMPORTANT: Your computer must support AVX-512 to run these queries, which requires at least an Intel Skylake-generation processor. If you can’t run Sneller on your own PC, you can run these experiments on an AWS, Azure or GCP virtual machine (VM).
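To check whether your machine supports AVX-512, you can inspect the CPU flags (this is a generic Linux check, not something Sneller-specific):

grep -o 'avx512[a-z]*' /proc/cpuinfo | sort -u

If this prints flags such as avx512f, your processor supports AVX-512.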

Introducing SDB

JSON data is everywhere nowadays and data lakes are growing day by day. JSON is a great fit for a lot of data, because it supports nested data and is schemaless. Typical SQL databases struggle with JSON, and a lot of proprietary extensions have been created to deal with this. Sneller is based on PartiQL, which adds standardized extensions to SQL to deal with nested data natively.
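For example, a nested record like the one below (a made-up illustration) can be queried directly on its inner fields, such as repo.name, without flattening it first:

{
  "type": "PushEvent",
  "actor": { "login": "octocat" },
  "repo": { "name": "octocat/hello-world" }
}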

Although JSON is great, its text-based format isn’t ideal for fast processing. That’s why Sneller requires JSON to be ingested before it can be queried. We use a custom format based on binary ION that is compressed to save space.

Sneller ships with a tool called sdb (short for Sneller DataBase) that can be installed using:

go install github.com/SnellerInc/sneller/cmd/sdb@latest

This tool can be used to convert JSON to Sneller’s on-disk format and run queries.
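If the sdb command isn’t found after installation, the Go bin directory is probably missing from your PATH; adding it is a standard Go setup step:

export PATH="$PATH:$(go env GOPATH)/bin"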

Static data

Let’s download some test data, ingest it, and run a query:

wget https://data.gharchive.org/2015-01-01-15.json.gz
sdb pack -o 2015-01-01-15.zion 2015-01-01-15.json.gz
sdb query -fmt json "SELECT COUNT(*) FROM read_file('2015-01-01-15.zion')"
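Each GH Archive file contains one JSON event per line, so if you want to sanity-check the COUNT(*) result, you can compare it against a plain line count of the source file:

gzip -dc 2015-01-01-15.json.gz | wc -l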

Congratulations, you just ran your first Sneller query. This query only returns the number of rows in the dataset, so let’s make it a little more interesting and determine the top 10 repositories in this dataset:

sdb query -fmt json "SELECT repo.name, COUNT(*) FROM read_file('2015-01-01-15.zion') GROUP BY repo.name ORDER BY COUNT(*) DESC LIMIT 10"

Sneller’s SQL allows you to directly access the name field inside the repo object without any nasty syntax or hard-to-remember functions. It treats nested data as a first-class citizen.
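The same goes for other nested fields. For example, assuming the standard GH Archive event fields type and actor.login, a query like this (just an illustration) lists the most active users for push events:

sdb query -fmt json "SELECT actor.login, COUNT(*) FROM read_file('2015-01-01-15.zion') WHERE type = 'PushEvent' GROUP BY actor.login ORDER BY COUNT(*) DESC LIMIT 10"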

Suppose you get some more data and would like to query that second file too. You can do this by converting the second file and querying both files at once:

wget https://data.gharchive.org/2015-01-01-16.json.gz
sdb pack -o 2015-01-01-16.zion 2015-01-01-16.json.gz
sdb query -fmt json "SELECT COUNT(*) FROM read_file('2015-01-01-15.zion') ++ read_file('2015-01-01-16.zion')"

Although this works, it doesn’t scale well to a lot of files. Directly querying the packed files can be useful for analyzing a static dataset, but it isn’t particularly useful in a more dynamic environment.

Dynamic data

The previous example showed how to deal with a static dataset, but often datasets are dynamic and new data arrives at regular intervals. For this purpose, sdb provides the sync command, which automatically ingests data that has been added since the last sync.

The sync command expects to find a definition.json file that lists the files that should be ingested. We will create such a table definition and synchronize.

Note that Sneller expects all databases to be stored in a db folder relative to the root folder. The database is named tutorial and the table is named table, which is why definition.json should be in the db/tutorial/table folder.

mkdir -p db/tutorial/table
cat > db/tutorial/table/definition.json <<EOF
{
  "input": [
    { "pattern": "file://*.json.gz" }
  ]
}
EOF

We will now download two files from the GitHub Archive and ingest them into this table:

wget "https://data.gharchive.org/2015-01-01-15.json.gz"
wget "https://data.gharchive.org/2015-01-01-16.json.gz"

Sneller maintains an index of all files that have been ingested. This index contains hashes of the ingested data to ensure integrity. This index file is protected using an index key, so we need to generate a 256-bit key and store it as a base-64 encoded string in the SNELLER_INDEX_KEY environment variable:

export SNELLER_INDEX_KEY=$(dd if=/dev/urandom bs=32 count=1 | base64)
echo "Using index-key: $SNELLER_INDEX_KEY"

The index key is automatically picked up by sdb from this environment variable, so make sure you use the proper variable name. Write down the index key, because if you lose it, all data has to be ingested again. Note that you can unpack the individual packed files even without the index key; the key is only used for signing and doesn’t actually encrypt the data.¹

sdb -root . sync tutorial table 

The data is ingested and you should see some additional files in the db/tutorial/table folder:

ls -l db/tutorial/table

It shows both an index file and a packed-XXXXXXXXXXXXXXXXXXXXXXXXXX.zion file. The index file is the directory of all ingested files and the resulting packed files; the packed-XXX.zion file holds the ingested data itself.

Now that the data is ingested, the source files can be removed. If you need to get the JSON data back, you can simply run:

sdb unpack -fmt json db/tutorial/table/packed-*.zion
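If you want to store the recovered JSON instead of printing it to the terminal, you can pipe it through gzip (restored.json.gz is just an example file name):

sdb unpack -fmt json db/tutorial/table/packed-*.zion | gzip > restored.json.gz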

To determine whether a file has been ingested, you can also ask sdb for the list of ingested files:

sdb -root . inputs tutorial table

Now that the data has been ingested, it can be queried again:

sdb -root . query -fmt json "SELECT COUNT(*) FROM tutorial.table"

Download some more files from the GitHub Archive and ingest again:

wget https://data.gharchive.org/2015-01-01-{17..22}.json.gz
sdb -root . sync tutorial table 
sdb -root . query -fmt json "SELECT COUNT(*) FROM tutorial.table"

You can either choose to trigger this synchronization when new data arrives or run it at regular intervals.
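As a sketch, a cron entry like the one below runs the sync every 15 minutes; /path/to/root is a placeholder for wherever your db folder lives, and SNELLER_INDEX_KEY must be available in the cron environment:

*/15 * * * * cd /path/to/root && sdb -root . sync tutorial table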

Next…

In this chapter you learned the fundamentals of Sneller, but it’s all running on a single machine and using local storage. This may be good enough for some simple scenarios, but it scales poorly and local disk storage may not be the most robust option. We’ll deal with that in part 2 of this series.


  1. The index-key should be treated as a secret. In this local example we can’t do much to protect it, but in more advanced scenarios the key shouldn’t be exposed.
