Installing "The Hard Way" - Using Kubernetes

Introduction

Although it’s technically possible to run Sneller on your own VMs, there are some challenges that you need to solve:

  1. A load-balancer is required to distribute incoming requests over the distinct Sneller instances.
  2. The Sneller daemon needs to be able to connect to the workers, so it needs some kind of service discovery.
  3. When the load changes, you need to be able to scale the number of nodes dynamically.

Kubernetes

Kubernetes solves most of these challenges and is our preferred way of deploying Sneller on your own infrastructure. In this tutorial, we will assume you already have a Kubernetes cluster. You can deploy Kubernetes on your local computer (docs) or run a Kubernetes cluster in AWS, Google Cloud Platform or Microsoft Azure.

To run Kubernetes locally, make sure you have Docker Desktop installed and enable Kubernetes in the Docker Desktop settings. Then switch to your local Kubernetes context and check that it is available:

kubectl config use-context docker-desktop
kubectl cluster-info

This tutorial will show the most basic way of deploying Sneller and doesn’t use namespaces, service accounts, TLS ingress, etc. It also uses Minio object storage in its most basic form and doesn’t provide highly available, redundant object storage.

Helm is used as the Kubernetes package manager, so make sure you have installed Helm on your local machine.
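
You can quickly verify that the Helm CLI is available; this simply prints the installed client version:

helm version --short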

Install Minio in Kubernetes

DISCLAIMER: This example is not a guide on how to deploy Minio in Kubernetes. It is just a basic setup that is easy to use. Production-grade installations should follow the official Minio installation instructions instead.

First we will add the Minio Helm repository and install the operator:

helm repo add minio https://operator.min.io/
helm install minio-operator minio/operator

Now that the Minio operator has been installed, we can install the Sneller instance of Minio. We will use only a single server for this instance and custom credentials instead of the default Minio credentials:

export AWS_REGION="us-east-1"
export AWS_ACCESS_KEY_ID=$(tr -dc '[:alpha:]' < /dev/urandom | fold -w 20 | head -n 1)
export AWS_SECRET_ACCESS_KEY=$(tr -dc '[:alpha:]' < /dev/urandom | fold -w 20 | head -n 1)
helm install \
  --set-string "tenant.name=sneller" \
  --set "tenant.pools[0].servers=1" \
  --set 'tenant.certificate.requestAutoCert=false' \
  --set-string "secrets.accessKey=$AWS_ACCESS_KEY_ID" \
  --set-string "secrets.secretKey=$AWS_SECRET_ACCESS_KEY" \
  minio-sneller minio/tenant
export S3_ENDPOINT=http://sneller-hl:9000
echo "Using AWS Access key ID: $AWS_ACCESS_KEY_ID"
echo "Using AWS Secret access key: $AWS_SECRET_ACCESS_KEY   (keep this private)"

When the installation has succeeded, two new services will be available in your cluster:

  • sneller-console (port 9090) that provides access to the web console of your Minio installation. Run kubectl port-forward svc/sneller-console 9090 and point your browser to http://localhost:9090 to access the console. You can enter the AWS access key ID and secret access key to gain access. Note that the console service is not required by Sneller itself, but it may be useful to look at what’s going on in your object storage.
  • sneller-hl (port 9000) is the actual endpoint for the S3 API and will be used by Sneller to communicate with the Minio object storage.
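
You can verify that both services have been created (the names below match the tenant name used above):

kubectl get svc sneller-console sneller-hl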

This is all that is needed to install Minio in your cluster.

Use AWS S3 or Google Cloud Storage

If you prefer to use AWS S3 or GCS instead of Minio, then make sure you set and export the following environment variables:

  • AWS_REGION should point to the region where your bucket is located.
  • AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY should hold the AWS access key ID and AWS secret access key.
  • S3_ENDPOINT should point to the S3 endpoint. Make sure you are using the endpoint of the region where your bucket lives, otherwise the bucket can’t be found.

Although most applications can work with AWS profiles too, the scripts in these examples assume that these variables are set explicitly, and you may run into issues if you don’t set them properly.
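
For example, when using an existing AWS S3 bucket in us-east-2, the exports might look like the following (the key values are placeholders that you need to replace with your own credentials):

export AWS_REGION="us-east-2"
export AWS_ACCESS_KEY_ID="AKIA..."                        # placeholder: your IAM access key ID
export AWS_SECRET_ACCESS_KEY="..."                        # placeholder: your IAM secret access key
export S3_ENDPOINT="https://s3.us-east-2.amazonaws.com"   # regional S3 endpoint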

Create the Sneller bucket

Now we’ll run a pod that creates a bucket in our storage:

export SNELLER_BUCKET=s3://sneller-test-bucket
kubectl run aws -i --tty --image=amazon/aws-cli \
  --restart=Never --rm \
  --env="AWS_REGION=$AWS_REGION" \
  --env="AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID" \
  --env="AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
  -- --endpoint="$S3_ENDPOINT" s3 mb $SNELLER_BUCKET

In this example this bucket will hold both the source data and the ingested data. In production scenarios it’s best to separate the source and ingestion buckets and apply different IAM policies to restrict access.
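
As an illustration of that idea (for AWS S3; this is a hypothetical sketch that is not applied anywhere in this tutorial), an ingestion role could be limited to reading from the source prefix and reading/writing only under db/:

cat > ingest-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadSourceData",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::sneller-test-bucket",
        "arn:aws:s3:::sneller-test-bucket/source/*"
      ]
    },
    {
      "Sid": "ReadWriteIngestedData",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": ["arn:aws:s3:::sneller-test-bucket/db/*"]
    }
  ]
}
EOF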

Install Sneller in Kubernetes

In part 2 we used a local installation of Sneller, but we will now use the Sneller daemon that distributes the query workload across all nodes in the cluster and provides a REST endpoint.

Installing Sneller requires two different kinds of images. The first image holds sdb, which is used to ingest data. The second image is the actual Sneller daemon that is responsible for executing the queries. Both images can be installed manually in the Kubernetes cluster, but we will use the Helm package manager instead.

The Helm script uses a reasonable set of default values that can be overridden. In this example we’ll stick to the defaults as much as possible. Check the Sneller Kubernetes reference for a detailed overview of all the settings.

Our Helm script needs to know the following information:

  • Endpoint of the object storage (Minio, AWS S3 or GCS).
  • AWS access key ID and secret key of the object storage (either Minio, AWS S3 or GCS).
  • Name of the bucket that will hold the ingested database tables.
  • Name of the database (used to automatically trigger sdb synchronization).

The Helm script will install 3 pods that run the Sneller daemon. It also registers a cronjob that runs sdb every minute to check whether new data has arrived. If so, then sdb will automatically ingest the data.

First we will add the repository for the Sneller Helm charts and install the chart (--devel allows installing pre-releases):

helm repo add sneller https://charts.sneller.ai
helm install \
  --set-string "secrets.s3.values.awsAccessKeyId=$AWS_ACCESS_KEY_ID" \
  --set-string "secrets.s3.values.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY" \
  --set-string "configuration.values.s3EndPoint=$S3_ENDPOINT" \
  --set-string "configuration.values.s3Bucket=$SNELLER_BUCKET" \
  --set-string "sdb.database=tutorial" \
  sneller sneller/sneller --devel
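
If you want to see which defaults can be overridden, you can also dump the chart’s default values with Helm (--devel again allows pre-release charts):

helm show values sneller/sneller --devel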

The Helm script allows specifying a custom Sneller token and index-key, but it will generate random strings if you don’t specify one. These values can be obtained like this¹:

export SNELLER_TOKEN=`kubectl get secret sneller-token -o jsonpath="{.data.snellerToken}" | base64 -d`
export SNELLER_INDEX_KEY=`kubectl get secret sneller-index -o jsonpath="{.data.snellerIndexKey}" | base64 -d`

The Sneller daemon is not exposed outside the cluster. Typically, an ingress resource is used to expose the daemon outside the cluster and to add TLS on top of it; this will be explained in more detail when a complete production-grade cluster is installed using Terraform. For now, run the following command to forward the Sneller daemon to your local machine:

kubectl port-forward service/sneller-snellerd 8000 > /dev/null &
SNELLERD_PID=$!

The port-forwarding runs in the background and can be stopped with kill $SNELLERD_PID when it’s no longer needed; for now we’ll keep it running. With the port-forwarding active, we should be able to access the Sneller daemon:

curl http://localhost:8000

The default installation will use 3 nodes, but it can take a few seconds before all pods have been started correctly.
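
You can watch the pods until they all report Running:

kubectl get pods --watch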

Ingesting data

The default Helm chart installs a cronjob in the cluster that runs an sdb sync every minute.

MINIO users: the aws commands below use http://localhost (port 80) as the endpoint, whereas the S3_ENDPOINT variable set earlier refers to the endpoint as it’s available from within the cluster (http://sneller-hl:9000). The Minio S3 API therefore needs to be reachable on your local machine first.
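
One way to achieve that is a port-forward of the in-cluster service port 9000 to local port 80 (a sketch; binding a privileged port like 80 may require elevated privileges):

kubectl port-forward svc/sneller-hl 80:9000 > /dev/null &
MINIO_PID=$!   # stop the forwarding later with: kill $MINIO_PID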

First, we need to create the definition.json file that is appropriate for this configuration:

cat > definition.json <<EOF
{
  "input": [
    { "pattern": "$SNELLER_BUCKET/source/*.json.gz" }
  ]
}
EOF
aws s3 --endpoint-url http://localhost cp definition.json $SNELLER_BUCKET/db/tutorial/table/

Now, we will download some sample data and upload it to our Sneller source bucket.

wget https://data.gharchive.org/2015-01-01-{15..16}.json.gz
aws s3 --endpoint-url http://localhost cp 2015-01-01-15.json.gz $SNELLER_BUCKET/source/
aws s3 --endpoint-url http://localhost cp 2015-01-01-16.json.gz $SNELLER_BUCKET/source/
aws s3 --endpoint-url http://localhost ls $SNELLER_BUCKET/source/

Because the sdb sync cronjob runs every minute, it may take up to a minute before the data is actually ingested. You can check whether the cronjob has run by looking at its last scheduled time or at the most recent pods:

kubectl get cronjob sneller-sdb   # list the cronjob
kubectl get pods -l app=sdb       # list the past few pods that ran sdb
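
To see what sdb actually did, you can also inspect the logs of the most recent sdb pods (using the same label selector as above):

kubectl logs -l app=sdb --tail=50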

When the data has been ingested properly, db/tutorial/table/ should hold the ingested data in addition to the table definition (definition.json):

aws s3 --endpoint-url http://localhost ls $SNELLER_BUCKET/db/tutorial/table/

Query the engine

Make sure the port forwarding to your Sneller daemon is still intact and then you can start running queries against the engine:

curl -H "Authorization: Bearer $SNELLER_TOKEN" \
     -H 'Accept: application/json' \
     'http://localhost:8000/query?database=tutorial' \
     --data-raw $'SELECT COUNT(*) FROM table'
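
Since the GH Archive records contain an event type field, a slightly more interesting query (assuming that schema) could aggregate the events per type:

curl -H "Authorization: Bearer $SNELLER_TOKEN" \
     -H 'Accept: application/json' \
     'http://localhost:8000/query?database=tutorial' \
     --data-raw $'SELECT type, COUNT(*) FROM table GROUP BY type'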

Now download some more source files and upload them to the S3 bucket:

wget https://data.gharchive.org/2015-01-01-{17..18}.json.gz
aws s3 --endpoint-url http://localhost cp 2015-01-01-17.json.gz $SNELLER_BUCKET/source/
aws s3 --endpoint-url http://localhost cp 2015-01-01-18.json.gz $SNELLER_BUCKET/source/
aws s3 --endpoint-url http://localhost ls $SNELLER_BUCKET/source/

Once the synchronization has run, the previous query should return more records.
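
If you don’t want to wait for the next scheduled run, you can also trigger a one-off sync from the existing cronjob (using the cronjob name shown earlier):

kubectl create job sdb-manual --from=cronjob/sneller-sdb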


  1. Note that the index-key is typically not needed for regular use.