Installing "The Hard Way" - Adding object storage

Introduction

In the previous part of this walkthrough we learned how to use Sneller on a single computer with local storage. This part shows how to store the data in object storage (e.g. AWS S3) instead, to ensure that the data is stored reliably and is highly available.

IMPORTANT: If you follow this walkthrough, make sure to use exactly the same environment variable names. Both the AWS CLI and sdb use these variables to provide proper defaults.

Prerequisites

In this example we will use Minio to mimic AWS S3, so everything can still run on a single instance. If you prefer to use AWS S3 directly, skip the Minio installation and use your AWS credentials to access your S3 bucket. Google Cloud Storage is also supported in S3 interoperability mode.
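
If you would rather try Google Cloud Storage, the rough idea (a sketch only, not covered further in this walkthrough) is to point S3_ENDPOINT at the GCS XML API and put HMAC interoperability credentials in the AWS_* variables:

# Sketch: Google Cloud Storage via its S3-compatible XML API.
# The HMAC access key and secret are created under "Interoperability" in the GCS settings.
export S3_ENDPOINT=https://storage.googleapis.com
export AWS_ACCESS_KEY_ID=<your GCS HMAC access key>
export AWS_SECRET_ACCESS_KEY=<your GCS HMAC secret>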

Install AWS CLI

The AWS CLI is used to create the S3 bucket and upload the data. Technically it’s possible to invoke these commands via Docker, but it’s much more convenient to have the AWS CLI available on your system. Refer to the AWS documentation on how to install it.
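
Once installed, a quick version check confirms that the CLI is available on your path:

aws --version   # prints the installed AWS CLI version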

Install Minio

In this walkthrough we will deploy Minio using Docker, so make sure it is installed on your computer. Then run the following commands to generate random credentials and start Minio.

export AWS_REGION="us-east-1"
# generate random 20-character credentials for Minio
export AWS_ACCESS_KEY_ID=$(cat /dev/urandom | tr -dc '[:alpha:]' | fold -w 20 | head -n 1)
export AWS_SECRET_ACCESS_KEY=$(cat /dev/urandom | tr -dc '[:alpha:]' | fold -w 20 | head -n 1)
export S3_ENDPOINT=http://localhost:9000
docker pull quay.io/minio/minio   # pull latest version
# start Minio; the root credentials double as the S3 access key and secret
docker run -d \
    -e MINIO_ROOT_USER=$AWS_ACCESS_KEY_ID \
    -e MINIO_ROOT_PASSWORD=$AWS_SECRET_ACCESS_KEY \
    -e MINIO_REGION=$AWS_REGION \
    -p 9000:9000 -p 9001:9001 \
    quay.io/minio/minio server /data --console-address ":9001"
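
Minio should now be listening on port 9000. You can verify that the container is up using docker ps, or by querying Minio's health endpoint (assuming curl is available):

docker ps --filter ancestor=quay.io/minio/minio   # the container should be listed as running
curl -I $S3_ENDPOINT/minio/health/live            # should return HTTP 200 when Minio is up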

Let’s go

First we need to create a bucket that holds all the Sneller data. In this example we will store both the source data and the ingested data in the same bucket, but you are free to store the source data in another bucket.

export SNELLER_BUCKET=s3://sneller-test
aws s3 --endpoint-url $S3_ENDPOINT mb $SNELLER_BUCKET
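
To verify that the bucket exists, list the buckets on the Minio endpoint:

aws s3 --endpoint-url $S3_ENDPOINT ls   # should list the sneller-test bucket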

We’ll use the same data as in the first walkthrough, so first download two hours of the GitHub archive data and upload them to the source/ folder of the S3 bucket:

wget https://data.gharchive.org/2015-01-01-{15..16}.json.gz
aws s3 --endpoint-url $S3_ENDPOINT cp 2015-01-01-15.json.gz $SNELLER_BUCKET/source/
aws s3 --endpoint-url $S3_ENDPOINT cp 2015-01-01-16.json.gz $SNELLER_BUCKET/source/
aws s3 --endpoint-url $S3_ENDPOINT ls $SNELLER_BUCKET/source/
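
Each downloaded file contains newline-delimited JSON with one GitHub event per line. If you want a quick look at the raw data before it is ingested:

gunzip -c 2015-01-01-15.json.gz | head -n 1   # show the first event of the first hour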

Now we need to create the definition.json file that is appropriate for this configuration:

cat > definition.json <<EOF
{
  "input": [
    { "pattern": "$SNELLER_BUCKET/source/*.json.gz" }
  ]
}
EOF
aws s3 --endpoint-url $S3_ENDPOINT cp definition.json $SNELLER_BUCKET/db/tutorial/table/
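
To double-check that the table definition ended up in the right place, you can stream it back from the bucket to stdout:

aws s3 --endpoint-url $S3_ENDPOINT cp $SNELLER_BUCKET/db/tutorial/table/definition.json -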

Sneller maintains an index of all files that have been ingested. This index contains hashes of the ingested data to ensure integrity and is protected with an index key, so we need to generate a 256-bit key and store it as a base-64 encoded string in the SNELLER_INDEX_KEY environment variable:

export SNELLER_INDEX_KEY=$(dd if=/dev/urandom bs=32 count=1 2>/dev/null | base64)
echo "Using index-key: $SNELLER_INDEX_KEY"

Now everything is set up to ingest the data. Note that sdb uses the environment variables as defaults: the S3_ENDPOINT and SNELLER_BUCKET variables tell it where to find the table definition, and the AWS_xxx variables give it access to the object storage.

sdb sync tutorial table

You can check the ingested data by running the following command:

aws s3 --endpoint-url $S3_ENDPOINT ls $SNELLER_BUCKET/db/tutorial/table/

As you can see, an index file has been created, along with a packed file that holds the ingested data.

All data has been ingested, so you can now start to run queries on the data in object storage:

sdb query -fmt json "SELECT COUNT(*) FROM tutorial.table"
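
Any other query works the same way. For example (a sketch that assumes the usual GitHub event schema, where every event has a top-level type field), you can count the events per type:

sdb query -fmt json "SELECT type, COUNT(*) FROM tutorial.table GROUP BY type"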

When new data arrives, you can run sdb sync tutorial table again to ingest the new data.
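
For example, to add the next hour of the archive, repeat the same pattern (the file name below is simply the next hour in the series):

wget https://data.gharchive.org/2015-01-01-17.json.gz
aws s3 --endpoint-url $S3_ENDPOINT cp 2015-01-01-17.json.gz $SNELLER_BUCKET/source/
sdb sync tutorial table   # picks up any new file that matches the input pattern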

Next…

In this walkthrough you learned how to store the data in S3 object storage instead of local storage. Although this increases the availability of your data, the query engine itself is still limited to a single node. In part 3 we will show how to run the Sneller daemon and make the query engine scalable and highly available.