Key features

Introduction

This section describes some key features that showcase how to operate Sneller more efficiently:

Unlimited data retention

Sneller has been designed exclusively to run on top of object storage like S3, which means that the storage costs are very modest. It also means that it is feasible to keep data around for much longer than in other solutions that default to short retention periods.

With Sneller you can keep months or even years of historical data. The only costs associated with keeping data around are object storage costs.

You will only incur query costs when you actually run queries.

Doing dry runs

A key design criterion for Sneller Cloud was the ability to estimate the cost of a query up front. Cost estimates can be used to prevent expensive queries from being run, and they can also help you understand how the query planner is optimizing your query.

You can do a “dry run” for a query by sending it using an HTTP HEAD request as opposed to a GET. This ensures that only the query planning is performed, without actually executing the query. The HTTP response will contain a header called x-sneller-max-scanned-bytes that specifies the maximum number of bytes that would be scanned in order to execute the query. (In some cases queries may scan less than x-sneller-max-scanned-bytes, for example when the SQL query contains an explicit LIMIT clause.)

Let’s illustrate this with an example using curl and its --head flag to send a HEAD request. Since HTTP HEAD requests cannot contain a body (unlike a POST), we will use URL query parameters instead.

To send the SQL statement, we add it as a URL-encoded value called query via the --data-urlencode option. We also need to provide the name of the database that contains the table (database=demo).

Dry runs are free

Before showing the example, it is important to note that you are not charged for performing dry runs (since the query itself is not executed).

You can easily calculate the maximum price of a query by multiplying x-sneller-max-scanned-bytes by Sneller’s standard pricing of $50/PB.
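To make that concrete, here is a minimal Go sketch (a hypothetical helper, not part of any Sneller SDK) that turns an x-sneller-max-scanned-bytes value into an upper-bound price, assuming a decimal petabyte (1 PB = 10^15 bytes):

// Try: go run max-cost.go
package main

import "fmt"

// maxQueryCostUSD returns the upper bound of a query's cost in USD for a
// given x-sneller-max-scanned-bytes value, at the standard rate of $50/PB.
// Note: this assumes a decimal petabyte (1 PB = 1e15 bytes).
func maxQueryCostUSD(maxScannedBytes int64) float64 {
	const usdPerPB = 50.0
	return float64(maxScannedBytes) / 1e15 * usdPerPB
}

func main() {
	// The full-year query in the example below reports 3,193,248,415,744 bytes,
	// which works out to roughly $0.16 at most.
	fmt.Printf("max cost: $%.2f\n", maxQueryCostUSD(3193248415744))
}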

Dry run example

Let’s start by examining this basic query, which scans the full year 2021 of the GitHub archive data in the playground:

SELECT DISTINCT repo.name FROM gha WHERE repo.name LIKE 'torvalds/%'

curl --head -G 'https://play.sneller.ai/query?database=demo&json' \
--data-urlencode $'query=SELECT DISTINCT repo.name FROM gha WHERE repo.name LIKE \'torvalds/%\''
date: Tue, 02 May 2023 20:55:11 GMT
...
x-sneller-max-scanned-bytes: 3193248415744
x-sneller-query-id: fea2c1ab-5952-4444-a969-0fd4feb5aef3

The returned x-sneller-max-scanned-bytes value is 3,193,248,415,744 bytes (almost 3 TiB, again for the full year of 2021). The request returns almost instantly since it has to do very little work. (You can see for yourself: simply copy-paste the command into your terminal.)

By taking advantage of Sneller’s automatic time indexing, we can limit the amount of data that is scanned. For instance, we can look at just November and December 2021 by adding a time constraint such as created_at >= `2021-11-01T00:00:00.000Z`:

curl --head -G 'https://play.sneller.ai/query?database=demo&json' \
--data-urlencode $'query=SELECT DISTINCT repo.name FROM gha WHERE repo.name LIKE \'torvalds/%\' AND created_at >= `2021-11-01T00:00:00.000Z`'
date: Tue, 02 May 2023 20:58:15 GMT
...
x-sneller-max-scanned-bytes: 540881715200
x-sneller-query-id: 52e2cca4-9d1a-4a1e-a935-e3ce06a48cd5

This reduces the reported x-sneller-max-scanned-bytes to 540,881,715,200 bytes, which amounts to an over 80% reduction in the total bytes to be scanned, or more than a 5x reduction in total cost.

Sometimes we can reduce the number of bytes scanned even further. Since the GitHub archive data is partitioned on the event type, we can limit the amount of data to be scanned by tacking on a type = 'IssueCommentEvent' clause to the query:

curl --head -G 'https://play.sneller.ai/query?database=demo&json' \
--data-urlencode $'query=SELECT DISTINCT repo.name FROM gha WHERE repo.name LIKE \'torvalds/%\' AND created_at >= `2021-11-01T00:00:00.000Z` AND type = \'IssueCommentEvent\''
date: Tue, 02 May 2023 21:03:19 GMT
...
x-sneller-max-scanned-bytes: 66790096896
x-sneller-query-id: 20ac51fe-fc8d-4fe4-acbe-10172cd97c78

Now we only have to scan 66,790,096,896 bytes, which is almost another order-of-magnitude reduction in the number of bytes to be scanned, and a mere 2% of the full table size.

To summarize, here is the full overview:

| query              | max scanned bytes |
| ------------------ | ----------------- |
| initial            | 3,193,248,415,744 |
| ... with date      |   540,881,715,200 |
| ... and with type  |    66,790,096,896 |

Call to action: leverage dry runs

We encourage developers to take advantage of these powerful capabilities because it is a win-win on both ends:

  • a reduction in costs
  • an increase in performance

And since Sneller is a truly serverless and multi-tenant solution, we have no “vested” interest in letting you run expensive queries that consume a lot of resources (and energy!).

A little bit of planning and careful consideration of features like partitioning can result in huge efficiency gains of 10x or more, without much effort.

Doing a dry run in code

Of course you can do a dry run programmatically as well. Here is an example in Go:

// Try: go run dry-run.go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

const (
	endpoint = "https://play.sneller.ai"
	database = "demo"
	table    = "gha"
	sql      = "SELECT DISTINCT repo.name FROM " + table + " WHERE repo.name LIKE 'torvalds/%'"
)

func main() {
	q := make(url.Values)
	q.Add("database", database)
	q.Add("query", sql)
	u, err := url.Parse(endpoint)
	if err != nil {
		log.Fatalln(err)
	}
	u.Path = "/query"
	u.RawQuery = q.Encode()

	// We'll be doing a HEAD with the SQL query passed in via the url
	req, err := http.NewRequest(http.MethodHead, u.String(), nil)
	if err != nil {
		log.Fatalln(err)
	}
	req.Header.Set("Authorization", fmt.Sprintf("Bearer %s", "YOUR_TOKEN")) // not needed for playground

	client := &http.Client{} // create http client
	resp, err := client.Do(req)
	if err != nil {
		log.Fatalln(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)

	if resp.StatusCode < 200 || resp.StatusCode >= 300 { // any errors?
		log.Fatal(fmt.Errorf("error (%s): %s", resp.Status, string(body)))
	}
	fmt.Println("For query        :", sql)
	fmt.Println("Max-Scanned-Bytes:", resp.Header["X-Sneller-Max-Scanned-Bytes"])
}

Here is the output that it will generate:

$ go run dry-run.go
For query        : SELECT DISTINCT repo.name FROM gha WHERE repo.name LIKE 'torvalds/%'
Max-Scanned-Bytes: [3193248415744]

Preventing expensive queries

To prevent (accidental) huge queries from being run, Sneller Cloud has a safety feature that will block, by default, queries scanning over 3 TB (which equates to about $0.17 at the standard pricing of $50/PB).

If, during the initial planning of a query, Sneller detects that it could scan more than the MaxBytesScanned threshold, it will report back an error code and return early.

As you might imagine, it is possible to increase this parameter in Sneller Cloud to allow for larger queries, but it requires an explicit action. (Or you might even disable it altogether by setting it to 0, but this can potentially result in high costs for huge tables.)
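Besides relying on the server-side threshold, you can also add a guard on the client side: perform a dry run first and only execute the query when the reported maximum stays within a budget you choose. Below is a sketch in Go that builds on the earlier dry-run example; the dryRunGuard function and the 1 TB budget are purely illustrative and not part of any Sneller API:

// Try: go run dry-run-guard.go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

// dryRunGuard performs a dry run (HTTP HEAD) for the given SQL query and
// returns an error when the reported x-sneller-max-scanned-bytes exceeds the
// caller-supplied budget. The budget is a client-side choice and is unrelated
// to Sneller Cloud's own MaxBytesScanned setting.
func dryRunGuard(endpoint, database, sql string, budgetBytes int64) error {
	u, err := url.Parse(endpoint)
	if err != nil {
		return err
	}
	u.Path = "/query"
	q := make(url.Values)
	q.Add("database", database)
	q.Add("query", sql)
	u.RawQuery = q.Encode()

	req, err := http.NewRequest(http.MethodHead, u.String(), nil)
	if err != nil {
		return err
	}
	// req.Header.Set("Authorization", "Bearer YOUR_TOKEN") // not needed for the playground

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	scanned, err := strconv.ParseInt(resp.Header.Get("x-sneller-max-scanned-bytes"), 10, 64)
	if err != nil {
		return fmt.Errorf("missing or invalid x-sneller-max-scanned-bytes header: %w", err)
	}
	if scanned > budgetBytes {
		return fmt.Errorf("query would scan up to %d bytes, over the budget of %d bytes", scanned, budgetBytes)
	}
	return nil // within budget: safe to run the query for real
}

func main() {
	sql := "SELECT DISTINCT repo.name FROM gha WHERE repo.name LIKE 'torvalds/%'"
	// Illustrative budget of 1 TB; the full-year query scans ~3.19 TB, so this check fails.
	if err := dryRunGuard("https://play.sneller.ai", "demo", sql, 1_000_000_000_000); err != nil {
		fmt.Println("not running query:", err)
		return
	}
	fmt.Println("within budget, executing query for real...")
}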

Querying old data as if it’s new

Since Sneller caches the data that it loads from object storage in memory (DRAM), there is fundamentally no difference between data that was ingested over the last week and, say, data that was ingested months or even years ago.

What this means in practice is that if you want to query a time span of, say, 7 days, the only parameter that determines the performance (and cost) is the maximum number of bytes scanned for that particular time period.

Let’s illustrate this with an example. Imagine we want to query two time periods that are 6 months apart, such as March 20-27 versus September 20-27 of 2021, based on the GitHub archive dataset.

To start with, for the March 20-27 period, we are getting 159,910,985,728 max scanned bytes:

curl --head -G 'https://play.sneller.ai/query?database=demo&json' \
--data-urlencode $'query=SELECT DISTINCT repo.name FROM gha WHERE repo.name LIKE \'torvalds/%\' AND created_at BETWEEN `2021-03-20T00:00:00.000Z` AND `2021-03-27T00:00:00.000Z`'
HTTP/2 200 
date: Fri, 05 May 2023 23:23:40 GMT
...
x-sneller-max-scanned-bytes: 159910985728

And for the September 20-27 period, we are getting 171,482,021,888 max scanned bytes:

curl --head -G 'https://play.sneller.ai/query?database=demo&json' \
--data-urlencode $'query=SELECT DISTINCT repo.name FROM gha WHERE repo.name LIKE \'torvalds/%\' AND created_at BETWEEN `2021-09-20T00:00:00.000Z` AND `2021-09-27T00:00:00.000Z`'
HTTP/2 200
date: Fri, 05 May 2023 23:24:05 GMT
...
x-sneller-max-scanned-bytes: 171482021888

So there is a slight difference (presumably due to GitHub’s natural growth), but in essence both queries will consume almost the same time and cost (with a slight uplift for the September period due to the increased size).

As you can see, Sneller does not differentiate data based on its “age”, and there is no notion of “cold” data that is much slower to query than recent “hot” data.