Partitions

Content on this page

Introduction

Partitioning can be configured for a table to improve data locality, provide data isolation, and reduce the number of bytes that need to be scanned to satisfy a query. To make use of this feature, input data must already be separated into multiple input files; all the rows in a particular input file are ingested into exactly one partition.

Partitioning works by capturing substrings of the input object path based on the configured input file pattern and inserting them as top-level fields in the emitted rows. The generated fields can be referenced in a query on those rows as if they were fields in the input data. A top-level field in the input data with the same name as the partition field will be overwritten by the inserted field. To avoid this, the field name chosen for the partition field should be unique.

Multiple partition fields can be configured for a table. Each distinct tuple of partition field values forms a different partition. Data in different partitions will always end up in separate output objects, thus providing data separation guarantees. Queries that include a condition that explicitly limits the partitions that need to be scanned will only scan the data associated with those partitions, reducing the number of bytes that need to be scanned.

Partitions are defined by setting the partitions field in the table definition. The partitions field is a list of JSON objects that each define one partition field. The following table definition defines a single partition field called region which references one segment of the input object path.

{
  "input": [
    {
      "pattern": "s3://example-bucket/logs/{region}/*.json.zst"
    }
  ],
  "partitions": [
    {
      "field": "region"
    }
  ]
}

If the contents of example-bucket were as follows:

logs/eu-west-1/access-log.json.zst
logs/eu-west-1/error-log.json.zst
logs/us-east-1/access-log.json.zst
logs/us-east-1/error-log.json.zst
logs/us-west-2/access-log.json.zst
logs/us-west-2/error-log.json.zst

…then once the logs table is fully ingested it will contain three partitions (eu-west-1, us-east-1, us-west-2) separated into distinct output files for each partition. A query specifying a partition by name, for example:

SELECT COUNT(*) FROM logs WHERE region = 'us-west-2'

…will only end up scanning output files that are part of the us-west-2 partition.

Partitioning on dates

Sneller automatically indexes timestamp fields found while ingesting input files, therefore partitioning input data on dates found in the input object path is not recommended when input data already contains a timestamp field.

In cases where input data does not contain a timestamp field and the date can only be determined from the input object path, a timestamp field can be derived from date components in an object path by defining a partition of type date in a table definition as in the following example.

{
  "input": [
    { "pattern": "s3://example-bucket/logs/{yyyy}/{mm}/{dd}/*.json.zst" }
  ],
  "partitions": [
    { "field": "date", "type": "date", "value": "$yyyy-$mm-$dd" }
  ]
}