Table definition

Introduction

This section describes how to create a definition.json file to define a database table and the schema of the JSON object inside of the definition file. The definition.json file should be placed in an AWS S3 bucket with an object key conforming to the following pattern, where <db-name> is the name of the database and <table-name> is the name of the table.

db/<db-name>/<table-name>/definition.json

Contents and schema

The contents of the definition.json file should be a single JSON object with the following schema.

  • inputs (list, required) is the list of inputs that define the location and format of files that are used to build the table. Each item in the list is an object with the following schema:
    • pattern (string, required) is a URI glob pattern that specifies which files are fed into the table. Example: s3://bucket/dir/*.json, s3://bucket/{tenant}/{region}/*.json.gz
    • format (string, optional) is the format of the files in pattern. If this is blank or omitted, then the format will be inferred from the extension. Examples: json.zst, json.gz, json, csv, tsv
    • hints (list or object, optional) are used to provide hints about the input data to the parser. These hints may be used to perform type-based coercion of certain fields in the input data, and may additionally eliminate some of the data as it is parsed. Hints data is format-specific. See the relevant section for details.
  • partitions (list, optional) specifies synthetic fields that are generated from components of the input URI and used to partition table data. The partition definitions refer to named components of the input pattern; for example, tenant and region in s3://bucket/{tenant}/{region}/*.json.gz. Each item in this list is an object that has the following schema:
    • field (string, required) is the name of the partition field. If this field conflicts with a field in the input data, the partition field will override it.
    • type (string, optional) is the type of the partition field. Possible values for this field are string, int, date, or timestamp. If this field is blank or omitted, this defaults to string.
    • value (string, optional) is a template string that is used to produce the value for the partition field. The template may reference parts of the input URI specified in the input pattern by including a substring like $name or ${name}. A literal $ can be inserted into the output by using $$ in the template. If this field is blank or omitted, the field name is used to determine the input URI part that will be used to determine the value. Examples: ${tenant}_${region}, $yyyy-$mm-$dd
  • retention_policy (object, optional) is the expiration policy for table data. Data older than the expiration window will be periodically purged from the backing store during table updates. This object has the following schema:
    • field (string, required) is the path expression for the field used to determine the age of a record for the purpose of the data retention policy. Currently only timestamp fields are supported.
    • valid_for (string, required) is the validity window relative to the current time. This is a string with a format like <n>y<n>m<n>d where <n> is a number and any component can be omitted. Examples: 6m, 1000d, 1y6m15d
  • beta_features (list, optional) is a list of feature flags that can be used to turn on features for beta testing. Feature flags may be removed as features are rolled in general production. Unknown features are silently ignored.
  • skip_backfill (boolean, optional) if set to true, will cause this table to skip scanning the source bucket(s) for matching objects when the first objects are inserted into the table.

Here is a minimal example of a definition.json file used to create a table from a set of JSON input files in an AWS S3 bucket called example-bucket sharing a common prefix logs/.

{
  "input": [
    {
      "pattern": "s3://example-bucket/logs/*.json",
      "format": "json"
    }
  ]
}

Here is a more complex example using logs produced by AWS VPC Flow Logs as input data. This example shows how the hints field can be used to configure the parser for comma-separated and tab-separated tabular data. It also defines two levels of partitioning (by region and date) using components of the input object path, and defines a retention policy of 1 year and 6 months based on the end field in the input data.

{
  "input": [
    {
      "pattern": "s3://logs-bucket/vpcflowlogs/AWSLogs/*/vpcflowlogs/{region}/{yyyy}/{mm}/{dd}/*.log.gz",
      "format": "csv.gz",
      "hints": {
        "skipRecords": 1,
        "separator": " ",
        "missingValues": [ "-" ],
        "fields": [
          { "name": "version", "type": "int" },
          { "name": "account_id", "type": "string" },
          { "name": "interface_id", "type": "string" },
          { "name": "srcaddr", "type": "string" },
          { "name": "dstaddr", "type": "string" },
          { "name": "srcport", "type": "int" },
          { "name": "dstport", "type": "int" },
          { "name": "protocol", "type": "int" },
          { "name": "packets", "type": "int" },
          { "name": "bytes", "type": "int" },
          { "name": "start", "type": "datetime", "format": "unix_seconds" },
          { "name": "end", "type": "datetime", "format": "unix_seconds" },
          { "name": "action", "type": "string" },
          { "name": "log_status", "type": "string" }
        ]
      }
    }
  ],
  "partitions": [
    { "field": "region" },
    { "field": "date", "value": "$yyyy/$mm/$dd" }
  ],
  "retention_policy": {
    "field": "end",
    "valid_for": "1y6m"
  }
}

Format-specific hints

This section provides examples of the hints field that can be provided for different input file types.

Hints for the JSON parser

The following is an example of the value the hints field can take when used to provide type information for json type inputs.

[
  { "path": "path.to.field.a", "hints": "string" },
  { "path": "path.to.field.b", "hints": ["string", "bool"] }
]

The precedence of overlapping rules is determined by the order in which the rules are written. The ?/[?] wildcard can be used to match all keys of the current level. The */[*] wildcard can be used to match all keys of the current level and all following levels. The * wildcard must be the last segment in the path.

The hints field must either be a string or a list of strings. The following values are supported in the hints field.

  • default causes the type of the emitted value to be determined by the JSON type of the corresponding field. This is the default behavior if no type hint is included.
  • string indicates that the property is allowed to be a string value.
  • number indicates that the property is allowed to be either a floating point number or integer value.
  • int indicates that the property is allowed to be an integer value.
  • bool indicates that the property is allowed to be a boolean value.
  • datetime indicates that the property is allowed to be an RFC 3339 timestamp string with an optional nanosecond component. The value will be emitted as a timestamp truncated to microsecond precision.
  • unix_seconds indicates that the property is allowed to be an integer value which is interpreted as the number of seconds since the Unix epoch. The resulting field will be emitted as a timestamp.
  • unix_milli_seconds is as above but for milliseconds.
  • unix_micro_seconds is as above but for microseconds.
  • unix_nano_seconds is as above but for nanoseconds. The emitted value will be truncated to microsecond precision.
  • ignore causes the property to be ignored during parsing.
  • no_index is used to prevent timestamp properties from being automatically included in the index. This does not have any effect on non-timestamp fields.

Hints for the CSV and TSV parsers

The hints object provided to the CSV and TSV parsers adheres to the following schema.

  • skip_records (number, optional) allows skipping the first N records, which can be useful when input files contain headers.
  • separator (string, optional) allows specifying a custom separator. This only applies to CSV input files. If this field is set, it must be a string containing a single character.
  • missing_values (list, optional) is a list of strings which represent missing values. Entries in fields may override this on a per-field basis.
  • fields (list, required) specifies the hint for each field. Each item in the list is an object following this schema:
    • name (string, required) is the name of the field.
    • type (string, optional) is the type of the field. This may be one of the following values:
      • string (default if blank or omitted)
      • number
      • int
      • bool
      • datetime
      • ignore
    • default (string, optional) is the default value if the column is an empty string.
    • format (string, optional) is the timestamp format if the type field is set to datetime. This may be one of the following values:
      • datetime (default)
      • unix_seconds
      • unix_milli_seconds
      • unix_micro_seconds
      • unix_nano_seconds (the emitted field value will be truncated to microsecond precision)
    • allow_empty (boolean, optional) is used to allow empty strings to be ingested. This is only valid for use with the string type. If missing or false, then a field with an empty value won’t be written for the record.
    • no_index if true specifies that the field shouldn’t be indexed. This is only valid for the datetime type.
    • true_values (list, optional) is a list of strings that represent a true boolean value. This is only valid for the bool type.
    • false_values (list, optional) is a list of strings that represent a false boolean value. This is only valid for the bool type.
    • missing_values (list, optional) is list of strings that represent a missing value for this particular column. This will cause any field matching this value to be omitted from the resulting object. If this is provided, it will override missing_values in the top-level hints object.