Building a SQL VM in AVX-512 Assembly

by Phil Hofer | March 22, 2023

One of Sneller’s novel features is a bytecode-based virtual machine written almost entirely in AVX-512 assembly. While Sneller is far from the first project to incorporate SIMD acceleration into a query engine, our interpreter is unusual in that it is implemented almost entirely in assembly. Additionally, our interpreter operates on flexibly-typed rows rather than on strictly-typed columnar data.

Why?

One of the UX goals we have for Sneller is to provide consistent, predictable performance across a wide range of possible queries. Since Sneller Cloud’s pricing is based on the number of bytes scanned and not the total number of CPU cycles consumed by the query, we’d like the ratio between the number of CPU cycles consumed and the number of bytes scanned to remain roughly constant. For example, if a user adds a regular expression search to a WHERE clause in a query, we’d like that to consume only marginally more CPU time. Sneller’s uniformly fast performance across a large range of queries and use-cases also means our users don’t have to spend nearly as much time optimizing queries.

Additionally, a fast interpreter makes it possible to scale queries down to a reasonable level of resource consumption when the size of the input data is not terribly large. If your interpreter is slow, your query engine will end up getting out-performed by simple shell scripts until your data becomes “big” enough to warrant a distributed query engine. (See “Command Line Tools can be 235x Faster than your Hadoop Cluster” for a motivating example). Our design makes the best possible use of the available hardware resources, regardless of whether that is just a single CPU core or several hundred CPU cores spread over a half dozen machines.

Top-Down View

Queries pass through several different intermediate representations before they reach a representation that is actually executable by our bytecode VM. Broadly speaking, there are two sub-problems involved in executing a distributed SQL query:

  1. Computing the best arrangement of “physical operators” (filtering, extended projection, aggregation, etc.) given the input query, the topology of the available compute resources, and all available sparse indexing metadata
  2. Executing each “physical operator” as efficiently as possible

The interpreter only concerns itself with the latter responsibility; our query planner is responsible for the former. By the time queries reach our interpreter, they have been decomposed into “physical operators” that need to evaluate one or more SQL expressions and pass the results to a subsequent operator.

For example, a query like

SELECT SUM(y)
FROM my_table
WHERE x < 3

would evaluate x < 3 for each input row and discard any rows where that expression doesn’t evaluate to TRUE. Then it would accumulate a running total of the sum of y. (For simplicity, we’re ignoring any optimizations that the query planner might be able to provide based on the sparse indexing metadata that is available for the table.)

Textually, we could describe the pipeline of query operations as

ITERATE my_table -> FILTER (x < 3) -> AGGREGATE SUM(y) -> output

Internally, our query engine uses the Amazon Ion binary format. Each row of data is an ion structure composed of zero or more fields which themselves may be structures or lists (much like JSON records). Evaluating x < 3 means locating the x field in each structure, unboxing it as a number, and then comparing it against the constant 3. (We don’t typically know in advance that x will be a number, or even that x will be present at all, but we’ll see that the interpreter deals with data polymorphism just fine.)
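
Before looking at how this gets compiled, it may help to spell out the per-row semantics in plain Go. The sketch below is purely illustrative: rows are modeled as Go maps rather than Ion, and only a couple of numeric types are handled, but it captures the behavior the physical operators implement.

// Illustrative only: a scalar reference for what the pipeline computes.
// The real engine operates on binary Ion, sixteen rows at a time.
func sumYWhereXLessThan3(rows []map[string]interface{}) float64 {
    var total float64
    for _, row := range rows {
        x, ok := row["x"]
        if !ok {
            continue // x is MISSING: the filter does not evaluate to TRUE
        }
        var pass bool
        switch v := x.(type) { // x may be an integer, a float, or neither
        case int64:
            pass = v < 3
        case float64:
            pass = v < 3
        default:
            // non-numeric x: the comparison is undefined, drop the row
        }
        if !pass {
            continue
        }
        if y, ok := row["y"].(float64); ok { // SUM(y); only floats shown for brevity
            total += y
        }
    }
    return total
}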

Expression AST

The WHERE x < 3 clause of a SQL query will eventually end up in the vm.Filter physical operator. Here’s what go doc has to say about creating a vm.Filter:

package vm // import "."

func NewFilter(e expr.Node, rest QuerySink) (*Filter, error)
    NewFilter constructs a Filter from a boolean expression. The returned Filter
    will write rows for which e evaluates to TRUE to rest.

The expr.Node type is the interface satisfied by any AST expression. In our x < 3 example, the concrete value passed as e would be:

&expr.Comparison{
   Op:    expr.Less,
   Left:  expr.Ident("x"),
   Right: expr.Integer(3),
}
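
To make the connection concrete, here is roughly how the x < 3 filter would be constructed from that AST. This is a minimal sketch: rest is just a placeholder for whichever downstream QuerySink (for example, the SUM(y) aggregation) should receive the matching rows.

var rest vm.QuerySink // the downstream operator, e.g. the SUM(y) aggregation

filter, err := vm.NewFilter(&expr.Comparison{
    Op:    expr.Less,
    Left:  expr.Ident("x"),
    Right: expr.Integer(3),
}, rest)
if err != nil {
    // the expression could not be compiled
    return err
}
// rows fed into filter for which x < 3 evaluates to TRUE are forwarded to rest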

SSA IR

The first stage of expression compilation is to convert the input AST into a Static Single Assignment (SSA) intermediate representation that we can more easily optimize. Our SSA instructions generally map 1-to-1 to bytecode VM instructions.

Our SSA representation doesn’t use basic blocks, as they would complicate the intermediate representation, and they would make it harder to convert the SSA IR into something that can be vectorized. Instead, we use “predicated execution”, which means that we pass a one-bit predicate to each operation in order to indicate whether or not its execution should be short-circuited into a no-op. As we will see later on, using predicated execution instead of ordinary control-flow makes it much easier to implement the final bytecode program using vector instructions.

Our x < 3 example above would be decomposed into SSA that looks something like the following:

b0.k0 = init                   # initial row base and lane mask
v1.k1 = findsym       b0 k0 $x # compute x pointer and mask
i2.k2 = cmpv.i64.imm  v1 k1 $3 # compare x and 3; produces -1, 0, or 1
   k3 = cmplt.i64@imm i2 k2 $0 # compare i2.k2 < 0
ret: k3

The SSA value init simply describes the initial state of the VM, with b0 representing the row “base pointer” and k0 representing the initial predicate value (i.e. TRUE). The init SSA operation doesn’t generate any code; it exists only so that other instructions can actually reference the “initial state” (registers, etc.) of the VM as it is entered.

The findsym operation takes a struct base (b0), a symbol (in this case x), and a predicate (k0) and produces a “fat pointer” for the associated struct field (v1) along with a new predicate (k1) indicating whether the field is actually present.

The cmpv.i64.imm operation performs a generic comparison of a boxed value against an immediate integer and returns an integer (-1, 0, or 1) that represents the result of comparing the boxed value against the immediate. Since we’re comparing against a number, we need to handle two cases:

  1. If the value v1 is an integer, compare it against 3.
  2. If the value v1 is a floating-point number, compare it against 3.0

Finally, the cmplt.i64@imm operation computes our return value: a mask of all the lanes for which the comparison result is defined (k2) and less than zero (i.e. ordered-less-than). The value k3 is the return value of this program; if k3 is set, then the expression x < 3 has evaluated to TRUE and the row can be passed to subsequent physical operators.

You will notice that we don’t evaluate x < 3 as TRUE when x is not present or not a number. If x is not present, then k1 will be unset and thus k2 and k3 (the return value) will be as well. Similarly, if x is a string, then k2 will be unset (since the comparison is undefined), and thus so will k3. (Since we are only interested in expressions that evaluate to TRUE in this example, we don’t have to concern ourselves with the details of whether or not the expression evaluated to MISSING or NULL or FALSE; there are other situations where we have to be a bit more careful about how to interpret predicate values.)
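
To make the predicate flow explicit, here is a scalar model of the SSA program for a single lane. The real VM evaluates sixteen lanes at once with mask registers; the boxed type and cmp helper below are simplifications for illustration, not the engine's actual representation.

type boxed struct {
    isInt   bool
    isFloat bool
    i       int64
    f       float64
}

func xLessThan3(row map[string]boxed) bool {
    k0 := true // init: the lane is active

    v1, found := row["x"] // findsym
    k1 := k0 && found     // k1 is unset when x is MISSING

    i2, k2 := 0, false // cmpv.i64.imm: i2 in {-1, 0, 1}; k2 unset when undefined
    switch {
    case k1 && v1.isInt:
        i2, k2 = cmp(float64(v1.i), 3), true // simplified: real code avoids the float conversion
    case k1 && v1.isFloat:
        i2, k2 = cmp(v1.f, 3), true
    }

    k3 := k2 && i2 < 0 // cmplt.i64@imm: defined and less than zero
    return k3          // MISSING or non-numeric x never yields TRUE
}

func cmp(a, b float64) int {
    switch {
    case a < b:
        return -1
    case a > b:
        return 1
    }
    return 0
}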

Bytecode

Once we have built the SSA representation of our AST expression, we have to convert it into a representation that is actually executable by our bytecode VM.

The bytecode VM just executes a linear sequence of instructions, so our first order of business is computing a valid execution ordering of the SSA instructions. A post-order traversal of the SSA program beginning at the return instruction will always produce a valid instruction ordering, and in our example there is only one possible post-order traversal of instructions, so the result is trivial.
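
Here is a sketch of that traversal, with the SSA value shape reduced to the bare minimum; emitting each value after all of its arguments guarantees that every operand is computed before it is used.

type ssaValue struct {
    op   string
    args []*ssaValue
}

// schedule emits a valid linear ordering by walking the value graph
// post-order, starting from the return value.
func schedule(ret *ssaValue) []*ssaValue {
    var order []*ssaValue
    seen := make(map[*ssaValue]bool)
    var visit func(*ssaValue)
    visit = func(v *ssaValue) {
        if seen[v] {
            return
        }
        seen[v] = true
        for _, arg := range v.args {
            visit(arg) // arguments are emitted before their uses
        }
        order = append(order, v)
    }
    visit(ret)
    return order
}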

Once we have determined the instruction ordering, we need to allocate locations on our virtual stack for each value. As bytecode instructions are emitted, we release regions of the virtual stack associated with each bytecode argument that has reached the end of its live range, and then we allocate (or re-use) space for the “return values” of the instruction.

Every bytecode instruction is encoded as a function pointer address (8 bytes), followed by the arguments (stack slot locations and immediates) that the instruction needs to decode. For arguments that occupy stack slots, the relative offset of the slot is encoded as a two-byte unsigned integer. For example, the findsym instruction expects two stack arguments (b0 and k0) plus a 4-byte immediate for a symbol, so in total it expects 8 (2 + 2 + 4) bytes of arguments.
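
As a concrete (and simplified) illustration of that layout, here is how a single findsym instruction could be appended to a bytecode buffer; the exact field order and the handling of return-value slots are glossed over.

import "encoding/binary"

// encodeFindsym appends one instruction: an 8-byte function pointer, two
// 2-byte stack-slot offsets (b0 and k0), and a 4-byte symbol immediate.
func encodeFindsym(prog []byte, fn uint64, bSlot, kSlot uint16, sym uint32) []byte {
    prog = binary.LittleEndian.AppendUint64(prog, fn)    // address of the findsym routine
    prog = binary.LittleEndian.AppendUint16(prog, bSlot) // slot holding the struct base
    prog = binary.LittleEndian.AppendUint16(prog, kSlot) // slot holding the predicate
    prog = binary.LittleEndian.AppendUint32(prog, sym)   // symbol immediate ($x)
    return prog
}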

It’s worth noting that our model of the virtual machine doesn’t know or care about vectors of rows; it only models the set of high-level operations that have to occur for each row, and it uses predicated execution instead of control flow in order to make it easier to vectorize the implementation. In principle it would be possible to port the VM to a different architecture by picking a reasonable vector width for the hardware and translating each of the VM opcode assembly routines one-by-one.

Brief Interlude: AVX-512 and Mask Registers

For us, the most important feature of AVX-512 as compared to AVX2 is the presence of “mask” (or “predicate”) registers. Most AVX-512 instructions accept a mask register operand (k1 through k7; encoding k0 means “no masking”) that causes the lanes corresponding to the zero bits of the mask to remain untouched by the instruction in question.

For example, here’s a masked 32-bit vector addition (vpaddd):

vpaddd %zmm12,%zmm7,%zmm7{%k5}

The instruction above adds the sixteen 32-bit integers in zmm12 to the corresponding 32-bit integers in zmm7 and writes the results into zmm7, but only for the lanes where k5 is set. In other words, this instruction does not modify any lanes in zmm7 where k5 is unset.
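
In scalar terms, the masked addition behaves like the sketch below, with the registers modeled as arrays and the mask as a 16-bit integer.

// maskedAdd mirrors the masked vpaddd: lanes whose mask bit is set are
// updated; every other lane keeps its previous contents.
func maskedAdd(zmm7, zmm12 *[16]int32, k5 uint16) {
    for lane := 0; lane < 16; lane++ {
        if (k5>>lane)&1 != 0 {
            zmm7[lane] += zmm12[lane]
        }
    }
}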

Per-lane predication is important to us because it allows us to emulate diverging control between lanes without using branches. For example, imagine we need to implement the following in AVX-512 assembly:

if (x < 3) {
    x += 2;
} else {
    x -= 1;
}

Instead of using control flow, we can simply compute a mask for x < 3 and then use that mask and its complement for each of the two arms of the if/else statement:

;; assume we have broadcast
;;  1 into %zmm1,
;;  2 into %zmm2,
;;  3 into %zmm3
vpcmpltd %zmm3, %zmm0, %k1        ;; k1 = zmm0 < zmm3, per lane
knotw    %k1, %k2                 ;; k2 = ~k1
vpaddd   %zmm0, %zmm2, %zmm0{%k1} ;; zmm0 += 2 iff k1
vpsubd   %zmm1, %zmm0, %zmm0{%k2} ;; zmm0 -= 1 iff k2

It is technically possible to emulate predicated execution using AVX2 features, but it is considerably more cumbersome than in AVX-512.

Fortunately for us, ARM SVE/SVE2 and the RISC-V Vector Extension both implement predicated execution, so we are not necessarily stuck with only AVX-512 as an implementation target.

Assembly Routines

Many of the bytecode assembly routines consist of only a few machine instructions. For example, cmplt.i64@imm corresponds to the bccmplti64imm bytecode operation, which is defined in the source code with a few macros:

TEXT bccmplti64imm(SB), NOSPLIT|NOFRAME, $0
  BC_CMP_OP_I64_IMM($VPCMP_IMM_LT)
  NEXT_ADVANCE(BC_SLOT_SIZE*3 + 8)

If you disassemble this code from a compiled binary, you end up with the following instruction sequence:

00000000007827e0 <bccmplti64imm>: ; at entry: %rax = bytecode instruction sequence; %r12 = virtual stack
  7827e0:   0f b7 10               movzwl (%rax),%edx           ; load argument slot 0 (return value)
  7827e3:   0f b7 58 02            movzwl 0x2(%rax),%ebx        ; load argument slot 1 (integer argument value)
  7827e7:   62 f2 fd 48 59 a0 04   vpbroadcastq 0x4(%rax),%zmm4 ; broadcast immediate integer argument
  7827ee:   00 00 00
  7827f1:   44 0f b7 40 0c         movzwl 0xc(%rax),%r8d             ; load argument slot 2 (predicate argument)
  7827f6:   c4 81 78 90 0c 04      kmovw  (%r12,%r8,1),%k1           ; load predicate to k1
  7827fc:   c4 e3 f9 30 d1 08      kshiftrw $0x8,%k1,%k2             ; distribute upper lanes to k2
  782802:   62 d1 fe 48 6f 14 1c   vmovdqu64 (%r12,%rbx,1),%zmm2     ; load lower 8 input integers
  782809:   62 d1 fe 48 6f 5c 1c   vmovdqu64 0x40(%r12,%rbx,1),%zmm3 ; load upper 8 input integers
  782810:   01
  782811:   62 f3 ed 49 1f cc 01   vpcmpltq %zmm4,%zmm2,%k1{%k1} ; k1 = lower 8 lane comparison result
  782818:   62 f3 e5 4a 1f d4 01   vpcmpltq %zmm4,%zmm3,%k2{%k2} ; k2 = upper 8 lane comparison result
  78281f:   c5 ed 4b c9            kunpckbw %k1,%k2,%k1          ; k1 = concatenated bits for lower and upper lanes
  782823:   c4 c1 78 91 0c 14      kmovw  %k1,(%r12,%rdx,1)      ; store comp. result in return value stack slot
  782829:   48 83 c0 16            add    $0x16,%rax             ; advance bytecode instruction stream
  78282d:   ff 60 f8               jmp    *-0x8(%rax)            ; jump to next instruction

Folks who have worked on interpreters will recognize this as a classic “direct threaded” interpreter: the instruction stream includes the address of the next instruction to dispatch. In this case, we’re using the hardware register rax as the “program counter” that is adjusted as we execute.

The indirect branch at the end of each bytecode routine is much more predictable than you might imagine, as the number of branch targets per instruction for any particular sequence of bytecode operations tends to be close to 1. (If a bytecode program is ten virtual instructions, but the ten instructions are distinct from one another, then the indirect branch that occurs at the end of each virtual instruction is perfectly predictable; it always branches to the following instruction’s implementation!) In practice the “dispatch overhead” of our interpreter is so low that it is barely measurable; the fact that we are doing sixteen lanes worth of work in each bytecode instruction helps amortize the cost of dynamic dispatch over lots of “real” work.
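
Go cannot express a tail-jump through an address embedded in the instruction stream, but the shape of the dispatch is easy to convey. The sketch below is only an analogue: it uses a function value per decoded instruction in place of the raw pointer stored in the real byte stream.

type vmState struct {
    stack []byte // the virtual stack holding slot values
}

type instruction struct {
    run  func(st *vmState, args []byte) // stands in for the embedded function pointer
    args []byte                         // slot offsets and immediates
}

// exec stands in for the NEXT_ADVANCE jump: finish one routine, then
// immediately dispatch to the next routine named by the instruction stream.
func exec(st *vmState, prog []instruction) {
    for _, in := range prog {
        in.run(st, in.args)
    }
}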

Trampolines

Once we have an executable sequence of bytecode operations, we need a way of entering the VM from portable Go code. Each of the physical operators implements a “trampoline” function (written in assembly) that populates the right VM registers with the initial state for a group of up to sixteen rows, invokes the bytecode by jumping into the first virtual instruction, and then takes the return value from the VM and does something sensible with it. In the case of vm.Filter, the evalfilterbc trampoline uses the final %k1 value returned from the VM to compress away input row delimiters using vpcompressd, and then the remaining rows are passed to a subsequent physical operator.

Trampoline routines can typically accept an arbitrarily large number of input rows. (We usually aim to process several hundred rows of data per call.) Importantly, this means that we spend basically zero time in actual portable Go code once we have compiled the bytecode; the “inner loop” of the VM is implemented entirely in assembly. This is a critical piece of the design, because it means the Go language does not meaningfully constrain the performance of the VM as compared to “faster” alternatives like C/C++ or Rust.
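
Conceptually (and only conceptually; the real trampoline is assembly, and the types and names below are made up for illustration), evalfilterbc does something like the following for each batch of row delimiters.

// Illustrative stand-ins; the real types live in the vm package and differ.
type rowDelim struct{ offset, length uint32 }

type bytecodeProgram struct {
    run func(group []rowDelim) uint16 // returns one result bit per lane
}

// filterBatch runs the compiled bytecode on groups of up to sixteen row
// delimiters and keeps only the rows whose lane bit is set in the mask.
func filterBatch(bc *bytecodeProgram, delims []rowDelim) []rowDelim {
    var kept []rowDelim
    for len(delims) > 0 {
        n := len(delims)
        if n > 16 {
            n = 16
        }
        group := delims[:n]
        delims = delims[n:]

        mask := bc.run(group) // enter the VM
        for lane := 0; lane < n; lane++ {
            if (mask>>lane)&1 != 0 {
                kept = append(kept, group[lane]) // row passed the filter
            }
        }
    }
    return kept
}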

Further Reading

Our WHERE x < 3 example covers a vanishingly small portion of the capabilities of our VM. As of this writing we have more than 250 bytecode operations, spanning everything from hash trie lookup to string matching to great-circle distance calculation. There is also a considerable amount of complexity lurking in our query planner in order to translate more advanced SQL queries into something that fits into our relatively simple VM model. We are planning on covering some of those topics in much more detail in future blog posts.

Try Sneller for Free

You can try Sneller right now on your own data for free through our playground.

If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!