Branchless Code With AVX-512

by Henk-Jan Lebbink | May 10, 2023

Branchless Code with AVX-512

Single-instruction multiple data (SIMD) programming involves using a single instruction to operate on wide registers that can simultaneously contain vectors of values. To avoid using conditional branches in code, we programmers can use branchless programming. Instead of relying on if/else statements or loops to implement conditional logic, branchless programming leverages arithmetic and logical operations to achieve the same result.

Predicated Instruction Execution

In AVX-512, predication is a key feature that enables branchless programming. By allowing instructions to be conditionally executed based on a mask register (K register), predication makes it possible to implement conditional logic without branching. This can lead to improved performance by reducing branch mispredictions and pipeline stalls.

The K register is a 64-bit register that contains 1 bit for each element in a vector. If the corresponding bit in the K register is set to 1, the element is considered “active” and the instruction is executed on that element. If the corresponding bit is set to 0, the element is considered “inactive” and the instruction is skipped for that element. Colloquially we would say that a lane is either dead or alive. Generally, the Sneller code uses 16 lanes, this allows the code to load upto 16 data values from different locations with a single instruction, and do all sort of computations with these 16 lanes. Some lanes can become inactive while computing and comparing stuff, and once in a while the code optionally compresses threads with many dead lanes and fills up available lane slots.

An example can help clarify things. We will use the assembler used by the Go programming language, which has an unconventional syntax derived from the Plan 9 operating system. For example, ZMM8 is referred to as Z8, and RSI is denoted by SI. In addition, the information flows from left to right, so the destination is generally the last register (similar to the at&t syntax).

16 Lanes

Below are some code snippets and an explanation of how they work. We will be creating code that converts lowercase ASCII values to uppercase, which is a function that we use frequently in our codebase. First, we load 16 values of 32 bits each into data register Z8. Each 32-bit value contains 4 ASCII bytes for all 16 lanes. Next, we check whether each of the 64 ASCII bytes is a lowercase character. If a byte is lowercase, we convert it to uppercase and store the result in Z26.

KMOVW         K1,  K3                   ; copy eligible lanes       ;K3=scratch; K1=lane_active
VPGATHERDD    (SI)(Z2*1),K3,  Z8        ; gather data               ;Z8=data; K3=scratch; SI=data_ptr; Z2=data_off

Assuming that Z2 contains the 16x 32-bit offsets of the data that we want to process, and SI is a 64-bit pointer to this data, we can use the VPGATHERDD instruction to load the content at address SI + the value in Z2 for each active lane in K3. The results of the operation are stored in Z8, and K3 is overwritten with error codes. To avoid losing the contents of K1, we made a copy of it with the KMOVW instruction. If all 16 lanes in K1 are active, we will have 64 freshly loaded ASCII values in Z8. However, if a lane is inactive, Z8 will contain stale content for that lane, but that should not matter for the following code.

64 Lanes

VPCMPB        $5,  Z16, Z8,  K3         ; K3 := (data>=char_a)      ;K3=tmp_mask; Z8=data; Z16=char_a
VPCMPB        $2,  Z17, Z8,  K3,  K3    ; K3 &= (data<=char_z)      ;K3=tmp_mask; Z8=data; Z17=char_z

Let’s assume that Z16 contains 64 ASCII ‘a’ values, and Z17 contains 64 ASCII ‘z’ values. The first VPCMPB instruction compares each byte of the data with the character ‘a’, and stores the results in the 64-bit mask register K3. Note that loading the data uses 16 lanes, while the byte comparison interprets the data as having 64 lanes. The second VPCMPB instruction computes the conjunction of the first comparison with whether the data is smaller or equal to character ‘z’. The resulting mask register K3 has a value of 1 for every lane with a lowercase ASCII character, and 0 for all other values.

Next, we want to set the 6th bit to 0 for all active lanes in K3. One way to achieve this would be to use a hypothetical VPANDB instruction, which would perform a predicated AND-operation with byte granularity. Specifically, every active lane would be updated with the result of AND-masking with 0b1101'1111.

VPANDB         Z15, Z8,  K3,  Z8         ; Z8 &= Z15                ;Z8=data; Z15=c_b11011111

512 Lanes

We will use a VPERNLOGD instruction instead of VPANDB since the latter doesn’t exist. To broadcast the 64 bits in K3 to the 64 bytes in Z26, we will use a VPMOVM2B instruction. This means that each set bit in K3 will become a 0xFF byte in the corresponding lane, and a cleared bit will become a 0x00 byte. The next step may seem complex, but it simply involves applying boolean logic.

VPMOVM2B      K3,  Z26                  ; mask with selected chars  ;Z26=scratch_Z26; K3=tmp_mask
VPTERNLOGD    $0b01001100,Z15, Z8,  Z26 ; dark art                  ;Z26=scratch_Z26; Z8=data; Z15=c_0b00100000

During code review, I am often asked how I arrived at these magic numbers 0b01001100. I use a truth table, which describes when to set a bit to 0. It’s worth noting that for bitwise operations, each bit of an 8-bit byte can be considered a lane by itself. In this particular case, we aim to solve the problem of converting ASCII characters to uppercase using 512x 1-bit lanes. The VPERNLOGD instruction takes Z26 as input, and replaces those values with the computed results. The third column Z26 is the input column, the right most column Z26 is the output column.

Z15 Z8 Z26 result: Z26
0 0 0 0
0 0 1 0
0 1 0 1
0 1 1 1
1 0 0 0
1 0 1 0
1 1 0 1
1 1 1 0 (change from lower to upper)

Consider the first row of the truth table. If the bit in Z15 is 0, then the corresponding lane does not correspond to the 6th bit in the byte. In this case, we want to leave the bit in data Z8 unchanged, which means that the output column Z26 will simply equal the bit from data Z8. The same is true for the 2nd, 3rd, and 4th rows of the truth table.

In the 5th and 6th rows, the bit in Z15 is set, which means we may want to clear the corresponding data bit. However, if the data bit is already cleared, the result will also be 0.

In the 7th row, we would normally clear the bit (since Z15 is 0) if the corresponding data bit is set (Z8 is 1). However, if the lane does not correspond to a lowercase ASCII character (since Z26 is 0), we will leave the data bit unchanged (output Z26 is 1).

Finally, in the last row, all three conditions hold: the lane corresponds to the 6th bit (Z15 is 1), the corresponding data bit is set (Z8 is 1), and the lane corresponds to a lowercase ASCII character (Z26 is 1). In this case, the result is a bit set to 0.

The output column 0b01001100 holds the 8-bit immediate value that is required for the specific binary function being implemented.


The six lines of assembly code shown above enable the loading of 64 ASCII values from 16 lanes and their subsequent conversion to upper case. Although further processing is needed to make use of this transformed data, such as performing a case-insensitive string comparison, it is noteworthy that this was achieved without the need for loops or conditional branching. However, it should be noted that the most time-consuming aspect of this process is typically getting the data into the 16 lanes, which in the Sneller code involves a VPGATHERDD operation with a latency of around 30 cycles. Once the data is properly arranged within the vector registers, predicated computation can quickly make up for the initial investment of gathering the data. We at Sneller achieve multi GB/s/core throughput with complex string functions such as regular expressions and Unicode aware case-insensitive searching.

Try Sneller for Free

You can try Sneller right now on your own data for free through our playground, or you can sign up for Sneller Cloud, which gives you access to automatic ingest from your S3 buckets and hundreds of CPU cores on which to run your queries.

If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!