by Henk-Jan Lebbink | May 10, 2023
Branchless Code with AVX-512
Single-instruction multiple data (SIMD) programming involves using a single instruction to operate on wide registers that can simultaneously contain vectors of values. To avoid using conditional branches in code, we programmers can use branchless programming. Instead of relying on if/else statements or loops to implement conditional logic, branchless programming leverages arithmetic and logical operations to achieve the same result.
Predicated Instruction Execution
In AVX-512, predication is a key feature that enables branchless programming. By allowing instructions to be conditionally executed based on a mask register (K register), predication makes it possible to implement conditional logic without branching. This can lead to improved performance by reducing branch mispredictions and pipeline stalls.
The K register is a 64-bit register that contains 1 bit for each element in a vector. If the corresponding bit in the K register is set to 1, the element is considered “active” and the instruction is executed on that element. If the corresponding bit is set to 0, the element is considered “inactive” and the instruction is skipped for that element. Colloquially we would say that a lane is either dead or alive. Generally, the Sneller code uses 16 lanes, this allows the code to load upto 16 data values from different locations with a single instruction, and do all sort of computations with these 16 lanes. Some lanes can become inactive while computing and comparing stuff, and once in a while the code optionally compresses threads with many dead lanes and fills up available lane slots.
An example can help clarify things. We will use the assembler used by the Go programming language, which has an unconventional syntax derived from the Plan 9 operating system. For example,
ZMM8 is referred to as
RSI is denoted by
SI. In addition, the information flows from left to right, so the destination is generally the last register (similar to the at&t syntax).
Below are some code snippets and an explanation of how they work. We will be creating code that converts lowercase ASCII values to uppercase, which is a function that we use frequently in our codebase. First, we load 16 values of 32 bits each into data register
Z8. Each 32-bit value contains 4 ASCII bytes for all 16 lanes. Next, we check whether each of the 64 ASCII bytes is a lowercase character. If a byte is lowercase, we convert it to uppercase and store the result in
KMOVW K1, K3 ; copy eligible lanes ;K3=scratch; K1=lane_active VPGATHERDD (SI)(Z2*1),K3, Z8 ; gather data ;Z8=data; K3=scratch; SI=data_ptr; Z2=data_off
Z2 contains the 16x 32-bit offsets of the data that we want to process, and
SI is a 64-bit pointer to this data, we can use the
VPGATHERDD instruction to load the content at address
SI + the value in
Z2 for each active lane in
K3. The results of the operation are stored in
K3 is overwritten with error codes. To avoid losing the contents of
K1, we made a copy of it with the
KMOVW instruction. If all 16 lanes in
K1 are active, we will have 64 freshly loaded ASCII values in
Z8. However, if a lane is inactive,
Z8 will contain stale content for that lane, but that should not matter for the following code.
VPCMPB $5, Z16, Z8, K3 ; K3 := (data>=char_a) ;K3=tmp_mask; Z8=data; Z16=char_a VPCMPB $2, Z17, Z8, K3, K3 ; K3 &= (data<=char_z) ;K3=tmp_mask; Z8=data; Z17=char_z
Let’s assume that
Z16 contains 64 ASCII ‘a’ values, and
Z17 contains 64 ASCII ‘z’ values. The first
VPCMPB instruction compares each byte of the data with the character ‘a’, and stores the results in the 64-bit mask register
K3. Note that loading the data uses 16 lanes, while the byte comparison interprets the data as having 64 lanes. The second
VPCMPB instruction computes the conjunction of the first comparison with whether the data is smaller or equal to character ‘z’. The resulting mask register
K3 has a value of 1 for every lane with a lowercase ASCII character, and 0 for all other values.
Next, we want to set the 6th bit to 0 for all active lanes in
K3. One way to achieve this would be to use a hypothetical
VPANDB instruction, which would perform a predicated AND-operation with byte granularity. Specifically, every active lane would be updated with the result of AND-masking with
; HYPOTHETICAL VPANDB! VPANDB Z15, Z8, K3, Z8 ; Z8 &= Z15 ;Z8=data; Z15=c_b11011111
We will use a
VPERNLOGD instruction instead of
VPANDB since the latter doesn’t exist. To broadcast the 64 bits in
K3 to the 64 bytes in
Z26, we will use a
VPMOVM2B instruction. This means that each set bit in
K3 will become a
0xFF byte in the corresponding lane, and a cleared bit will become a
0x00 byte. The next step may seem complex, but it simply involves applying boolean logic.
VPMOVM2B K3, Z26 ; mask with selected chars ;Z26=scratch_Z26; K3=tmp_mask VPTERNLOGD $0b01001100,Z15, Z8, Z26 ; dark art ;Z26=scratch_Z26; Z8=data; Z15=c_0b00100000
During code review, I am often asked how I arrived at these magic numbers
0b01001100. I use a truth table, which describes when to set a bit to 0. It’s worth noting that for bitwise operations, each bit of an 8-bit byte can be considered a lane by itself. In this particular case, we aim to solve the problem of converting ASCII characters to uppercase using 512x 1-bit lanes. The
VPERNLOGD instruction takes
Z26 as input, and replaces those values with the computed results. The third column
Z26 is the input column, the right most column
Z26 is the output column.
|1||1||1||0||(change from lower to upper)|
Consider the first row of the truth table. If the bit in
Z15 is 0, then the corresponding lane does not correspond to the 6th bit in the byte. In this case, we want to leave the bit in data
Z8 unchanged, which means that the output column
Z26 will simply equal the bit from data
Z8. The same is true for the 2nd, 3rd, and 4th rows of the truth table.
In the 5th and 6th rows, the bit in
Z15 is set, which means we may want to clear the corresponding data bit. However, if the data bit is already cleared, the result will also be 0.
In the 7th row, we would normally clear the bit (since
Z15 is 0) if the corresponding data bit is set (
Z8 is 1). However, if the lane does not correspond to a lowercase ASCII character (since
Z26 is 0), we will leave the data bit unchanged (output
Z26 is 1).
Finally, in the last row, all three conditions hold: the lane corresponds to the 6th bit (
Z15 is 1), the corresponding data bit is set (
Z8 is 1), and the lane corresponds to a lowercase ASCII character (
Z26 is 1). In this case, the result is a bit set to 0.
The output column
0b01001100 holds the 8-bit immediate value that is required for the specific binary function being implemented.
The six lines of assembly code shown above enable the loading of 64 ASCII values from 16 lanes and their subsequent conversion to upper case. Although further processing is needed to make use of this transformed data, such as performing a case-insensitive string comparison, it is noteworthy that this was achieved without the need for loops or conditional branching. However, it should be noted that the most time-consuming aspect of this process is typically getting the data into the 16 lanes, which in the Sneller code involves a
VPGATHERDD operation with a latency of around 30 cycles. Once the data is properly arranged within the vector registers, predicated computation can quickly make up for the initial investment of gathering the data. We at Sneller achieve multi GB/s/core throughput with complex string functions such as regular expressions and Unicode aware case-insensitive searching.
Try Sneller for Free
You can try Sneller right now on your own data for free through our playground, or you can sign up for Sneller Cloud, which gives you access to automatic ingest from your S3 buckets and hundreds of CPU cores on which to run your queries.
If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!