64-bit Integer Division with AVX-512

by Petr Kobalicek | May 30, 2023

Introduction

Integer division is an arithmetic operation that SIMD instruction set extensions do not provide natively. It is not present in any SIMD extension of the X86 architecture - which would make perfect sense if nobody ever needed to divide. Luckily, when you do need integer division, the simplest approach is to convert the integers to floating point and perform a floating point division instead. However, this approach may lead to undesired rounding errors if the dividend cannot be converted to floating point without losing precision.

In this article we present a vectorized solution that divides signed 64-bit integers by taking advantage of AVX-512, paying close attention to minimizing rounding errors.

64-Bit Integer to Float Conversion

In general, integers that use up to 53 bits can be converted without rounding errors, but greater integers may have to be rounded, as the 64-bit floating point format only offers a 53-bit significand. To make it easier to understand, let’s create a table that contains the binary representation of floating point numbers starting at 2^52, each followed by the next decimal value that can be encoded.

64-bit float [HEX]   Decimal Value         As Powers
0x4330000000000000   4503599627370496      2^52
0x4330000000000001   4503599627370497      2^52 + 1
0x4340000000000000   9007199254740992      2^53
0x4340000000000001   9007199254740994      2^53 + 2
0x4350000000000000   18014398509481984     2^54
0x4350000000000001   18014398509481988     2^54 + 4
0x4360000000000000   36028797018963968     2^55
0x4360000000000001   36028797018963976     2^55 + 8
0x4370000000000000   72057594037927936     2^56
0x4370000000000001   72057594037927952     2^56 + 16
0x4380000000000000   144115188075855872    2^57
0x4380000000000001   144115188075855904    2^57 + 32
0x4390000000000000   288230376151711744    2^58
0x4390000000000001   288230376151711808    2^58 + 64
0x43A0000000000000   576460752303423488    2^59
0x43A0000000000001   576460752303423616    2^59 + 128
0x43B0000000000000   1152921504606846976   2^60
0x43B0000000000001   1152921504606847232   2^60 + 256
0x43C0000000000000   2305843009213693952   2^61
0x43C0000000000001   2305843009213694464   2^61 + 512
0x43D0000000000000   4611686018427387904   2^62
0x43D0000000000001   4611686018427388928   2^62 + 1024
0x43E0000000000000   9223372036854775808   2^63
0x43E0000000000001   9223372036854777856   2^63 + 2048

Thus, the first row of the table contains decimal 2^52, whose IEEE 754 floating point representation is 0x4330000000000000. The next decimal, 2^52 + 1, is represented by 0x4330000000000001, so both values can still be distinguished. This is no longer true starting at the third row: the decimal 2^53 + 1 has no exact representation and must be rounded either to 0x4340000000000000 (2^53) or to 0x4340000000000001 (2^53 + 2). Two observations follow from the table. First, the smallest value that cannot be represented as a 64-bit floating point is 2^53 + 1, so rounding errors begin there. Second, the gap between representable values doubles with each power of two, so near 2^63 consecutive representable values are 2048 apart, and a value such as 2^63 + 2047 incurs the largest rounding error.

Next we will discuss a correction strategy to reduce these rounding errors.

The Algorithm

The algorithm that we use in our vectorized query engine is based on taking advantage of floating point division, but it performs two iterations, the second correcting the result of the first. However, instead of performing one division per iteration, we calculate a reciprocal of the divisor and use multiplication in both iterations. This could lead to an “off by 1” error, but it offers better performance as floating point division is much slower than multiplication. The off by 1 error is then corrected at the end.

Before going over the algorithm, it’s important to state that this article focuses on dividing signed 64-bit integers, but it’s easy to adapt it to unsigned 64-bit integers: the algorithm only works with absolute values internally (as unsigned 64-bit integers) and assigns the sign afterwards.

Before going to AVX-512 assembly, let’s use a scalar C pseudo code:

int64_t divide(int64_t dividend, int64_t divisor) {
  // Calculate absolute values of inputs.
  uint64_t a = abs(dividend);
  uint64_t b = abs(divisor);

  // Convert to double precision and calculate a reciprocal.
  double a_f64 = round_down(double(a));
  double b_f64 = round_up(double(b));
  double b_rcp = round_down(1.0 / b_f64);

  // Compute the first iteration: multiply (rounding down), then truncate to integer.
  uint64_t first = uint64_t(round_down(a_f64 * b_rcp));

  // Compute difference between 'a' and the first iteration.
  a = a - (first * b);
  a_f64 = round_down(double(a));

  // Compute the second iteration: multiply (rounding down), then truncate to integer.
  uint64_t second = uint64_t(round_down(a_f64 * b_rcp));

  // Compute the resulting value after 2nd iteration,
  // which could be less by 1 than the actual result.
  uint64_t result = first + second;

  // Compute the unsigned division remainder, which is needed so we
  // can calculate the final value. Due to rounding, this remainder
  // could be equal to or greater than 'b' - if it is, our result
  // needs to be corrected.
  a = a - (second * b);

  // The final correction - could be rewritten as `result += (a >= b)`.
  if (a >= b)
    result++;

  // Negate the result if the signs of the inputs differ.
  if ((dividend ^ divisor) & (uint64_t(1) << 63))
    result = 0 - result;

  // We are done...
  return int64_t(result);
}

That’s a lot of code! In AVX-512 assembly it takes around 25 instructions to precisely divide eight signed 64-bit integers. If we only wanted to divide unsigned integers, it would be a bit shorter as we would not need to handle signs. Additionally, it would be possible to process more lanes (like 16, 24, or 32) at the cost of duplicating the code for each extra 8 lanes.

AVX-512 Implementation

The final and branchless AVX-512 implementation that processes eight signed 64-bit integers in parallel could look like this:

// Implements a C function of the following prototype:
//
//   void div8xi64_avx512(int64_t* out, int64_t* dividend, int64_t* divisor)
//
// Using a standard AMD64 Calling Convention:
//
//   out      <- rdi
//   dividend <- rsi
//   divisor  <- rdx

div8xi64_avx512:
  // Calculate absolute values of inputs and prepare some constants.
  vbroadcastsd zmm3, [div8xi64_consts] // zmm3 <- constant f64(1.0)
  vmovdqu64 zmm0, [rsi]                // zmm0 <- dividend
  vmovdqu64 zmm1, [rdx]                // zmm1 <- divisor
  vpxorq zmm2, zmm0, zmm1              // zmm2 <- dividend ^ divisor
  vpabsq zmm4, zmm0                    // zmm4 <- abs(dividend)
  vpabsq zmm5, zmm1                    // zmm5 <- abs(divisor)
  vpxorq zmm6, zmm6, zmm6              // zmm6 <- constant i64(0)
  vpternlogq zmm7, zmm6, zmm6, 255     // zmm7 <- constant i64(-1) (all bits set to one)
  vpmovq2m k2, zmm2                    // k2 <- mask of lanes that must be negative

  vcvtuqq2pd zmm0, zmm4, {rd-sae}      // zmm0 <- double(abs(dividend)), rounded down
  vcvtuqq2pd zmm1, zmm5, {ru-sae}      // zmm1 <- double(abs(divisor)), rounded up
  vdivpd zmm1, zmm3, zmm1, {rd-sae}    // zmm1 <- reciprocal of divisor

  vmulpd zmm0, zmm0, zmm1, {rd-sae}    // zmm0 <- first (1st iteration result as double)
  vcvtpd2uqq zmm3, zmm0, {rd-sae}      // zmm3 <- first (1st iteration result as uint64_t)

  vpmullq zmm2, zmm3, zmm5             // zmm2 <- first * divisor
  vpsubq zmm4, zmm4, zmm2              // zmm4 <- dividend - first * divisor
  vcvtuqq2pd zmm0, zmm4, {rd-sae}      // zmm0 <- double(dividend - first * divisor)

  vmulpd zmm0, zmm0, zmm1, {rd-sae}    // zmm0 <- second (2nd iteration result as double)
  vcvtpd2uqq zmm2, zmm0, {rd-sae}      // zmm2 <- second (2nd iteration result as uint64_t)
  vpaddq zmm3, zmm3, zmm2              // zmm3 <- first + second

  vpmullq zmm2, zmm2, zmm5             // zmm2 <- second * divisor
  vpsubq zmm4, zmm4, zmm2              // zmm4 <- dividend - (first + second) * divisor

  vpcmpuq k1, zmm4, zmm5, 5            // k1 <- lanes that are off by 1 and need correction
  vpsubq zmm3 {k1}, zmm3, zmm7         // zmm3 <- first + second + off_by_1_correction
  vpsubq zmm3 {k2}, zmm6, zmm3         // zmm3 <- flip signs of lanes that must be negative
  vmovdqu64 [rdi], zmm3                // [rdi] <- store results to [rdi]
  ret

align 8
div8xi64_consts:
  dq 0x3FF0000000000000                // constant f64(1.0)

Of course this function could be rewritten to process more lanes at the cost of duplicating the code. At Sneller, we actually process 16 lanes in parallel to hide the latencies of some instructions, so our own code is double the size. However, for the purposes of a blog post we prefer readability.

The instructions with the highest latency in our case are vdivpd, which is used only once, and vpmullq, which unfortunately has a latency greater than 10 clock cycles on Intel hardware. On AMD Zen4 hardware it has the same latency as other multiplication instructions, so it’s not a bottleneck there.

Performance Comparison

Although performance was not the primary motivation behind this implementation - we simply needed a vectorized division in our SQL engine - we have benchmarked our AVX-512 implementation and compared it to a simple scalar version implemented in C++. The scalar version just uses the (a / b) expression and lets the C++ compiler optimize the code; the generated assembly uses the X86 idiv instruction.

Signed 64-Bit Integer Division Performance

The chart above shows the time to divide 1 billion signed 64-bit integers in C/C++ compared to our AVX-512 implementation on a Zen4 machine. For benchmarking purposes we have also added a 16-lane version to see whether interleaved code that divides 16 integers at once has any effect. The scalar C/C++ code requires 1275 ms to divide 1 billion integers, whereas the AVX-512 implementation that divides eight integers in parallel requires only 370 ms, which is a nice 3.5x speedup. The AVX-512 implementation that divides 16 integers in parallel requires 358 ms, which is not a significant improvement over the baseline AVX-512 implementation.

Conclusion

Although none of the AVX-512 instruction set extensions provides an instruction that performs a division of 64-bit integers, we were able to implement it, and our implementation is around 3.5 times faster than a scalar version that uses the idiv instruction. Our implementation uses advanced AVX-512 features such as embedded rounding control and lane masking. It divides signed 64-bit integers, but since it uses unsigned integers internally, it would be trivial to change it to divide unsigned 64-bit integers by removing the code that handles signs. It’s also trivial to change the code to compute a modulo instead, or to return both the quotient and the remainder.

The code provided is mostly useful in cases in which the domain of input values is not known in advance. If the divisor is known, it would make sense to precalculate the reciprocal, and if the divisor is equal to or greater than 2048, to only use a single iteration - in that case only the off by 1 correction is necessary, as the reciprocal is rounded down.

We would like to hear about possible improvements that can be applied to this code. We consider our implementation to be pretty straightforward and easy to reason about. For example, the rounding control makes sure that the 1st and 2nd iterations produce results that are never greater than the final result, so the code that performs the final correction only has to check a single direction.

Try Sneller for Free

You can try Sneller right now on your own data for free through our playground.

If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!