by Petr Kobalicek | March 31, 2023

# Introduction

This article explores the possibility of branchlessly converting multiple signed 64-bit integers to strings by taking advantage of AVX-512 extensions. Utilizing SIMD for integer to string conversion is not a new idea; however, most research and implementations focus on improving the performance of converting a single value instead of performing multiple conversions at once. At Sneller, we use AVX-512 to process 16 values in parallel, and thus we would like to describe how we have done it in our query engine. For simplicity, the code presented in this article only converts 8 integers at once.

# The Problem

We need to convert 16 signed 64-bit integers stored in two ZMM registers (8 integers in each) to their string representations. We have to handle all possible values, which means that a single integer would need at most 20 characters (19 characters for digits and one character for a sign in case the number is negative). The output from the conversion is a pair of ZMM registers. One stores lengths of strings, and the other stores the offsets of strings in a pre-allocated buffer. The procedure is free to write the strings anywhere within the buffer.

# Conversion Steps

- Sign Extraction - extracts signs from each input and make it unsigned; signs will be inserted later
- Character Conversion - splits input integers into three separate output registers each holding only a part of the input, and then converts these 3 registers to decimals
- Length Calculation & Sign Insertion - calculates the number of characters of each output string and inserts a sign at the beginning
- Store to Buffer - stores the content of output registers to an output buffer

# Sign Extraction

Before we perform any processing we have to make sure that our inputs are unsigned. So, we have to check which lane stores a negative number and make it absolute. Each lane that was negative must be remembered so we can insert minus signs at the end in `Length Calculation & Sign Insertion`

step.

In assembly it would look like this:

```
vpmovq2m k1, zmm0 ; zmm0 is the input, k1 is a mask of signs
vpabsq zmm0, zmm0 ; zmm0 becomes an absolute value (unsigned)
```

# Character Conversion

AVX-512 provides 512-bit vector registers, but most instructions work with 8-bit, 16-bit, 32-bit, and 64-bit elements. The output of the conversion requires more space than the input 64-bit integers, which means that it has to be split into 3 output registers representing the sequence of characters.

The layout of these 3 registers would be `[CCCC|BBBBBBBB|AAAAAAAA]`

where `CCCC`

represents 4 high characters, `BBBBBBBB`

represents 8 middle characters, and `AAAAAAAA`

represents 8 low characters. Note that the layout describes each lane of each register, so one register holds 8 times 8 lower characters, another 8 times 8 middle characters, etc…

The actual split of each input value can be described as the following equation, where `X`

is the original input value:

```
X == A + B * 100000000 + C * 10000000000000000
```

It’s not as scary as it looks. It has an elegant solution that can leverage FP64 division (which can be turned into multiplication); however, since our inputs are 64-bit, we have to adjust them first. The binary representation of `uint64(10000000000000000)`

is `[00000000|00100011|10000110|11110010|01101111|11000001|00000000|00000000]`

, which has 16 least significant bits as zeros. The actual division can be rewritten like this:

```
uint64_t div_10000000000000000(uint64_t x) {
(x / 10000000000000000) == ((x >> 16) / (10000000000000000 >> 16))
}
```

However, since we need to divide twice to get the ABC parts, we prefer:

```
uint64_t div_1000000000000(uint64_t x) {
(x / 1000000000000) == ((x >> 8) / (1000000000000 >> 8))
}
uint64_t div_10000000000000000(uint64_t x) {
(x / 10000000000000000) == ((x >> 8) / (10000000000000000 >> 8))
}
```

It’s okay that sometimes the input is not losslessly convertible to FP64 after we’ve shifted it by 8 bits, as the first division doesn’t need the extra bits anyway. Shifts have to be performed on integers, but the division can be performed as FP64 division (as there is no integer division in AVX-512). In fact, instead of using FP64 division we can just turn the division into multiplication without losing precision in our case. The full implementation of the split could look like this in C++:

```
void itoa_split_100000000(uint64_t x, uint64_t* aOut, uint64_t* bOut, uint64_t* cOut) {
constexpr uint64_t k1e8shr8 = uint64_t(100000000) >> 8;
constexpr uint64_t k1e16shr8 = uint64_t(10000000000000000) >> 8;
uint64_t x_shr8 = x >> 8;
// Calculate `x / 10000000000000000`
uint64_t c = uint64_t(double(x_shr8) * (1.0 / double(k1e16shr8)));
// Calculate `x % 10000000000000000`
uint64_t ab = x_shr8 - c * k1e16shr8;
// Calculate ab / 100000000
uint64_t b = uint64_t(double(ab) * (1.0 / double(k1e8shr8)));
// Calculate ab % 100000000
uint64_t a = ((ab - (b * k1e8shr8)) << 8) + (x & 0xFF);
*aOut = a;
*bOut = b;
*cOut = c;
}
```

Which can be rewritten to use AVX-512 intrinsics and to operate on 8 lanes:

```
void itoa_split_100000000(__m512i x, __m512i& a, __m512i& b, __m512i& c) {
constexpr uint64_t k1e8shr8 = uint64_t(100000000) >> 8;
constexpr uint64_t k1e16shr8 = uint64_t(10000000000000000) >> 8;
__m512i x_shr8 = _mm512_srli_epi64(x, 8);
c = _mm512_cvtpd_epu64(_mm512_mul_pd(_mm512_cvtepu64_pd(x_shr8), _mm512_set1_pd(1.0 / double(k1e16shr8))));
__m512i ab = _mm512_sub_epi64(x_shr8, _mm512_mullo_epi64(c, _mm512_set1_epi64(int64_t(k1e16shr8))));
b = _mm512_cvtpd_epu64(_mm512_mul_pd(_mm512_cvtepu64_pd(ab), _mm512_set1_pd(1.0 / double(k1e8shr8))));
a = _mm512_or_epi64(
_mm512_slli_epi64(_mm512_sub_epi64(ab, _mm512_mul_epu32(b, _mm512_set1_epi64(int64_t(k1e8shr8)))), 8),
_mm512_and_epi64(x, _mm512_set1_epi64(0xFF)));
}
```

Now each register contains a value in `[0, 99999999]`

range that can be further processed to get a decimal representation of it. See the following steps:

`[uint64_t(x)]`

->`[uint32_t(x % 10000), uint32_t(x / 10000)]`

`[uint32_t(x)]`

->`[uint16_t(x % 100), uint16_t(x / 100)]`

`[uint16_t(x)]`

->`[uint8_t(x % 10), uint8_t(x / 10)]`

We just take a single 64-bit value in each lane and produce two 32-bit values, then four 16-bit values, and finally eight 8-bit values each representing a single decimal digit in `[0, 9]`

range. See the following steps written in C++ and AVX-512:

```
__m512i itoa_split_10000(__m512i x) {
__m512i a = _mm512_srli_epi64(_mm512_mul_epu32(x, _mm512_set1_epi64(3518437209)), 45);
__m512i b = _mm512_sub_epi32(x, _mm512_mul_epu32(a, _mm512_set1_epi64(10000)));
return _mm512_or_epi64(a, _mm512_slli_epi64(b, 32));
}
__m512i itoa_split_100(__m512i x) {
__m512i a = _mm512_srli_epi16(_mm512_mulhi_epu16(x, _mm512_set1_epi32(5243)), 3);
__m512i b = _mm512_sub_epi16(x, _mm512_mullo_epi16(a, _mm512_set1_epi32(100)));
return _mm512_or_epi64(a, _mm512_slli_epi32(b, 16));
}
__m512i itoa_split_10(__m512i x) {
__m512i a = _mm512_mulhi_epu16(x, _mm512_set1_epi16(6554));
__m512i b = _mm512_sub_epi16(x, _mm512_mullo_epi16(a, _mm512_set1_epi16(10)));
return _mm512_or_epi64(a, _mm512_slli_epi16(b, 8));
}
```

# Length Calculation & Sign Insertion

The integers are already in a decimal form, but that’s not enough to present the outputs. Essentially, what we have now looks like this in `[CCCC|BBBBBBBB|AAAAAAAA]`

layout when stored in memory:

`int64(0123)`

->`[0000|00000000|00000123]`

`int64(0123456789)`

->`[0000|00000001|23456789]`

`int64(-123456789123456789)`

->`[0012|34567891|23456789]`

What is missing is a possible minus sign and the length of each string. In scalar code it would be possible to calculate the length with `TZCNT`

instruction, which calculates the number of trailing bits, but in AVX-512 there is only `VPLZCNT[D|Q]`

instruction to count leading bits. `VPLZCNTQ`

actually fits our needs as the content of `A`

, `B`

, and `C`

registers is stored in 64-bit lanes. Additionally, the content of `A`

, `B`

, and `C`

registers has to be byteswapped before using `VPLZCNTQ`

, because of the X86 byte-order - when the content of each register is written in memory, it would actually be swapped, so the lowest significant byte is stored first and becomes the most significant digit or a minus sign.

Finally, to calculate the length we will first set the length of each lane to its maximum (3 * 8 bytes represented in bits), and then reduce the length of each lane based on the result of `VPLZCNTQ`

. The reduction is conditional: if `C`

has non-zero digits, `B`

and `A`

would be masked out for the rest of the reduction. Additionally, signs are inserted during the length calculation, because it’s the best time to do so. To insert signs we just use the same output of `VPLZCNTQ`

as the length calculation and the same masking approach (if we have already inserted a sign in `C`

, we won’t do it in `B`

, etc…). This is actually a nice example of how AVX-512 masks can be used to implement code that would either require branches in a scalar version or explicit blending in a pre-AVX-512 implementation.

The C++ implementation of length calculation and sign insertion could be implemented like this:

```
__m512i itoa_char_count_and_insert_sign_generic(__m512i& a, __m512i& b, __m512i& c) {
__m512i i128_bswap64 = _mm512_setr_epi64(
0x0001020304050607, 0x08090A0B0C0D0E0F,
0x0001020304050607, 0x08090A0B0C0D0E0F,
0x0001020304050607, 0x08090A0B0C0D0E0F,
0x0001020304050607, 0x08090A0B0C0D0E0F);
__m512i i64_0x03 = _mm512_set1_epi64(0x03);
__m512i i64_0x07 = _mm512_set1_epi64(0x07);
__m512i i64_0x08 = _mm512_set1_epi64(0x08);
// Temporarily byteswap the numbers so we can use leading zero count.
__m512i a_alt = _mm512_shuffle_epi8(a, i128_bswap64);
__m512i b_alt = _mm512_shuffle_epi8(b, i128_bswap64);
__m512i c_alt = _mm512_shuffle_epi8(c, i128_bswap64);
// Stringified number must have at least 1 character, so make it nonzero so lzcnt can find it.
a_alt = _mm512_or_epi64(a_alt, i64_0x07);
// Count length in 8-bit quantities (still bits, but having 8-bit granularity).
__m512i a_len = _mm512_andnot_epi64(i64_0x07, _mm512_lzcnt_epi64(a_alt));
__m512i b_len = _mm512_andnot_epi64(i64_0x07, _mm512_lzcnt_epi64(b_alt));
__m512i c_len = _mm512_andnot_epi64(i64_0x07, _mm512_lzcnt_epi64(c_alt));
// Initial number of characters of each string that will be decreased to match the real length.
__m512i lengths = _mm512_set1_epi64(24 * 8);
__mmask8 len_msk;
lengths = _mm512_sub_epi64(lengths, c_len);
c = _mm512_sub_epi8(c, _mm512_sllv_epi64(i64_0x03, _mm512_sub_epi64(c_len, i64_0x08)));
len_msk = _mm512_cmpeq_epi64_mask(lengths, _mm512_set1_epi64(16 * 8));
lengths = _mm512_mask_sub_epi64(lengths, len_msk, lengths, b_len);
b = _mm512_sub_epi8(b, _mm512_maskz_sllv_epi64(len_msk, i64_0x03, _mm512_sub_epi64(b_len, i64_0x08)));
len_msk = _mm512_mask_cmpeq_epi64_mask(len_msk, lengths, _mm512_set1_epi64(8 * 8));
lengths = _mm512_mask_sub_epi64(lengths, len_msk, lengths, a_len);
a = _mm512_sub_epi8(a, _mm512_maskz_sllv_epi64(len_msk, i64_0x03, _mm512_sub_epi64(a_len, i64_0x08)));
return _mm512_srli_epi64(lengths, 3);
}
```

The function above calculates the output string length of each lane and inserts signs into `A`

, `B`

, and `C`

registers. Note that signs are always inserted in each lane; if the signs are actually presented depends on whether the input value was negative or not, which must be handled afterwards. For example the following code can be used to count signs on lanes in which the number was negative by using `signs`

mask:

```
lengths = _mm512_mask_add_epi64(lengths, signs, lengths, _mm512_set1_epi64(1));
```

The sign insertion uses a small trick that is worth describing. The ASCII representation of `'0'`

digit is 48, while ASCII representation of `'-'`

is 45, which is why we subtract 3 to make `'-'`

from `'0'`

.

# Store to Buffer

Storing the stringified integer into the output buffer is the final step in integer to string conversion. However, in our case we don’t have to store strings at exact positions in the output buffer, because in most cases it’s not yet known where the strings will be serialized at the moment of conversion. This actually simplifies the store step a bit as we always reserve a fixed number of bytes for each string and return back offsets where each string starts in the output buffer and its length.

For example in our case the store to buffer code looks like this:

```
__m512i lo_even = _mm512_unpacklo_epi64(b, a);
__m512i lo_odd = _mm512_unpackhi_epi64(b, a);
__m128i hi_10 = _mm512_extracti32x4_epi32(c, 0);
__m128i hi_32 = _mm512_extracti32x4_epi32(c, 1);
__m128i hi_54 = _mm512_extracti32x4_epi32(c, 2);
__m128i hi_76 = _mm512_extracti32x4_epi32(c, 3);
itoa_store_lo64(output_buffer + 0 * 24, hi_10);
itoa_store_u128(output_buffer + 0 * 24 + 8, _mm512_extracti32x4_epi32(lo_even, 0));
itoa_store_hi64(output_buffer + 1 * 24, hi_10);
itoa_store_u128(output_buffer + 1 * 24 + 8, _mm512_extracti32x4_epi32(lo_odd, 0));
itoa_store_lo64(output_buffer + 2 * 24, hi_32);
itoa_store_u128(output_buffer + 2 * 24 + 8, _mm512_extracti32x4_epi32(lo_even, 1));
itoa_store_hi64(output_buffer + 3 * 24, hi_32);
itoa_store_u128(output_buffer + 3 * 24 + 8, _mm512_extracti32x4_epi32(lo_odd, 1));
itoa_store_lo64(output_buffer + 4 * 24, hi_54);
itoa_store_u128(output_buffer + 4 * 24 + 8, _mm512_extracti32x4_epi32(lo_even, 2));
itoa_store_hi64(output_buffer + 5 * 24, hi_54);
itoa_store_u128(output_buffer + 5 * 24 + 8, _mm512_extracti32x4_epi32(lo_odd, 2));
itoa_store_lo64(output_buffer + 6 * 24, hi_76);
itoa_store_u128(output_buffer + 6 * 24 + 8, _mm512_extracti32x4_epi32(lo_even, 3));
itoa_store_hi64(output_buffer + 7 * 24, hi_76);
itoa_store_u128(output_buffer + 7 * 24 + 8, _mm512_extracti32x4_epi32(lo_odd, 3));
```

Since the full string spans across three 64-bit lanes we unpack the `A`

and `B`

registers to form `BBBBBBBB|AAAAAAAA`

pairs next to each other so they can be stored as 128-bit quantities. The remaining 8 bytes stored in the `C`

register are stored separately. Note that only 4 bytes in the `C`

register are actually used, but in our example we just store all 8 bytes.

# Possible Improvements

The code we have presented converts 64-bit integers to strings regardless of their values. It’s possible to optimize the conversion in case that all integers are reasonably small; in our case it would be less than 8 digits when stringified, reserving one character in each lane for a minus sign. In that case the split of the input registers into `A`

, `B`

, and `C`

would be avoided - we would only need the `A`

register, which would significantly simplify the length calculation and sign insertion steps. Considering that numbers stored in databases are not usually large this optimization could improve the overall conversion throughput.

# Performance Comparison

We have compared the performance of our implementation against C++17 `<charconv>`

(using a low-level `std::to_chars()`

function) and a standard C library `itoa()`

function. There are two flavors of our implementation - AVX-512 generic, which always converts all ranges of integers, and AVX-512 optimized, which contains two code paths, in which one optimizes for integers less than `10000000`

(note that integers in all lanes must pass the condition).

As can be seen above our implementation runs in constant time regardless of the inputs. The small integer optimization makes the conversion approximately 3 times faster if all integers are less than `10000000`

and doesn’t slow down the generic case. The most important point is that the implementation always runs faster than any scalar conversion even if the inputs are extremely small, like 1 or 2 digits in the output string.

# Conclusion

AVX-512 can be used to improve the performance of integer to string conversion if it’s known in advance that multiple integers can be converted at once. The performance improvement is significant for all possible inputs, even for those that are extremely small. The most important feature of our implementation is the possibility to convert 64-bit integers for all possible values, which needs at most 20 output characters per integer. We have highlighted this as there are SIMD optimized implementations that can only convert 64-bit integers that are not greater than 16 decimal digits.

The implementation provided in this article is a port of our real implementation that is written entirely in assembly and converts 16 integers at once (compared to 8 integers presented here). In this article we have used C++ and compiler intrinsics to make the implementation more straightforward and also easier to reuse in case anybody wants to use it in a C/C++ codebase.

### Try Sneller for Free

You can try Sneller right now on your own data **for free**
through our playground,
or you can sign up for Sneller Cloud,
which gives you access to automatic ingest from your S3 buckets and hundreds
of CPU cores on which to run your queries.

If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!