Blazing Fast Unicode-aware ILIKE in AVX-512

by Henk-Jan Lebbink | April 17, 2023

Blazing fast Unicode-aware ILIKE with AVX-512

In this blog post, we will explain in detail how Sneller enables field ILIKE '%unicode%' queries to achieve speeds of 1.3GiB/s/core, compared to 1.4GiB/s/core decompression speeds.

Previously, we talked about our development of a SQL query engine entirely written in AVX-512, and how we store data such that columns can be compressed without using a columnar database. In this post, we will discuss another functionality: a case-insensitive comparison of UTF-8-encoded strings. This implementation will utilize 16 parallel lanes, without the need for branches, and is designed for an Intel SkylakeX. We use this implementation for a high performance ILIKE operator.

Case-Folding

During the process of case-folding or capitalization, letters are replaced with alternative letters that differ only in their casing. When performing a case-insensitive search, the search term is compared to elements within a corpus, and the casing of the elements is disregarded. One naive approach to performing a case-insensitive search is to take an element from the corpus, compare it to the search term, and if they are not equal, change the casing of the corpus element and compare again. This process is repeated until all possible ways to fold have been exhausted or a match is found. Our more efficient approach is to compile a so-called “extended search term” that includes all possible variations of the original search term’s foldings.

But first some observations: from the total code space of more than a million (1,112,064) unicode code-points:

  • 12 code-points have three alternatives:
    • U+0398 Θ, U+03B8 θ, U+03D1 ϑ and U+03F4 ϴ are equal under case-folding, same for:
    • U+0345 ͅ, U+0399 Ι, U+03B9 ι, U+1FBE
    • U+0422 Т, U+0442 т, U+1C84 and U+1C85
  • 72 code-points have two alternatives.
  • 2714 code-points have one alternative.
  • 1,109,266 code-points have no alternatives.

Given these numbers, if you are searching for a string that solely consists of letters lacking case-related alternatives, then the string has only one form, rendering case-insensitive matching unnecessary. You can simply compare the two sequences byte by byte, even for non-ASCII values. It is worth noting that numerous scripts, such as Chinese, Arabic, and Japanese, do not have capital letters, while others, like the Latin and Cyrillic scripts, do. Additionally, if the search term has only ASCII values (with one alternative) then a very simple and performant implementation exists. On the other hand, we will focus on more complex situations where the search term includes at least one non-ASCII character with an alternative character.

For example, ASCII U+0073 s has as alternatives U+0053 S, but less known also U+017F ſ. Any search term that contains an s needs a Unicode aware ILIKE implementation because the data could contain a U+017F ſ that will match with the search term. Another example is in German, although the Eszett U+00DF ß is typically not written at the start of a word or sentence, an official capital form U+1E9E was introduced in 2017. The word füße (feet) is capitalized as FÜẞE. The diacritic ü has an alternative diacritic Ü. However, it is important to note that an alternative official spellings of ß is the collated double character ss. It should also be emphasized that alternative spellings (of a word or letter) are not considered equal under case-folding. Therefore, U+0075 u is not considered an alternative for U+00FC ü, nor is ss considered an alternative for an Eszett ß.

Sneller VM

Some basic knowledge about our VM architecture will be helpful to understand the text. We will be utilizing various capabilities of this VM, such as the ability to handle 16 data streams concurrently. The S register serves as a virtual register and stores information related to the 16 strings being matched. It is mapped to two physical 512-bit vector registers, in the examples here, ZMM2 and ZMM3. Specifically, ZMM2 contains 16 rows of 32-bit offsets that are relative to our data pointer RSI, while ZMM3 holds the 16 byte lengths for each string. It’s worth noting that lanes can be disabled, for instance due to a preceding SQL “WHERE” clause. To handle this, the K virtual register is used and contains 16 predicate arguments for the ILIKE instruction. In the subsequent example, the K register is mapped to the physical 16-bit mask register K1. This same K register is also utilized to return the outcome of the ILIKE expression.

Overview of the implementation

One efficient method for implementing this is to create an extended search term that contains the original search term’s letter and three alternatives for each letter, or repeat the last alternative if there are fewer than three alternatives. For example, the word füße would become fFFFüÜÜÜßẞẞẞeEEE in the extended search term. Then, a generic assembly program can be used to check if the first unicode character in the 16 lanes of data matches one of the first four unicode characters in the extended search term, and if the second character matches one of the next four characters in the extended search term, and so on. By using UTF-8 byte sequences in the extended search term instead of unicode code-points, the sixteen data lanes (which are UTF-8 encoded) can be compared (with minor alterations) to the extended search term. As a result, all the complexities involved in case-insensitive searching, with 16 lanes in parallel, can be reduced to four compare instructions and an OR operation. This will be explained in Step 3.

For example, search term füße has a UTF-8 encoding as the hex byte sequence 66.c3.bc.c3.9f.65. The extended search term contains for every code-point in the search term four equivalent UTF-8 encoded code-points, padded with zeros to form a 4-byte sequence. String füße would then contain the following bytes:

  • 66.00.00.00 (U+0066 f padded with zeros)
  • 46.00.00.00 (U+0046 F)
  • 46.00.00.00 (U+0046 F)
  • 46.00.00.00 (U+0046 F)
  • C3.BC.00.00 (U+00FC ü)
  • C3.9C.00.00 (U+00DC Ü)
  • C3.9C.00.00 (U+00DC Ü)
  • C3.9C.00.00 (U+00DC Ü)
  • C3.9F.00.00 (U+00DF ß)
  • E1.BA.9E.00 (U+1E9E )
  • E1.BA.9E.00 (U+1E9E )
  • E1.BA.9E.00 (U+1E9E )
  • 65.00.00.00 (U+0065 e)
  • 45.00.00.00 (U+0065 E)
  • 45.00.00.00 (U+0065 E)
  • 45.00.00.00 (U+0065 E)

To implement an SQL clauses such as WHERE table ILIKE '%füße%', we require a case-insensitive (CI) substring functionality that is Unicode-aware. For simpler clauses such as WHERE table ILIKE 'füße', a Unicode-aware CI string equality functionality is sufficient. The former can be implemented as a repeated string equality comparison, where we first scan for the first character of the search term for all active lanes, and then use the CI string equality function for the matching positions in the data. However, since our focus is on the Unicode-aware CI string comparison, we can concentrate on implementing the CI string equality functionality.

Step 1: Gather data

Collect the next four bytes from each of the 16 lanes that are active into Z8. For instance, if data in the first lane contains the string füße, then the first 32 bits in Z8 will hold the value 66.c3.bc.c3.

loop:
KMOVW         K1,  K3             ;copy eligible lanes    ;K3=tmp_mask; K1=lane_active
VPGATHERDD    (SI)(Z2*1),K3,  Z8  ;gather data            ;Z8=data; K3=tmp_mask; SI=data_ptr; Z2=data_offset

Step 2: Calculate number of bytes in UTF-8

We can use a lookup table to determine the number of bytes used in a UTF-8 sequence, and then use this information to look up the mask that removes the bytes of the next character(s) that we are not interested in. The upper four bits of the first byte in a UTF-8 sequence indicate the number of bytes in the sequence. For example, a byte with the binary representation of 0b0xxx.xxxx indicates that zero bytes follow, meaning that the current byte is an ASCII value. A byte with the binary representation of 0b10xx.xxxx indicates that one byte follows, 0b110x.xxxx indicates that two bytes follow, and 0b1110.xxxx indicates that three bytes follow. Our decompression algorithms includes validity checks for the input string to ensure that it is valid UTF-8. This is important because determining the number of bytes used in a UTF-8 sequence only works with valid strings. Thus, we can create a lookup table with 16 keys that map to the number of bytes in the UTF-8 byte sequence. For instance, 0b000'1101 and 0b0000'1100 map to three bytes.

Let’s take the example of the character f in UTF-8 byte sequence 66.c3.bc.c3. Since this character is represented by one byte, we get the mask ff.00.00.00 and keep 66.00.00.00 by performing an AND operation. In the next step, we will compare this value to the four UTF-8 codes in the extended search term.

VPSRLD        $4,  Z8,  Z26    ;shift 4 bits to right   ;Z26=scratch_Z26; Z8=data
VPERMD        Z21, Z26, Z7     ;get n_bytes_data        ;Z7=n_bytes_data; Z26=scratch_Z26; Z21=table_n_bytes_utf8
VPERMD        Z18, Z7,  Z19    ;get tail_mask           ;Z19=tail_mask; Z7=n_bytes_data; Z18=tail_mask_data
VPANDD        Z8,  Z19, Z8     ;remove tail from data   ;Z8=data; Z19=tail_mask

Step 3: Compare

Register Z8 contains the current character encoded as UTF-8 padded with zeros, what we now only need to do is compare this value to the current four characters in the extended search term. If either of the four comparisons match we have found a case-insensitive match of our unicode character.

VPCMPD.BCST   $0,  (R11)(DX*1),Z8,   K1,  K3  ;K3 := K1 & (data==[ptr+idx]);K3=tmp_mask; K1=lane_active 
VPCMPD.BCST   $0,  4(R11)(DX*1),Z8,  K1,  K4  ;K4 := K1 & (data==[ptr+idx+4]);K4=alt2_match; K1=lane_active 
VPCMPD.BCST   $0,  8(R11)(DX*1),Z8,  K1,  K5  ;K5 := K1 & (data==[ptr+idx+8]);K5=alt3_match; K1=lane_active 
VPCMPD.BCST   $0,  12(R11)(DX*1),Z8, K1,  K6  ;K6 := K1 & (data==[ptr+idx+12]);K6=alt4_match; K1=lane_active 
KORW          K3,  K4,  K3                    ;tmp_mask |= alt2_match         
KORW          K3,  K5,  K3                    ;tmp_mask |= alt3_match          
KORW          K6,  K3,  K1                    ;lane_active := tmp_mask | alt4_match

It could be argued that alternative assembly implementations could be called based on the search term, considering only two or even just one character as an alternative, as characters with three alternatives are very rare. This would avoid unnecessary comparisons for the repeated last alternative if there are fewer than three character alternatives. However, it has been found that such optimizations only provide a marginal speed increase (although this is not shown here).

Step 4: Advance and loop

We now proceed to update the offset into the data (Z2) and reduce the number of bytes remaining (Z3). We increment the offset in the extended search term by four characters (for these four characters we need 0x20 bytes) and decrement the number of Unicode code-points. Then, we jump if there are still code-points left.

VPADDD        Z7,  Z2,  K1,  Z2   ;data_start += n_bytes_data   ;Z2=data_start; K1=lane_active; Z7=n_bytes_data
VPSUBD        Z7,  Z3,  K1,  Z3   ;data_len -= n_bytes_data     ;Z3=data_len; K1=lane_active; Z7=n_bytes_data
ADDQ          $20, R14            ;needle_ptr += 20             ;R14=needle_ptr
DECL          CX                  ;needle_len--                 ;CX=needle_len
JG            loop                ;jump if greater ((ZF = 0) and (SF = OF))

The code shown above in step 3 is used for comparing when either the four characters in the extended search term, or the four bytes in the data contain at least one non-ASCII value. However, we also have a performance improvement (not discussed here) that checks if all four characters are ASCII value. If they are, we can do four case-insensitive ASCII comparisons at once.

You can find the complete code for the CI Unicode-aware string equality functionality and the CI Unicode-aware contains substring functionality in our GitHub repository.

Simple Benchmarking

We perform some simple benchmarking to briefly show the speed differences between two string compare features. They are case-sensitivity and the iterative nature of substring versus string equality tests. We use the 11.5GB ClickBench dataset provided by ClickHouse. These results are with 16 threads on an 8 core Intel 11900K with 64GB DDR4-3200.

Our implementation of case-sensitive LIKE queries without metacharacters (_ and %) only uses the most basic memory comparisons, which provides a baseline for more complex computations. In scenario 1 the data needs to read from disk, it needs to be decompressed, and is compared to a byte string.

-- scenario 1
$ sdb query -v -fmt=json "SELECT COUNT(Title) FROM read_file('clickbench.iguana.zion') WHERE Title LIKE 'schließen'"
22.9GiB/s (scanned 61984473088 B (57.73 GB) in 2.6425099s)

If we switch to the substring functionality, we introduce an iterative aspect to the algorithm where we first search for the first character (which is U+0073 s) and then compare the search term to all the matching lanes. The case-sensitive string comparison begins by comparing the byte lengths of the search term and the data; if they are unequal, the two strings cannot match, which allows for a highly efficient implementation. Unfortunately, this optimization is not possible with the substring implementation, simply because non-ASCII values are encoded with more than one byte. Despite this, the string equality comparison is 2% faster than the substring comparison, which is surprising. The reason for this is that the speed of memory access more heavily constrains string equality implementations, while substring comparisons are less constrained.

-- scenario 2
$ sdb query -v -fmt=json "SELECT COUNT(Title) FROM read_file('clickbench.iguana.zion') WHERE Title LIKE '%schließen%'"
22.5GiB/s (scanned 61984473088 B (57.73 GB) in 2.695599s)

If we change to a case-insensitive string equality tests we obtain the same results as for case-sensitive string equality. Comparing byte length of both search term and data cannot be used here, because the UTF-8 byte sequences of characters equal under case-folding do not need to have the same byte length. Thus, despite not having the length check, this implementation is still memory bound. If we were to delve deeper into the performance analysis of this code, we will realize that the latency of the VPGATHERD instruction is the major factor that affects its speed. Even if we add a costly VPCONFLICTD instruction after the VPGATHERD (without creating a dependency on the result), the code will still remain memory-bound. If we aim to further improve the speed of this already high-performing feature, we must examine the memory access pattern of the VPGATHERD.

-- scenario 3
$ sdb query -v -fmt=json "SELECT COUNT(Title) FROM read_file('clickbench.iguana.zion') WHERE Title ILIKE 'schließen'"
22.9GiB/s (scanned 61984473088 B (57.73 GB) in 2.6463799s)

Let us now examine a noteworthy scenario involving both case-insensitivity and substring functionality.

-- scenario 4
$ sdb query -v -fmt=json "SELECT COUNT(Title) FROM read_file('clickbench.iguana.zion') WHERE Title ILIKE '%schließen%'"
21.1GiB/s (scanned 61984473088 B (57.73 GB) in 2.8640074s),

The performance difference between scenario 1: the case-sensitive equality comparison and scenario 4: the case-insensitive Unicode aware substring implementation is that 4 is only 8% slower. This observation makes sense as both LIKE 'schließen' and ILIKE '%schließen%' are memory-bound operations, but the implementation of LIKE is much simpler compared to ILIKE. Running the code iteratively incurs a few percent speed decrease for LIKE and somewhat more for ILIKE.

Wrap up

We’ve provided some insights into our implementation of a Unicode-aware, case-insensitive ILIKE operator. One observation we made was that, by analyzing the search term, we can determine the most appropriate implementation - there’s no need to perform an expensive case-insensitive Unicode-aware comparison if the search term only contains Chinese characters with no case variations. Another observation we made is that, by compiling what we call an “extended search term,” we can simplify the assembly code and perform most of the work at compile time. This approach requires only four compare instructions to compare the search term with the data. As a result, our implementations are memory-bound for string comparison and yield a substring implementation that is just 8% slower compared to the simplest string comparison.

In followup blog posts we will share details of other string functions: to give you two pointers: regular expressions and fuzzy string search.

Try Sneller for Free

You can try Sneller right now on your own data for free through our playground.

If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!