# Accelerating Fuzzy Search using AVX-512

by Henk-Jan Lebbink | May 8, 2023

## Accelerating Fuzzy Search using AVX-512

In this post, we will explore our SQL fuzzy string compare and contains functionality that allows multi GiB/s/core without the need for preprocessing or indexing.

Fuzzy string comparison is a valuable technique used for comparing strings that may be similar but not exactly the same. It is particularly useful when working with natural language data or when users make spelling errors. SQL provides several methods for implementing fuzzy string search, but we will focus on a method that works well with massive datasets, including those that are terabyte or petabyte in size. In such large-scale scenarios, we need methods that do not require preprocessing or indexing and only need to read data once. We at Sneller are interested in simplified planning and maintenance of schema-less terabyte storage systems by making preprocessing redundant.

### Edit Distance

Fuzzy search methods often use an edit distance metric. In our approach, we will use an estimation approach based on the Damerau-Levenshtein distance. One might wonder why we use estimation instead of calculating the true edit distance. As we will see, calculating true distances can be extremely difficult and may not be practical for fuzzy functions that inherently involve some degree of imprecision.

The term “edit distance” refers to a mathematical concept used to describe string comparison systems that are closed under specific string operations. There are several different distance measures that can be constructed using these operations, including:

• Substitution, which results in Hamming distance
• Transposition, which results in Jaro-Winkler distance
• Substitution, deletion, and insertion, which results in Levenshtein distance
• Substitution, deletion, insertion, and transposition, which results in Damerau-Levenshtein distance.

The Sneller query tool supports two SQL functions: the `EQUALS_FUZZY` function checks if a given data string matches a provided string literal by allowing a certain number of string edits. The function estimates the minimum number of edits required to transform the data string into the literal string. If the estimated number of edits is below a specified threshold, the two strings are considered to have a fuzzy match. In other words, the function checks if the data string can be transformed into the literal string with no more than a certain number of edits. Similarly, the `CONTAINS_FUZZY` function checks if a string literal is present in a data string by allowing a certain number of edits.

When searching through terabytes of data, the term “minimal” in “minimal edit distance” should be cause for concern. The number of possible edit sequences can be prohibitively large, and determining whether a particular sequence is minimal often requires examining every solution in the set. It is likely that computing the minimal edit distance is an NP-complete problem, and the most efficient algorithm available may require quadratic time. However, for functions as inherently imprecise as fuzzy search, finding exact edit distances may not be necessary. An approximation approach can meet our needs and enable very efficient implementations.

The strings ‘abcde’ and ‘acfee’ can have multiple edit sequences. To provide an example, one possible sequence involves substituting the second character of the second string ‘b’ with ‘c’ at position 2, replacing ‘c’ with ‘f’ at position 3, and substituting ’d’ with ’e’ at position 4. Another sequence could involve deleting ‘b’ at position 2, inserting ‘f’ at position 3, and substituting ’d’ with ’e’ at position 4. Although these are small strings, they still have a significant number of possible solutions, and we need to find the minimal sequence among them.

### Edit Distance Approximation

Our approximation approach involves examining three consecutive Unicode characters at a time. More precisely, we first analyze the current character, and then we look ahead two characters to determine the minimum number of string operations needed to match three characters in the data string with three characters in the given string literal.

To explain how these approximations work, it is easiest to first consider a scenario where we do not look ahead any characters. We have two strings and two current positions in both strings. We have an accumulated edit distance of the strings before the current position, and we want to compute how much the distance increases if we consider the current position. Since we do not look beyond the current position, we cannot determine if the edit is a transposition, deletion, or insertion; we can only see substitutions. A transposition occurs when two characters that are next to each other in one string are swapped in the other string. With this zero-character look-ahead, we compute the Hamming distance between the two strings. Here is the pseudocode for this approach:

``````WHILE (DATA not empty) OR (NEEDLE not empty) DO
D0 := DATA
N0 := NEEDLE

IF (D0==N0) // the first characters match
ELSE
ENDIF
ENDDO
``````

If the approximation considers a one-character look-ahead, it can observe at the current position: single insertions, deletions, transpositions, and two substitutions. For example, if the data is `abc` and the needle is `bac`, the minimal edit distance would be 1. It would be classified as a transposition because the first character of the data (`D0`) equals the second character of the needle (`N1`), and vice versa.

``````WHILE (DATA not empty) OR (NEEDLE not empty) DO
D0 := DATA
N0 := NEEDLE

IF (D0==N0) // the first characters match
ELSE
D1 := DATA
N1 := NEEDLE

// character is deleted in data
IF (D1!=N0) && (D0==N1)
ENDIF

// character is inserted in data
IF (D1==N0) && (D0!=N1)
ENDIF

// characters are transposed in data
IF (D1==N0) && (D0==N1)
ENDIF

// all remaining situations: character is substituted in data
IF (D1!=N0) && (D0!=N1)
ENDIF
ENDIF
ENDDO
``````

Note that we may observe two substitutions, but we only process the first one. This is because our approximation method yields better results if we handle the second substitution in the context of the next “current position”. For instance, the distance between the two strings may be more accurately classified with an insertion when it is considered alongside the subsequent characters.

As we increase our look-ahead to two characters, our approximation method becomes more complex. At the current position, we can detect one or two deletions, one or two insertions, one transposition, or one transposition interleaved with one insertion. All other differences between the strings are considered as a single substitution.

Although there could be other combinations of operations, such as a transposition followed by a substitution, we have found that the following pseudocode produces the most accurate estimate of the actual edit distance when using a two-character look-ahead. To find this optimal estimation method, we generated all possible strings of length less than 8 using an alphabet of 4 characters, calculated the estimated edit distance, and subtracted the true edit distance. The total of all differences represents the cumulative error of the estimation method. We selected the method with the lowest cumulative error from the entire range of estimation methods. Only several percentages of the estimated edit distances were overestimated while the remainder was estimated correctly.

``````WHILE (DATA not empty) OR (NEEDLE not empty) DO
D0 := DATA
N0 := NEEDLE

IF (D0==N0) // the first characters match
ELSE
D1 := DATA
D2 := DATA
N1 := NEEDLE
N2 := NEEDLE

// two characters are transposed in data
IF (N0!=D0) && (N0==D1) && (N1==D0)
ENDIF

// one character is deleted in data
IF (N0!=D0) && (N0!=D1) && (N1==D0) && (N2==D1)
ENDIF

// two characters are deleted in data:
IF (N0!=D0) && (N1!=D1) && (N2!=D2) && (N2==D0) && (N0!=D1) && (N1!=D2) && (N1!=D0) && (N2!=D1) && (N0!=D2)
ENDIF

// one character is inserted in data
IF (N0!=D0) && (N0==D1) && (N1==D2) && (N1!=D0)
ENDIF

// two characters are inserted in data
IF (N0!=D0) && (N1!=D1) && (N2!=D2) && (N2!=D0) && (N0!=D1) && (N1!=D2) && (N1!=D0) && (N2!=D1) && (N0==D2)
ENDIF

// transposition and insertion ab -> bca:
IF (N0!=D0) && (N1!=D1) && (N2!=D2) && (N2!=D0) && (N0!=D1) && (N1!=D2) && (N1==D0) && (N2!=D1) && (N0==D2)
ENDIF

// substitution in all remaining situations:
ENDIF
ENDDO
``````

This approximation only overestimate the edit distance between two strings, which means that sometimes the distance is estimated to be three when it is actually only two. As a result, when searching for strings with a maximum edit distance of three, some results with a true edit distance of three may be missed because they were estimated to have a distance of four. While this may appear to be a problem, the benefit of this approximation is that it allows us to find high edit distances at speeds comparable to string comparisons. We will discuss this more in detail later.

#### Fuzzy Match vs. Fuzzy Contains

The difference between a fuzzy match and a fuzzy contains is that in the former two strings are compared for a match, while in the latter a substring is matched at all positions in another string. Broadly speaking, a fuzzy contains is an iterated fuzzy match, making a fuzzy contains significantly more computationally expensive compared to a fuzzy match. Additionally, we have two versions of the implementation: one for ASCII-only fuzzy matching and another for Unicode-aware fuzzy matching.

Some examples:

``````-- string literal 'cache' is treated as a byte sequence
EQUALS_FUZZY('Cash', 'cache', 1) -> FALSE
EQUALS_FUZZY('Cash', 'cache', 2) -> TRUE
EQUALS_FUZZY('Cash', 'cache', 3) -> TRUE

-- string literal 'Straße' is treated as a unicode sequence
EQUALS_FUZZY_UNICODE('strasse', 'Straße', 1) -> FALSE
EQUALS_FUZZY_UNICODE('strasse', 'Straße', 2) -> TRUE

-- string literal 'Straße' is treated as a byte sequence
-- (note that `ß` is a 2-byte sequence)
EQUALS_FUZZY('strasse', 'Straße', 1) -> FALSE
EQUALS_FUZZY('strasse', 'Straße', 2) -> TRUE

-- string literal 'Fox Jumps' has a distance of 3 to 'foks jums'
CONTAINS_FUZZY('The quick brown foks jums over the lazy dog', 'Fox Jumps', 3) -> TRUE
``````

### Too much talk and little code!

Let’s take a look at some actual code that demonstrates how to implement the pseudocode in a branchless manner using 16 parallel lanes and the instruction set of a SkylakeX. As mentioned earlier, we have two versions: one for matching only ASCII characters and another for matching Unicode characters. Although they are conceptually similar, the ASCII version is much simpler because we can gather three ASCII bytes for 16 lanes with a single `VPGATHERDD` instruction. In contrast, the Unicode version requires three `VPGATHERDD` instructions to load the first three Unicode values. In this blog post, we will focus on the ASCII implementation.

#### Step 1: Gather data and needle

Step 1 consists of loading four ASCIIs from the lanes that still need to be processed for both the data and the needle.

``````loop:
VPCMPD        \$6,  Z11, Z3,  K2,  K4    ; K4 := K2 & (data_len>0)     ;K2=lane_todo; Z3=data_len; Z11=0
VPCMPD        \$6,  Z11, Z13, K2,  K5    ; K5 := K2 & (needle_len>0)   ;K2=lane_todo; Z13=needle_len; Z11=0
KMOVW         K4,  K3                   ; K3 := K4                    ;
VPGATHERDD    (SI)(Z2*1),K3,  Z8        ; gather data                 ;Z8=data; SI=data_ptr; Z2=data_off
VPGATHERDD    (R14)(Z5*1),K5,  Z9       ; gather needle               ;Z9=needle; R14=needle_ptr; Z5=needle_off
``````

#### Step 2: Remove Tail

Since we use a two-character look-ahead, meaning that we only consider the first three characters, we’re setting the last byte of each lane to `0xFF`. To discard the tail and set bytes that do not belong to the data to `0xFF`, we calculate the minimum value between three and the remaining byte length of the data (`Z3`), which serves as a key to retrieve the mask using a `VPERMD` lookup. Then, we use an OR operation to apply the mask to the data, which is the fastest method to achieve this. It’s worth noting that we don’t have to handle the tail of the needle because our Go code pads the needle with sufficient `0xFF` bytes.

``````VPMINSD       Z3,  Z24, Z26             ; scratch1 := min(3, data_len) ;Z26=scratch1; Z24=3; Z3=data_len
VPORD         Z8,  Z26, K4,  Z8         ; data |= scratch1             ;Z8=data; K4=scratch1; Z26=scratch1
``````

#### Step 3: Put ASCII into Uppercase

We consider ASCII values that are equivalent under case-folding to have an edit distance of zero, meaning that lowercase and uppercase ASCII values are not distinguished from each other. To convert a lowercase ASCII character to uppercase, we check if its value is greater than or equal to ASCII `a` and less than or equal to `z`. If this condition is true, we need to set the 6th bit to 1, which can be done using the `VPTERNLOGD` instruction. However, in the Unicode-aware version of the code, code-points that are equal under case-folding are not treated as having a zero edit distance, as implementing such functionality leads to unsatisfiable register pressure.

``````VPCMPB        \$5,  Z16, Z8,  K3         ; K3 := (data>=char_a)        ;K3=tmp_mask; Z8=data; Z16=char_a
VPCMPB        \$2,  Z17, Z8,  K3,  K3    ; K3 &= (data<=char_z)        ;K3=tmp_mask; Z8=data; Z17=char_z
VPTERNLOGD    \$0b01001100,Z15, Z8,  Z26 ;                             ;Z26=scratch1; Z8=data; Z15=c_0b00100000
``````

#### Step 4: Shuffle and Compare

We perform two shuffles on the needle to create three versions of it: the original ‘012’, as well as ‘120’ and ‘201’. We refer to these bytes as `N0`, `N1`, and `N2`, respectively. We then compare these with the data ‘012’, which we refer to as `D0`, `D1`, and `D2`. The three mask registers (`K3`, `K4`, and `K5`) contain the results of the comparisons: `K3` shows whether `N0==D0`, `N1==D1`, and `N2==D2`. K4 shows whether `N1==D0`, `N2==D1`, and `N0==D2`. Finally, `K5` shows whether `N2==D0`, `N0==D1`, and `N1==D2`.

``````VPSHUFB       Z25, Z9,  Z27             ;                              ;Z27=scratch2; Z9=needle; Z25=ror_a3
VPSHUFB       Z25, Z27, Z28             ;                              ;Z28=scratch3; Z27=scratch2; Z25=ror_a3
VPCMPB        \$0,  Z26, Z9,  K3         ; K3 := (needle==scratch1)     ;K3=tmp_mask; Z9=needle; Z26=scratch1
VPCMPB        \$0,  Z26, Z27, K4         ; K4 := (scratch2==scratch1)   ;K4=scratch1; Z27=scratch2; Z26=scratch1
VPCMPB        \$0,  Z26, Z28, K5         ; K5 := (scratch3==scratch1)   ;K5=scratch2; Z28=scratch3; Z26=scratch1
``````

#### Step 5: Fuzzy Match Logic

With these three mask registers, we can determine the edit distance increase and the number of bytes to advance for both the data and needle. We accomplish this by merging the bits of the three mask registers into a 9-bit key and using `VPGATHERDD` to find the required values. For those interested in a challenge, we use a clever technique to shuffle bits ‘a’ to ‘g’ in the mask registers into the 9 least significant bits in register `Z26`.

``````VMOVDQU8.Z    Z19, K3,  Z26             ; 0000_0000 0000_000a 0000_000b 0000_000c; K3=tmp_mask; Z19=0x10101
VMOVDQU8.Z    Z20, K4,  Z27             ; 0000_0000 0000_00d0 0000_00e0 0000_00f0; K4=scratch1; Z20=0x20202
VMOVDQU8.Z    Z21, K5,  Z29             ; 0000_0000 0000_0g00 0000_0h00 0000_0i00; K5=scratch2; Z21=0x40404
VPTERNLOGD    \$0b11111110,Z29, Z27, Z26 ; 0000_0000 0000_0gda 0000_0heb 0000_0ifc;
VPMADDUBSW    Z26, Z22, Z26             ; 0000_0000 0000_0gda 0000_0000 00he_bifc; Z22=0x10801
VPMADDWD      Z26, Z23, Z26             ; 0000_0000 0000_0000 0000_000g dahe_bifc; Z23=0x400001
``````

Our Go code will prepare a pointer that contains the values needed for the fuzzy match logic. For example, consider the case where the data is “bcd” and the needle is “abc”. This is a situation where one character has been deleted from the data, as per the pseudocode.

``````// one character is deleted in data
IF (N0!=D0) && (N0!=D1) && (N1==D0) && (N2==D1)
``````

The condition `(N0!=D0) && (N0!=D1) && (N1==D0) && (N2==D1)` results in a specific 9-bit key, and at the corresponding index of the key, `editDistance += 1; advanceData += 1; advanceNeedle += 2` is stored.

``````KMOVW         K2,  K3               ; tmp_mask := lane_todo  ;K3=tmp_mask; K2=lane_todo;
VPGATHERDD    (R15)(Z26*1),K3,  Z27 ; retrieve how to update ;Z27=scratch2; K3=tmp_mask; R15=update_ptr; Z26=key;
``````

#### Step 6: Unpack Results

It is quite straightforward to extract the deltas for the edit distance (`Z28`), advance needle (`Z9`) and advance data (`Z8`) from the previously gathered data (`Z27`).

``````VPSRLD        \$4,  Z27, Z28       ; ed_delta := scratch2>>4         ;Z28=ed_delta; Z27=scratch2
VPANDD        Z24, Z28, Z28       ; ed_delta &= 3                   ;Z28=ed_delta; Z24=3
VPANDD        Z24, Z27, Z8        ; adv_data := scratch2 & 3        ;Z8=adv_data; Z27=scratch2; Z24=3
``````

#### Step 7: Update

The next step is quite simple: update the edit distance, advance data, and advance needle using the deltas that were extracted in the previous step.

``````VPADDD        Z28, Z4,  K2,  Z4   ; edit_dist += ed_delta           ;Z4=edit_dist; K2=lane_todo; Z28=ed_delta
VPSUBD        Z8,  Z3,  K2,  Z3   ; data_len -= adv_data            ;Z3=data_len; K2=lane_todo; Z8=adv_data
VPSUBD        Z9,  Z13, K2,  Z13  ; needle_len -= adv_needle        ;Z13=needle_len; K2=lane_todo; Z9=adv_needle
``````

#### Step 8: Are we done?

If the edit distance exceeds the provided threshold value (`Z14`), the corresponding lanes are labeled as dead and are subsequently removed from the active lanes.

``````VPCMPD        \$2,  Z14, Z4,  K2,  K2  ; K2 &= (edit_dist<=threshold)  ;K2=lane_todo; Z4=edit_dist; Z14=threshold
``````

In the final step, if there are no more characters left in both the needle and the data, and the lane is still in the to-do list, it means we have found a fuzzy match. We add this lane to the list of active lanes (`K1`) and remove it from the lanes to-do (`K2`). If there are still lanes left to process, we go back to step 1 to collect new data and continue with the matching procedure.

``````VPCMPD        \$5,  Z13, Z11, K2,  K3  ; K3 := K2 & (0>=needle_len)  ;K3=tmp_mask; K2=lane_todo; Z13=needle_len
VPCMPD        \$5,  Z3,  Z11, K3,  K3  ; K3 &= (0>=data_len)         ;K3=tmp_mask; Z11=0; Z3=data_len
KTESTW        K2,  K2                 ; any lanes still todo?       ;K2=lane_todo
JNZ           loop                    ; yes, then loop; jump if not zero (ZF = 0)
``````

The code can be found in our GitHub repository.

### Simple Benchmarking

Next, we will present some basic benchmarking results. We did not compare our system with other fuzzy search systems such as PostgreSQL with trigram indexes because we do not create indexes. Creating a trigram index for 227GB of text just to check for spelling errors is no fun. Instead, we run speed tests and compare results to disk I/O and compression speeds. Specifically, we use the dataset provided by ClickHouse for their ClickBench benchmark, which is a compressed JSON file hits.json.gz which is 21.1GB in size and contains 227GB of uncompressed text.

We have compressed the data (227GB) with the Sneller open-source self-describing compressed table format (`.zion`) with the following command:`pack -o=clickbench.iguana.zion -c=zion+iguana_v0 hits.json.gz`. This results in a file `clickbench.iguana.zion` of 11.5GB. To obtain statistics, we use the `-S` flag with the Sneller query engine.

#### Fuzzy Equality

``````-- scenario 1
\$ sneller -j -S "SELECT COUNT(Title) FROM 'clickbench.iguana.zion' WHERE Title = ' biden '"
22.8GiB/s (scanned 61984473088 B (57.73 GB) in 2.6588645s)
``````

The Sneller query engine only needs to decompress the Title column which is 57.73 GB of scanned data which is only a fragment of the total 227GB of uncompressed data. It could process the query at 22.7GiB of data per second, with 16 threads on an 8 core Intel 11900K with 64GB DDR4-3200. The total time to finish the query is 2.65 seconds.

``````-- scenario 2
\$ sneller -j -S "SELECT COUNT(Title) FROM 'clickbench.iguana.zion' WHERE EQUALS_FUZZY(Title, ' biden ', 1)"
22.8GiB/s (scanned 61984473088 B (57.73 GB) in 2.6530886s)
``````

Note that the fuzzy equality has the same run time as the string equality comparison from scenario 1; the difference is not significant.

#### Fuzzy Contains

``````-- scenario 3
\$ sneller -j -S "SELECT COUNT(Title) FROM 'clickbench.iguana.zion' WHERE Title ILIKE '% biden %'"
22.7GiB/s (scanned 61984473088 B (57.73 GB) in 2.6628499s)
``````
``````-- scenario 4
\$ sneller -j -S "SELECT COUNT(Title) FROM 'clickbench.iguana.zion' WHERE CONTAINS_FUZZY(Title, ' biden ', 1)"
11.4GiB/s (scanned 61984473088 B (57.73 GB) in 5.3130959s)
``````

The case-insensitive ASCII-only version of `CONTAINS_FUZZY` (scenario 4) treats ASCII characters that are equal under case-folding as not requiring an edit, which means we can use `CONTAINS_FUZZY(Title, ' biden ', 0)` to compare its performance against `Title ILIKE '% biden %'` (scenario 3), which produces the exact same results. We see that loading 11.5GB, decompressing 57.73GB of the table Title, and comparing doing a unicode string contains (scenario 3) takes the same time as the most simple byte wise memory compare (scenario 1), which means that the computations in scenario 3 are still memory bound.

As the table below demonstrates, the highly optimized Unicode-aware ILIKE function is 40% faster than CONTAIN_FUZZY, which is still a respectable speed considering the non-trivial fuzzy matching capability it offers.

Execution times for thresholds 0 to 5: Reasonable threshold values do not affect the runtime of fuzzy equals, whereas for fuzzy contains, there is a linear relationship between threshold and runtime. Fuzzy equals makes comparisons at the start of the data string and up to the threshold, and thus the number of data strings affects the runtime, but not the length of the data string. Fuzzy contains starts at the first position and if there’s no match, restarts from the next position, considering all positions of the data string if the pattern is not found. Thus, the length of the data string significantly affects the runtime.

Moreover, the runtime of ASCII contains is approximately 1.8 times faster than Unicode contains. Increasing the threshold by 1 leads to a 70% increase in runtime of the contains functions.

### Wrap up

In this post we have provided details about our implementation of a fuzzy match and fuzzy contains functionality that does not need any indexes. We see a linear relation between the length of the search string and the runtime of the contains functionality. We also see a linear relation between the edits threshold and the runtime. And although fuzzy contains functionality is not a memory bound computation we still observe good performance for common and acceptable thresholds.

In other blog posts we performed similar string comparisons, for example regular expression matching or case-insensitive unicode matching.