One . Background introduction

It has always been a headache to do efficient similarity search in the vast data stream . such as , I'm searching for dogs app I read an article on , The recommender system should push the article closest to this article for me , All articles in the database are represented by vectors , So the problem we have to solve is “ Find the vectors that are closest to the vectors in this article ”, Then push out the articles corresponding to these vectors . There are thousands of articles in the database , All users have tens of millions of requests per second , We need a quick, accurate and relatively resource-saving way to solve this problem .

There are many ways to solve this problem ,PQ,Annoy,HNSW wait . Limited space , This article only introduces HNSW Algorithm .

Two . Plain thinking Simple idea map

Please look at the picture above . Suppose we have now 13 individual 2 Dimensional data vector , We put these vectors in a plane rectangular coordinate system , Hide coordinate system scale , Their positional relationship is shown in the figure above .

Simple search method ： Many people have such simple ideas in their minds , Connect some points to points , Form a lookup graph , Save it for later ; When I want to find the nearest pink point , I start from any black spot , Calculate the distance between it and the pink dot , The point connected to this arbitrary black dot is called “ Friends ”（ literal translation ）, And then I'm going to calculate all of the black dots “ Friends ” The distance from the pink dot , From all “ Friends ” Choose the closest point to the pink dot , Take this as the next point , Continue to follow the steps above . If the distance between the current black dot and the pink dot is greater than all “ Friends ” All close , Stop searching , This black dot is the closest we're looking for to the pink dot .

If you don't understand the text above , Let's give you an example . The goal is ： We're looking for the closest point to the pink dot . step ： Starting from any black spot , Let's pick any one here C Click on it. , Calculate C The distance between dot and pink dot , Save it for later , Calculate again C All of your friends （A,I,B） The distance from the pink dot （ There are many ways to calculate distance and measure , Here we use Euclidean distance , It's in two-dimensional physical space “ Near and far ”）, We calculated that B Closest to the pink dot , and B The distance ratio between a dot and a pink dot C The distance between the dot and the pink dot （ It's been calculated before ） A more recent , So we're going to use B Click to continue searching .B The distance between the dots and the pink dots is preserved ,B The friend of the order is E,A,C,I,H, Calculate the distance between them and the pink dot , obtain E The dot is closest to the pink dot , And E Point to B The dot is closer to the pink dot , So we choose E Point as the next lookup point .E The friend of the order is J,B,D,G, And then we found out J The spot is closest to the pink dot , however ,but,however,J The distance between a point and a pink point is greater than E It's a little further away , So the condition to terminate the search is satisfied , So we go back to E spot .

Plain thinking is called plain thinking because it has many disadvantages . First , We found that in the picture K The point can't be found , because K No friends, no friends , What do I do ？. secondly , If we want to find the nearest two points , And if there is no connection between these two points , That will greatly affect efficiency （ such as L and E spot , If L and E There's a connection , Then we can easily use the above method to find out the two closest points to the pink dot ）, What do I do ？. The last big problem ,D It really takes so much “ Friends ” Do you ？ How to determine who is who's friend ？

About K Point question , We stipulate that all data vector nodes must have friendly points in composition . About L and E The problem of , We stipulate that all distances are close in composition （ be similar ） To a certain extent, vectors must be friends with each other . About D Some questions , Weigh the time complexity of constructing this graph , We try to minimize the number of nodes “ Friends ” Number . With these three rules , Let's move on to the next chapter .

3、 ... and .NSW Algorithm

In graph theory, there is a good partition rule to solve the problem of simple ideas mentioned in the previous section ------ Delaunay （Delaunay） Triangulation algorithm , This algorithm can achieve the following requirements ：1, There's a dot in every picture “ Friends ”.2, Similar points are mutual “ Friends ”.3, All the connections in the diagram （ Line segment ） The least amount of . The effect is as follows . but NSW There is no Delaunay triangulation to form a Delaunay triangulation , One of the reasons is that the time complexity of Delaunay triangulation composition algorithm is too high , let me put it another way , The composition is too time consuming . The second reason is that Lauder's triangle is not always the most efficient , If the initial point is far away from the search point, we need to jump many times to find its adjacent point , need “ Expressway ” Mechanism （Expressway mechanism, In this case, there is a line connection between some remote points , In order to quickly find ）. In an ideal state , Our algorithm should not only meet the above three requirements , And the algorithm complexity is low , At the same time with the highway mechanism composition . NSW There is a picture in the paper , Black is the connection between neighboring points , The red line is “ Highway mechanism ” 了 . We from enter point Click in to find , When you look for a green dot near a node , You can use the red wire “ Highway mechanism ” Find the results quickly .

NSW The simple composition algorithm is here ： Insert points into the graph one by one , When the illustration is a whole new point , Through the simple search method in the simple idea （ By calculation “ Friends ” And the distance from the point to be inserted to determine which point is the next entry point ） Find the nearest to this new point m A little bit （m Set by the user ）, Connect the new point to m The connection of two points . Finished .

In the above sudden algorithm description, some readers will surely ask , It's that simple ？ Yes , It's that simple . Let's analyze the algorithm rationally . First , Our composition algorithm is randomly inserted point by point , This means that in the early days of graph building , It's possible to build “ Expressway ”. Suppose we're going to form 10000 A graph of two points , Set up m=4（ Each point has at least 4 individual “ Friends ”）, this 10000 There are two of them ,p and q, They have exactly the same coordinates . Let's say that in the process of insertion, we are at 10 Insert... Times p, In the 9999 Insert... Times q, Excuse me, p and q Who is more likely to have “ Expressway ”？ answer ： Because in the first place 10 At the next insertion , Only before 9 A little bit , So it can only be before 9 Choose the nearest one from the three points 4 A little bit （m=4） As “ Friends ”, and q There are more choices , front 9998 You can choose from any point , therefore q Of “ Friends ” Closer to the q,p Early stage “ Friends ” Not necessarily close to p, therefore p It's easier to have “ Expressway ”. Conclusion ： One point , The earlier you insert it, the easier it is to form a related “ Expressway ” Connect , The later the insertion, the more difficult it is to form the relevant “ Expressway ” Connect . So the beauty of this algorithm design is to throw away the Delaunay triangle composition , change to the use of sth. “ No brain added ”（NSW Naive insertion algorithm ）, It reduces the time complexity of the composition algorithm, but also brings a limited number of “ Expressway ”, Speed up the search .

Following pair NSM Simple composition algorithm process example , Readers who have read the above description can skip the next paragraph . We are right. 7 Two dimensional points for composition , User Settings m=3（ Each point looks for 3 A close friend ）. First of all, the starting point is A spot （ Random ）,A The dot inserts itself into the graph , So you can't pick “ Friends ”. And then there was B spot ,B It's just A You can choose , So connect BA, This is the second 1 Subconstruction . Then insert F spot ,F Only A and B You can choose , So connect FA,FB, This is the second 2 This structure . And then inserted C spot , similarly ,C It's just A,B,F Optional , Connect CA,CB,CF, This is the second 3 Subconstruction . The key is coming. , And then inserted E spot ,E The point is A,B,F,C You can only choose 3 A little bit （m=3） As “ Friends ”, According to the rules we talked about earlier , Choose the three closest , How to be sure of the latest ？ Simple search ！ from A,B,C,F Starting from any point , The starting point of calculation and E The distance and starting point of all “ Friends ” and E Distance of , Choose the nearest point as a new starting point , If the chosen point is the starting point itself , So look at our m Equal to several , If not enough , Just keep looking for the second closest point or the third closest point , Based on the principle of no repetition , Until I find 3 It's a little closer . thus , We found it E Three close points of , Connect EA,EC,EF, This is the fourth construction . The first 5 The secondary structure and the second structure 6 Secondary and E The insertion of dots as like as two peas , It's all in “ Ready-made ” In the graph of 3 The nearest node as “ Friends ”, And make connections .

The picture is over , Please pay attention to E Point and A The line of points , If I insert it on the basis of this graph 6 A little bit , this 6 One point is 3 And E Very close , Yes 3 And A Very close , Distance, then E Current 3 There's no point A, distance A Current 3 There's not one of them E, But because A and E It's an early addition to the composition ,A and E With the connection , We call this connection “ Expressway ”, It can improve the efficiency of searching （ When the entry point is E, To find the distance A Very close , We can go through AE Connect from E Directly to A, It's not a little step, a little step, a lot of jumps to A）.

About NSW That's the simple idea of the algorithm , Let's talk about optimization .

One . In the process of searching , In order to improve efficiency , We can create a scrap list , Points traversed in a search task are no longer traversed . In a search , We have calculated the distance between all friends of this point and the search point , And already know the right direction to jump , These results are unique , There is no need to take this path again , Because this path will bring us the same repetitive results , It makes no sense .

Two . During the search , In order to improve the accuracy , We can create a dynamic list , Put the nearest to the search point n Points are stored in the table , Parallel to this n At the same time “ Friends ” The distance from the point to be found , In these “ Friends ” Choose from n Points and dynamic columns n Join points , Select in and focus on n A recent friend , Update dynamic list .

Be careful , The insertion process is preceded by a search , So optimizing the search process is optimizing the insertion process . The following are given NSW Find steps . Let's look for q Dot m Nearest neighbor point .

1. Randomly select a point as the initial entry point , Create an empty scrap table g And dynamic lists c,g It's a long list ,c It's a fixed length s A list of （s>m）, Put the initial point in the dynamic list c（ Attach initial points and to be found q Distance information for ）, Making shadow lists of dynamic lists c'.

2. For dynamic lists c Find all the points in parallel “ Friends ”, View these “ Friends ” Whether it is stored in the scrap table g in , If there is , Then discard , If there is no , Will these   The remaining “ Friends ” It's on the scrap list g in （ In order to avoid subsequent repeated search , Take the wrong path ）.

3. Parallel computing of the remaining “ Friends ” Distance from the point to be found q Distance of , Put these points and their respective distance information into c.

4. For dynamic lists c duplicate removal , And then sort by distance （ Ascending ）, Before storage s Points and their distance information .

5. View the dynamic list c and c' Whether or not the same , If the same , End this search , Go back to the top of the dynamic list m results . If it's not the same , take c' The content of is updated to c Of     Content , Execution section 2 Step .

The insertion algorithm is simpler , The insertion algorithm is to use the search algorithm to find m individual （ User Settings ） The point closest to the point to be inserted , Connect them , Finished .

That's all NSW The whole content of the algorithm ,HNSW Namely NSW Optimize the version . If you need to digest the above , I suggest you look at the following tomorrow ,   Because I didn't understand his thesis in a day . To understand this algorithm , The author has written three papers ,NSW It's one of them , The other is HNSW, No   see NSW I can't understand HNSW, In addition, we also published fossil level papers on veronoy polygon and Delaunay triangulation , then Malkov Sir is NSW I was told in a paragraph of the paper that , It's constructed “ Pseudo Delaunay triangulation ” It has nothing to do with true Delaunay triangulation , It's just a random name , I almost choked to death from drinking water .

Four . Skip table structure

There is an ordered list , named sorted_link, There are n Nodes , Each node is an integer . Let's start with the header , Find the first t（0<t<n） Nodes       Need to jump a few times ？ answer ：t-1 Time （ you 're right , I am from 1 Start counting ）. hold n The nodes are divided into n The requirements of the second search , Look it all up , Need to jump a few times ？ answer ：          （0+1+2+3+.....+（n-1）） Time .

What if my list looks like this ？ This is no longer an ordered list , Here are three ordered lists + The hop table is composed of hierarchical connection pointers . If you look at this diagram, you can see its search process , First check the first floor , Then check the second floor , Then look up the third floor , And find the result . If the name described in the preceding paragraph is sorted_link To create such a skip list , Then put the sorted_link Check all the elements in, and it will cost （0+1+2+3+.....+（n-1）） Time? ？ Of course not . So how many times ？ If you really care about , Please search for key words “ Jump watch ”.

How to build a jump table ？ Three words , Flip a coin . about sorted_link Each node in the list tosses a coin , If you throw it right , Then the node enters the upper ordered chain surface , Every sorted_link The nodes in are 50% The probability of entering the upper level of the ordered list . Neutralize the previous ordered list sorted_link The same elements in the linked list are linked with pointers one by one . Again from sorted_link Toss a coin in the upper list ,sorted_link The nodes in the upper linked list are 50% It's possible to get to the very surface , amount to sorted_link There are each node in 25% The probability of getting to the surface . And so on .

This ensures that the surface is “ High speed channel ”, The bottom layer is fine search , This idea has been applied to NSW In the algorithm, , It's an upgrade -----HNSW.

5、 ... and .HNSW Algorithm

The author can be understood by a sketch in the paper malkov It means . The first 0 Layer , It's all the points in the dataset , You need to set a constant ml, Through the formula floor(-ln(uniform(0,1)) x ml) To figure out how many layers this point can go down to . In the formula x It's a multiply sign ,floor（） It means rounding down ,uniform（0,1） It means to take a value randomly from a uniform distribution ,ln（） Let's take logarithm . For the above three functions have any questions, please download Sogou search app, search , What has . It's OK. Look at the information flow , Scold Xiaobian , Sufu .

Here we are , About HNSW The description of the algorithm is basically over . Let's sort out its search process , From the surface （ The number in the picture above is Layer=2） Any point to start looking for , Select the closest friends of the entry point , Store them in a fixed length dynamic list , Don't forget to keep a copy of them in the scrap form as well , In case of being wronged . In a general way , In the x The next time you look up , First calculate the distance between the friend point and the point to be found in the dynamic list （ The picture above is a little green ） Distance of , Don't count friends recorded in the discard list , Update the scrap list after calculation , Don't take the wrong path , Then put these calculated friends into the dynamic list , To reorder , Leave the k A little bit , Look at this. k Points and pre update k Is that the same , If it's not the same , Continue to find , If it's the same , Return to the former m results .

When you insert composition , First, calculate the level to which this point can go , On each floor NSW Look for t The nearest neighbor , Connect them separately , Do this for each level graph , That's it .

We need to control a lot of parameters , First , Dynamic list at insert time c Size , Its size directly affects the insertion efficiency , And the quality of the composition ,size The bigger it is , The higher the quality of the graph , The less efficient the composition and search . secondly , A node has at least several “ Friends ”,“ Friends ” The more , The higher the quality of the graph , The less efficient the search is . The author also mentioned in his paper “max Number of friend connections ” This parameter , Set how many friends a node can have at most , To improve search efficiency , But if it is too small, it will affect the quality of the graph , Weigh it up . In the last paragraph ml It's you who control it , It's set too big , Fewer layers , Less memory consumption , But it seriously affects efficiency , Too large will consume a lot of memory and composition time . In the paper , The author makes a distinction between dynamic list length in search state and dynamic list length in insert state , You can do it by adjusting them “ Fine structure and rough search ” perhaps “ Fine looking for rough structure ”.

There may be some omissions in the article , Welcome to correct and criticize . If you are satisfied with my article , The money from coke can support me to continue writing , Welcome to reward .

https://blog.csdn.net/u011233351/article/details/85116719

https://cdmana.com/2020/12/20201225072637375W.html

Scroll to Top