Choosing an index is not obvious, so here are several questions that can help you pick one.

**Do you need exact results?**

Then use Flat (`IndexFlatL2`). It is the only index that guarantees exact results, and it provides the baseline against which the other indexes are compared. It does not compress the vectors, does not support ids, and only supports sequential adds. Therefore, if you need `add_with_ids`, use `"IDMap,Flat"` instead.

**Is memory a concern?**

Keep in mind that all Faiss indexes are stored in RAM. If exact results are not required and RAM is limited, the question becomes how to trade accuracy against speed within that limit. Consider the following cases:

**If you have plenty of RAM: use `"HNSWx"`**

If you have plenty of RAM, or if the dataset is small, HNSW is the best option: it is both fast and accurate. `x`, with 4 <= x <= 64, is the number of links per vector; the higher it is, the more accurate the results, but the more RAM is used. The speed-accuracy trade-off is set through the `efSearch` parameter, and the memory usage is (d * 4 + x * 2 * 4) bytes per vector.

HNSW only supports sequential adds (not `add_with_ids`); as above, prefix it with `IDMap` if you need ids. HNSW does not require training and does not support removing vectors from the index.

**If memory is not too much of a concern: use `"...,Flat"`**

"..." It refers to the data set clustering operation that needs to be done . After clustering ,“Flat” Just put these vectors in the bucket , So there's no Yasso , The amount of storage space occupied is equal to the original data set . Speed - Precision is measured by `nprobe`

Parameter control .

**If memory is quite limited: use `"PCARx,...,SQ8"`**

If storing whole vectors takes too much memory, there are two ways to reduce the usage:

- reduce the dimension to x with PCA;
- quantize each vector component to 1 byte with a scalar quantizer (SQ8).

As a result, the total storage is x bytes per vector.

**If memory is very limited: use `"OPQx_y,...,PQx"`**

`PQx` compresses the vectors with a product quantizer (PQ) that outputs x-byte codes; x is usually <= 64 (for larger codes, SQ is usually more accurate and faster). `OPQ` applies a linear transformation to the vectors to make them easier to compress. `y` is a dimension that must satisfy:

- y is a multiple of x;
- y <= d, where d is the dimension of the input vectors;
- y <= 4*x.
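To make the constraints concrete, here is a tiny pure-Python checker (the helper name `valid_opq_pq` is ours, not part of Faiss):

```python
def valid_opq_pq(x: int, y: int, d: int) -> bool:
    """Check the OPQx_y / PQx constraints listed above.
    x: PQ code size in bytes, y: OPQ output dimension, d: input dimension."""
    return y % x == 0 and y <= d and y <= 4 * x

ok = valid_opq_pq(16, 64, 128)             # e.g. "OPQ16_64,...,PQ16" on 128-dim data
bad_multiple = valid_opq_pq(16, 72, 128)   # 72 is not a multiple of 16
bad_bound = valid_opq_pq(16, 96, 128)      # 96 > 4*16
```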

**How big is the dataset?**

This question fills in the "..." part above. The dataset is clustered into buckets, and at search time only a fraction of the buckets (`nprobe` of them) is visited. The clustering is performed on a representative sample of the dataset vectors, typically a subset of the dataset. We give the optimal sizes for this sample below:

**If below 1M vectors: `"...,IVFx,..."`**

where x is between 4*sqrt(N) and 16*sqrt(N), with N the size of the dataset. This simply clusters the vectors with k-means. You need between 30*x and 256*x vectors for training (the more the better).
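The sizing rule, spelled out (the helper names are ours, for illustration only):

```python
import math

def ivf_nlist_range(N: int):
    """Rule-of-thumb range for x in "IVFx", per the guideline above."""
    return 4 * math.isqrt(N), 16 * math.isqrt(N)

def ivf_training_size(x: int):
    """You need between 30*x and 256*x training vectors."""
    return 30 * x, 256 * x

lo, hi = ivf_nlist_range(1_000_000)   # N = 1M
tmin, tmax = ivf_training_size(lo)
```

For N = 1M this suggests x between 4,000 and 16,000, and at least 120,000 training vectors for x = 4,000.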

**If 1M-10M: `"...,IMI2x10,..."`**

Here x is the literal character "x", not a number.

Like IVF, IMI runs k-means on the training set, with 2^10 centroids, but it does so independently on the first and the second half of each vector. This raises the number of clusters to 2^(2*10). You need about 64*(2^10) vectors for training.
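The arithmetic above, spelled out:

```python
# "IMI2x10": 2**10 centroids per half-vector, trained independently.
bits = 10
centroids_per_half = 2 ** bits            # 1024 centroids on each half
n_inverted_lists = 2 ** (2 * bits)        # total number of buckets
training_vectors_needed = 64 * 2 ** bits  # roughly 64 * 1024 vectors
```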

**If 10M-100M: `"...,IMI2x12,..."`**

Same as above, with 10 replaced by 12.

**If 100M-1B: `"...,IMI2x14,..."`**

Same as above, with 10 replaced by 14.

https://www.cnblogs.com/imagezy/p/8329229.html

Copyright notice

This article was written by [osc_ bk3qeylz]; please include a link to the original when reposting. Thanks.
