Choosing an index is not obvious. The following questions can help you choose one.
Do you need exact results?
IndexFlatL2 is the only index that guarantees exact results. It provides the baseline against which the other indexes are compared. It does not compress the vectors, and it does not support adding vectors with IDs (add_with_ids); vectors can only be added sequentially. Therefore, if you need add_with_ids, use "IDMap,Flat".
Is memory a concern?
Note that all Faiss indexes are stored in RAM. The following assumes that exact results are not required and that RAM is the limiting factor, so the goal is to get the best precision-speed trade-off within the memory limit. Consider the following cases:
Plenty of memory: use "HNSWx"
If you have a lot of RAM or the dataset is small, HNSW is the best choice: it is fast and accurate. x (4 <= x <= 64) is the number of links per vector; higher values are more accurate but use more RAM. The speed-precision trade-off is tuned with the efSearch parameter. Memory usage is (d * 4 + x * 2 * 4) bytes per vector.
HNSW only supports sequential adds (not add_with_ids), so here again, prefix it with IDMap if needed. HNSW requires no training and does not support removing vectors from the index.
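As a rough illustration of the memory formula above, the following sketch estimates the RAM needed by an "HNSWx" index. The function name `hnsw_bytes_per_vector` and the example sizes are made up for illustration, and the estimate ignores any bookkeeping overhead beyond the vectors and their links.

```python
def hnsw_bytes_per_vector(d: int, x: int) -> int:
    """Approximate bytes per vector for "HNSWx": d float32 values plus 2*x 4-byte links."""
    if not 4 <= x <= 64:
        raise ValueError("x is expected to be in the range 4..64")
    return d * 4 + x * 2 * 4

# Hypothetical example: one million 128-dimensional vectors, 32 links per vector.
d, x, n = 128, 32, 1_000_000
per_vector = hnsw_bytes_per_vector(d, x)
total_gib = per_vector * n / 2**30
print(f'"HNSW{x}": {per_vector} bytes/vector, about {total_gib:.2f} GiB for {n} vectors')
```

With these numbers the links add 256 bytes to the 512 bytes of raw vector data, so halving x would save a quarter of the per-vector cost at the price of some accuracy.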
Memory is somewhat of a concern: use "...,Flat"
"..." stands for a clustering of the dataset that has to be performed beforehand (see "How big is the dataset?" below). After clustering, "Flat" simply places the vectors into buckets without compressing them, so the storage size is the same as that of the original dataset. The speed-precision trade-off is controlled by the nprobe parameter.
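To show why nprobe trades speed for precision, here is a toy pure-Python model of bucketed search. It is not Faiss code: the centroids are given by hand instead of being trained with k-means, and all names (`nearest`, `build_buckets`, `search`) are hypothetical.

```python
import math

def nearest(centroids, v, n):
    """Indices of the n centroids closest to v (Euclidean distance)."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(centroids[c], v))
    return order[:n]

def build_buckets(centroids, vectors):
    """Assign every vector id to the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        buckets[nearest(centroids, v, 1)[0]].append(i)
    return buckets

def search(centroids, buckets, vectors, q, nprobe):
    """Scan only the nprobe buckets closest to q and return the best vector id."""
    candidates = [i for c in nearest(centroids, q, nprobe) for i in buckets[c]]
    return min(candidates, key=lambda i: math.dist(vectors[i], q))

centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = [(1.0, 0.0), (4.9, 0.0), (5.2, 0.0), (9.0, 0.0)]
buckets = build_buckets(centroids, vectors)

q = (5.02, 0.0)  # true nearest neighbour is vector 1, which sits in the other bucket
fast = search(centroids, buckets, vectors, q, nprobe=1)   # scans 2 vectors
exact = search(centroids, buckets, vectors, q, nprobe=2)  # scans all 4 vectors
print(fast, exact)
```

With nprobe=1 the query only sees the bucket around (10, 0) and misses its true neighbour just across the boundary; probing both buckets recovers it at twice the scanning cost.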
Memory is limited: use "PCARx,...,SQ8"
If storing the whole vectors takes too much memory, this combination reduces usage in two ways:
- PCA reduces the dimension to x;
- each vector component is scalar-quantized to 1 byte.
Therefore, the total storage is x bytes per vector.
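To make the saving concrete, the sketch below compares raw float32 storage with the x bytes per vector that remain after PCA and 8-bit scalar quantization. The function name and the example sizes (d = 128, x = 64, ten million vectors) are illustrative assumptions, and the figures ignore the clustering structure and any other index overhead.

```python
def pca_sq8_storage(d: int, x: int, n: int) -> tuple[int, int]:
    """Return (raw_bytes, compressed_bytes) for n vectors of dimension d."""
    raw = n * d * 4       # float32: 4 bytes per component
    compressed = n * x    # x components after PCA, 1 byte each after SQ8
    return raw, compressed

raw, compressed = pca_sq8_storage(d=128, x=64, n=10_000_000)
print(f"raw: {raw / 1e9:.2f} GB, PCAR64 + SQ8: {compressed / 1e9:.2f} GB")
print(f"reduction factor: {raw // compressed}x")
```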
Memory is very limited: use "OPQx_y,...,PQx"
PQx compresses each vector with a product quantizer that outputs x-byte codes. x is generally at most 64; for larger codes, SQ is usually as accurate and faster. OPQ is a linear transformation of the vectors that makes them easier to compress. y is a dimension that should satisfy:
- y is a multiple of x;
- y <= d, where d is the dimension of the input vectors;
- y <= 4*x
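The constraints on y can be enumerated mechanically. The sketch below assumes that y has to be a multiple of x (the product quantizer splits the y dimensions into x sub-vectors of equal size) and treats the other two conditions as upper bounds; `opq_output_dims` is a hypothetical helper, not a Faiss function.

```python
def opq_output_dims(d: int, x: int) -> list[int]:
    """Candidate values of y for "OPQx_y,...,PQx": multiples of x up to min(d, 4*x)."""
    upper = min(d, 4 * x)
    return list(range(x, upper + 1, x))

d, x = 128, 16  # hypothetical input dimension and code size in bytes
candidates = opq_output_dims(d, x)
print(candidates)
print(f'largest candidate as a factory string: "OPQ{x}_{candidates[-1]},...,PQ{x}"')
```

When d is small the y <= d bound takes over: for d = 40 and x = 16 only 16 and 32 remain.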
How big is the dataset?
This question determines the "..." part above. The dataset is clustered into buckets, and only a fraction of the buckets (nprobe of them) is visited at search time. The clustering is trained on a representative sample of the dataset vectors. The recommended size of that sample is given for each case:
If below 1M vectors: "...,IVFx,..."
Here x is between 4*sqrt(N) and 16*sqrt(N), where N is the size of the dataset. This simply clusters the vectors with k-means. You need between 30*x and 256*x vectors for training (the more the better).
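As a worked example of this sizing rule, the sketch below reads it as "x between 4*sqrt(N) and 16*sqrt(N), trained on 30*x to 256*x vectors". The function names and the dataset size of 250,000 are assumptions chosen for illustration.

```python
import math

def ivf_cell_range(n: int) -> tuple[int, int]:
    """Suggested range for x in "IVFx": 4*sqrt(N) to 16*sqrt(N)."""
    root = math.sqrt(n)
    return round(4 * root), round(16 * root)

def training_size_range(x: int) -> tuple[int, int]:
    """Suggested number of training vectors for x k-means centroids."""
    return 30 * x, 256 * x

n = 250_000
low, high = ivf_cell_range(n)
train_min, train_max = training_size_range(low)
print(f"N={n}: x between {low} and {high}")
print(f"x={low}: train on {train_min} to {train_max} vectors")
```

Here the upper training bound exceeds the dataset itself, in which case training on all available vectors is the natural choice.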
If 1M-10M: "...,IMI2x10,..."
Here x is the literal character x, not a number.
IMI also runs k-means on the training vectors, with 2^10 centroids, but independently on the first and second halves of each vector. This raises the number of clusters to 2^(2*10). You need about 64*2^10 vectors for training.
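The arithmetic behind these IMI figures can be checked with a short sketch. It assumes the cluster count is the product of the two per-half codebooks, that is 2^(2*nbits), and that about 64 training vectors are wanted per centroid; `imi_sizes` is a hypothetical helper.

```python
def imi_sizes(nbits: int) -> tuple[int, int]:
    """Return (number_of_clusters, training_vectors) for "IMI2x<nbits>"."""
    centroids_per_half = 2 ** nbits
    clusters = centroids_per_half ** 2   # every pair of half-centroids is a cluster
    training = 64 * centroids_per_half
    return clusters, training

for nbits in (10, 12, 14):
    clusters, training = imi_sizes(nbits)
    print(f"IMI2x{nbits}: {clusters:,} clusters, about {training:,} training vectors")
```

The training set grows only with the per-half codebook size, which is why a million-cluster partition can be trained on tens of thousands of vectors.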
If 10M-100M: "...,IMI2x12,..." (tens of millions to hundreds of millions)
Same as above, with 10 replaced by 12.
If 100M-1B: "...,IMI2x14,..." (hundreds of millions to one billion)
Same as above, with 10 replaced by 14.
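The size thresholds can be folded into a small helper that fills in the "..." placeholder. This is only a sketch under stated assumptions: the 1M-10M range is taken to map to "IMI2x10", the lower end of the suggested IVF range is used, boundary values fall into the larger category, and sizes beyond 1B (which the text does not cover) reuse "IMI2x14". The name `clustering_spec` and the "OPQ16_64,...,PQ16" template are illustrative.

```python
import math

def clustering_spec(n: int) -> str:
    """Pick the clustering part of the index string from the dataset size n."""
    if n < 1_000_000:
        return f"IVF{round(4 * math.sqrt(n))}"  # lower end of 4*sqrt(N)..16*sqrt(N)
    if n < 10_000_000:
        return "IMI2x10"
    if n < 100_000_000:
        return "IMI2x12"
    return "IMI2x14"  # assumption: also used above 1B

template = "OPQ16_64,...,PQ16"  # memory option chosen earlier, "..." still open
specs = [template.replace("...", clustering_spec(n))
         for n in (250_000, 5_000_000, 500_000_000)]
for spec in specs:
    print(spec)
```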