The simplest way to implement k-nearest-neighbour search is linear scan: traverse the entire training set and compute the distance between every training instance and the input instance. The time complexity is O(N), so when the number of training instances N is large, this becomes very time-consuming. To make k-nearest-neighbour search more efficient, special data structures can be used to store the training data and reduce the number of distance computations. The most common such structure is the kd-tree.
A kd-tree (k-dimensional tree) is a binary tree that stores k-dimensional data points (this k is not the k of k-nearest neighbours). Building a kd-tree on a k-dimensional dataset amounts to repeatedly partitioning the k-dimensional space with hyperplanes perpendicular to the coordinate axes, producing a series of k-dimensional hyperrectangular regions; each node in the tree corresponds to one such hyperrectangle.
A binary search tree (BST) has the following properties:

- If its left subtree is non-empty, the values of all nodes in the left subtree are less than the value of its root node;
- If its right subtree is non-empty, the values of all nodes in the right subtree are greater than the value of its root node;
- Its left and right subtrees are themselves binary search trees.
If the data are one-dimensional, the process of constructing a BST is as follows:

- Sort the data and take the middle element as the root node;
- The data before the middle element form the left subtree of the root, and the data after it form the right subtree;
- Recursively construct the left and right subtrees.
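The three steps above can be sketched in a few lines of Python. `Node` and `build_bst` are hypothetical helper names for illustration, not code from the book:

```python
class Node:
    """A hypothetical BST node: a value plus left/right children."""
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def build_bst(data):
    """Build a balanced BST: sort, take the median as root, recurse."""
    if not data:
        return None
    data = sorted(data)
    mid = len(data) // 2
    return Node(
        value=data[mid],
        left=build_bst(data[:mid]),    # elements before the median
        right=build_bst(data[mid + 1:]),  # elements after the median
    )

root = build_bst([7, 2, 9, 4, 5, 1, 8])
print(root.value)  # the median of the sorted data, 5
```

Taking the median at every level keeps the tree balanced, so searches cost O(log N) on average.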
Generalizing the BST from one-dimensional space to k-dimensional space changes the algorithms for building and querying the tree somewhat.
In Mr. Li Hang's book *Statistical Learning Methods*, the splitting axes are chosen in rotation: the first split uses the first dimension, the second split uses the second dimension, and so on, so that a node at depth j splits on dimension l = j mod k + 1 (mod denotes the modulo operation). This continues until all instance points have been split.
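The cycling rule is a one-line formula. A minimal sketch, assuming depth counts from 0 and dimensions are indexed from 0:

```python
def split_dimension(depth, k):
    """Axis-cycling rule: at tree depth `depth` in k-dimensional
    space, split on dimension (depth mod k)."""
    return depth % k

# In a 2-D space the splits alternate between the x- and y-axis:
print([split_dimension(d, 2) for d in range(4)])  # [0, 1, 0, 1]
```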
However, choosing the axis by simple rotation can cause problems. A blog I read gives an example:
Blog link: Kd-Tree algorithm principle and open-source implementation code
Therefore, to avoid this problem, the maximum-variance rule is used here as the criterion for choosing the splitting dimension: whichever dimension has the largest variance in the current data becomes the splitting dimension.
As with the one-dimensional BST, the data are first sorted along the current splitting dimension, and the middle point is chosen as the root node. Points whose value in that dimension is less than the root's go to the left subtree, and points whose value is greater go to the right subtree. In this way each subtree corresponds to a hyperrectangular region: in the left subtree the splitting dimension's values are less than the middle value, and in the right subtree they are greater.
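The construction just described can be sketched as follows. `KdNode` and `build_kdtree` are hypothetical helper names, and the maximum-variance split is the rule from the paragraph above rather than the book's axis-rotation rule:

```python
import statistics

class KdNode:
    """A hypothetical kd-tree node."""
    def __init__(self, point, dim, left=None, right=None):
        self.point = point  # the splitting point (root of this subtree)
        self.dim = dim      # the dimension this node splits on
        self.left = left
        self.right = right

def build_kdtree(points):
    """Build a kd-tree, splitting on the dimension with the largest
    variance at the median point of that dimension."""
    if not points:
        return None
    k = len(points[0])
    # Pick the dimension with the largest variance in the current data.
    dim = max(range(k),
              key=lambda d: statistics.pvariance(p[d] for p in points))
    points = sorted(points, key=lambda p: p[dim])
    mid = len(points) // 2
    return KdNode(
        point=points[mid],
        dim=dim,
        left=build_kdtree(points[:mid]),
        right=build_kdtree(points[mid + 1:]),
    )

# The classic 2-D example dataset from the book:
root = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(root.point, root.dim)  # (7, 2) 0 — x varies more, so split on x
```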
Here is an example from Mr. Li Hang's *Statistical Learning Methods*:
Take the nearest neighbour (k=1) as an example. Nearest-neighbour search uses the constructed kd-tree to find the instance point closest to the input instance. The method extends naturally to k nearest neighbours: the idea and the algorithm are the same as for the single nearest neighbour, with only a few details changed.
4.2 How do we judge whether the region of the parent's other child node contains a point closer to x?
Geometrically, this means checking whether the hypersphere centred at x, with radius equal to the current nearest distance, intersects the hyperrectangular region corresponding to the other child node.
In the implementation, we can compute the distance from x to the region corresponding to the other child node.
During tree construction, record each subtree's splitting dimension dim and splitting value m (i.e. the value of the root node in dimension dim). The distance from x to the other subtree's region is then |x(dim) - m|. If |x(dim) - m| is greater than the current nearest distance, the other node's region does not intersect the hypersphere. If it is less, they do intersect: a closer point may exist there, so we must descend into the other child node's region.
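A minimal sketch of 1-nearest-neighbour search with exactly this backtracking test: the other child's region is visited only when |x(dim) - m| is smaller than the current nearest distance. `KdNode` and `nearest` are hypothetical names for illustration, and the tree in the usage example is hand-built from the classic example dataset:

```python
import math

class KdNode:
    """A hypothetical kd-tree node: point, splitting dimension, children."""
    def __init__(self, point, dim, left=None, right=None):
        self.point, self.dim = point, dim
        self.left, self.right = left, right

def nearest(node, x, best=None):
    """Return the training point nearest to x (k=1)."""
    if node is None:
        return best
    if best is None or math.dist(node.point, x) < math.dist(best, x):
        best = node.point
    # Descend first into the side of the splitting plane containing x.
    diff = x[node.dim] - node.point[node.dim]  # signed |x(dim) - m|
    near, far = ((node.left, node.right) if diff < 0
                 else (node.right, node.left))
    best = nearest(near, x, best)
    # Backtrack: visit the other side only if the splitting plane is
    # closer than the current nearest distance, i.e. the hypersphere
    # around x intersects the other child's region.
    if abs(diff) < math.dist(best, x):
        best = nearest(far, x, best)
    return best

# Hand-built tree for {(2,3),(5,4),(9,6),(4,7),(8,1),(7,2)}:
tree = KdNode((7, 2), 0,
              left=KdNode((5, 4), 1, KdNode((2, 3), 0), KdNode((4, 7), 0)),
              right=KdNode((9, 6), 1, KdNode((8, 1), 0)))
print(nearest(tree, (6, 3)))  # (7, 2)
```

Note that (5, 4) and (7, 2) are both at distance √2 from (6, 3); this sketch keeps the first one found.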
Take the kd-tree constructed above as an example, with input instance (6,3).
The search process is as follows: