## Preface

The simplest way to implement k-nearest neighbor search is linear scanning: traverse the entire training set and compute the distance between every training instance and the input instance. The time complexity is O(N), so when the number of training instances N is large this becomes very time consuming. To make k-nearest neighbor search more efficient, special data structures can be used to store the training data and reduce the number of distance computations. The most common method is the kd-tree.
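For concreteness, here is a minimal linear-scan sketch in plain Python (the point set and query are made up for illustration):

```python
import math

def knn_linear_scan(train, query, k=1):
    """Brute-force k-nearest-neighbor search: one distance per training point, O(N)."""
    dists = [(math.dist(p, query), p) for p in train]  # Euclidean distance to every point
    dists.sort(key=lambda t: t[0])                     # sort by distance
    return [p for _, p in dists[:k]]                   # keep the k closest points

# Illustrative 2-D data and query
train = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
print(knn_linear_scan(train, (6, 3), k=2))
```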

## 1. kd-Tree

A kd-tree (k-dimensional tree) is a binary tree that stores k-dimensional data points (this k is not the k of k-nearest neighbor). Building a kd-tree on a k-dimensional data set amounts to repeatedly splitting the k-dimensional space with hyperplanes perpendicular to the coordinate axes, producing a series of k-dimensional hyperrectangular regions. Each node of the tree corresponds to one such k-dimensional hyperrectangle.

### 1.1 BST (Binary Search Tree)

A binary search tree is a binary tree with the following properties:

- If its left subtree is not empty, the values of all nodes in the left subtree are less than the value of its root node;
- If its right subtree is not empty, the values of all nodes in the right subtree are greater than the value of its root node;
- Its left and right subtrees are themselves binary search trees.

For one-dimensional data, a BST can be constructed as follows (a minimal code sketch follows the list):

- Sort the data and take the middle element as the root node;
- The data before the middle element form the root's left subtree, and the data after it form the right subtree;
- Recursively construct the left and right subtrees in the same way.
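A minimal sketch of this construction in Python (names and data are illustrative):

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def build_bst(sorted_data):
    """Build a balanced BST from already-sorted 1-D data by recursive median splitting."""
    if not sorted_data:
        return None
    mid = len(sorted_data) // 2                            # middle element becomes the root
    return Node(sorted_data[mid],
                left=build_bst(sorted_data[:mid]),         # values before the median
                right=build_bst(sorted_data[mid + 1:]))    # values after the median

root = build_bst(sorted([7, 2, 9, 4, 8, 5]))
print(root.value)   # the median of the sorted data
```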

When the BST is generalized from one-dimensional space to k-dimensional space, the algorithms for creating and querying the tree change somewhat.

## 2. The structure of the kd-tree

### 2.1 Selection of the cutting dimension dim

In Li Hang's book *Statistical Learning Methods*, the coordinate axes are selected in turn to cut the space: the first split uses the first dimension, the second split uses the second dimension, and so on.

Here mod denotes the modulo operation (the splitting dimension simply cycles through the k axes with the tree depth), and splitting continues until all the instance points have been partitioned.
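A tiny sketch of that rule (0-indexed dimensions assumed):

```python
def cyclic_split_dim(depth, k):
    """Cyclic axis selection: a node at tree depth `depth` splits on dimension depth mod k."""
    return depth % k
```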

However, selecting the axes in simple rotation can cause problems. A blog post gives an example:

Blog link: Kd-Tree Algorithm Principles and Open-Source Implementation Code

Therefore, to avoid this problem, the maximum-variance method is used here as the criterion for selecting the cutting dimension: whichever dimension has the largest variance in the current data is chosen as the splitting dimension.
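A small sketch of this criterion (plain Python, illustrative data):

```python
def max_variance_dim(points):
    """Choose the splitting dimension as the one with the largest variance."""
    k = len(points[0])
    best_dim, best_var = 0, -1.0
    for dim in range(k):
        vals = [p[dim] for p in points]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)   # population variance
        if var > best_var:
            best_dim, best_var = dim, var
    return best_dim

print(max_variance_dim([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]))  # 0: x spreads more
```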

### 2.2 The basis for dividing the left and right subtrees

As with a one-dimensional BST, the data are first sorted along the current cutting dimension and the median value is chosen as the root node. Points whose value in the current dimension is less than the root's go into the left subtree, and points whose value is greater go into the right subtree. In this way each subtree corresponds to a hyperrectangular region: in the left subtree the value in that dimension is smaller than the median, and in the right subtree it is larger.

## 3. kd-Tree construction algorithm

### 3.1 The algorithm process
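A compact sketch of the whole construction, combining the two rules from Section 2 (plain Python; cyclic axis selection is used here for simplicity, and the max_variance_dim sketch above could be substituted):

```python
class KdNode:
    def __init__(self, point, dim, left=None, right=None):
        self.point = point    # the splitting point stored at this node
        self.dim = dim        # the dimension this node splits on
        self.left = left      # subtree with smaller values in dimension `dim`
        self.right = right    # subtree with larger values in dimension `dim`

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree: pick a splitting dimension, sort along it,
    store the median point at the root, and recurse on the two halves."""
    if not points:
        return None
    k = len(points[0])
    dim = depth % k                            # cyclic axis selection
    points = sorted(points, key=lambda p: p[dim])
    mid = len(points) // 2                     # median index
    return KdNode(points[mid], dim,
                  left=build_kdtree(points[:mid], depth + 1),
                  right=build_kdtree(points[mid + 1:], depth + 1))
```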

### 3.2 An example

This is an example from Li Hang's *Statistical Learning Methods*:
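Assuming the two-dimensional data set used in that example, T = {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)}, and reusing the build_kdtree sketch from Section 3.1, the tree can be built like this:

```python
# Data set from the book's kd-tree construction example (assumed here)
T = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
root = build_kdtree(T)           # build_kdtree from the sketch in Section 3.1
print(root.point, root.dim)      # median point along the first splitting dimension
```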

## 4. Nearest neighbor search with a kd-tree

Here the nearest neighbor (k=1) is taken as the example. Nearest neighbor search uses the constructed kd-tree to find the training instance closest to the input instance. The method extends to k nearest neighbors: the idea and the algorithm are the same as for the single nearest neighbor, with only a few details that need to be adjusted.

### 4.1 Nearest neighbor search algorithm
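A recursive sketch of the search (it assumes the KdNode/build_kdtree sketch from Section 3.1 and is not the book's exact pseudocode); the pruning test used in the backtracking step is explained in Section 4.2 below:

```python
import math

def nearest_neighbor(node, x, best=None):
    """Nearest-neighbor search on a kd-tree built by build_kdtree.
    `best` is a (distance, point) pair for the closest point found so far."""
    if node is None:
        return best
    d = math.dist(node.point, x)
    if best is None or d < best[0]:
        best = (d, node.point)                       # update the current nearest
    # Descend first into the child on the same side of the split as x.
    near, far = ((node.left, node.right) if x[node.dim] < node.point[node.dim]
                 else (node.right, node.left))
    best = nearest_neighbor(near, x, best)
    # Backtrack: visit the other side only if the splitting hyperplane is
    # closer to x than the current nearest distance (see Section 4.2).
    if abs(x[node.dim] - node.point[node.dim]) < best[0]:
        best = nearest_neighbor(far, x, best)
    return best
```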

### 4.2 How to judge whether the region of the parent node's other child contains a point closer to x?

Geometrically, this means checking whether the hypersphere centered at x, with radius equal to the current nearest distance, intersects the hyperrectangular region corresponding to the other child node.

In the implementation, we can compute the distance from x to the region corresponding to the other child node.

While constructing the tree, record each subtree's splitting dimension dim and splitting value m (that is, the value of that subtree's root node in dimension dim). The distance from x to that subtree's region is |x(dim) - m|. If |x(dim) - m| is greater than the current nearest distance, the other node's region does not intersect the hypersphere; if it is less, the two intersect, a closer point may exist there, and we descend into the other child node's region.
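In code, this test reduces to a single comparison (the helper name is hypothetical; notation as above):

```python
def other_side_may_contain_closer(x, dim, m, best_dist):
    """The other child's region can contain a closer point only if the
    splitting hyperplane is within the current nearest distance of x."""
    return abs(x[dim] - m) < best_dist
```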

### 4.3 An example

Take the kd-tree constructed above as an example, and let the input instance be (6,3).

The search process is as follows:
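Reusing the sketches above (the tree built on the data set of Section 3.2 and the nearest_neighbor sketch of Section 4.1), the query can be run like this:

```python
query = (6, 3)
dist, point = nearest_neighbor(root, query)   # `root` built in Section 3.2
print(point, dist)
# For the assumed data set, both (5,4) and (7,2) lie at distance sqrt(2) ≈ 1.41
# from (6,3); which one is reported depends on the traversal order.
```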