 So-called “K a near neighbor （K-nearest neighbor,K-NN）”, seeing the name of a thing one thinks of its function , refer to “K The nearest neighbor ”, It's a supervised learning method .

## 1. working principle

A brief introduction K The working mechanism of the nearest neighbor algorithm ： First, give a set of training sets , As a reference to the algorithm ; And then give specific test objects , That is, test data without labels , The algorithm will find... In the training set In a sense The closest to it is K Training data , And according to this K A label of training data is used to determine the type of test data （ Classification problem ） Or numerical value （ The return question ）.

from K The principle of the nearest neighbor algorithm can be seen , After getting the training data, there is no so-called “ Training process ”, We just need “ wait every day under the tree , in the hope that a hare would kill itself by crashing into a tree trunk ”, Wait for external input of a test data , Then compare with the training data . This is also K The nearest neighbor algorithm is called “ Lazy study （lazy learning）” The reason for the algorithm .

### 1.1 Distance metric

Here's what we're talking about “ In a sense ”, Actually, it means something Distance metric Methods . So-called “ distance ”, It can be simply understood as the difference between two data , We can quantitatively find out the difference between two data by using distance measurement . One of the most familiar distance measurement methods is “ Euclidean distance ”, That's what we're most familiar with “ Linear distance ”. Use this formula , We can accurately measure the difference between any given data by numerical value .

And then there's the Manhattan distance 、（ Chess ） Chessboard distance and other distance measurement methods . The Manhattan distance is actually the sum of the differences between the two points , It's like driving in a well planned block .

### 1.2 Intuitively explain

informally ,K The central idea of the nearest neighbor algorithm is that we all understand 、 Life experience that we all recognize ： People gather by analogy , birds of a feather flock together .

For a particular person , We should have a comprehensive understanding of him as soon as possible , We are used to observing his circle of friends （ cough , Note that it doesn't mean wechat circle of friends ）, The few people he interacts with most can basically reflect most of his own characteristics .

It's the same for other objects in the real world , The abstraction of these objects —— data —— No exception, of course . For a given test data , We can look at the data with the highest similarity , The general characteristics of these data , We naturally think that it will also be the feature of this test data .

### 1.3 About K value

K The selection of values seems to be more casual , But it's also a science . You could say yes K A super parameter of the nearest neighbor algorithm , That is, it was set manually by the researchers 、 Parameters that are not automatically adjusted for the algorithm .

K   It's not good to choose too much . The most extreme situation , If you don't mind setting it directly K =  N, That is the total number of samples of training data , No matter what kind of test data is given at this time , They all come to the same conclusion —— The algorithm will directly take the most existing tags in the training data samples as the tags of the test data . thus , The difference between the data is completely erased , It doesn't show any value of classification or regression .“ too ” It's not the socialism we're after .

that  K Is there any problem in choosing too small ？ Yes, there are. . Also consider the most extreme situations ——K = 1, This is the same. K   A special case of the nearest neighbor algorithm , So-called “ Nearest neighbor algorithm ”. When the label of test data only depends on the closest training data , We all know that something must have gone wrong —— The deviation of a certain data is just an example , But it happened to affect our judgment of the test data .

From the above analysis, we can draw a social lesson ： Dictatorship is not good , The blind expansion of democracy is also a lasting disaster ~ Centralized democracy is a good thing .

Coming back to the book , Seriously , Through the above analysis , We can come to a conclusion ：K The value of is not suitable to be too large , But it should not be too small . This is more troublesome , So on the one hand ,K Setting values is an empirical job ; On the other hand , We can use the training data at hand , Randomly set aside a part as test data for repeated testing , So as to select a more suitable K value . And usually , This “ Appropriate K value ” Not too big .

### 1.4 Decision making

K There are many decision-making methods of nearest neighbor algorithm , Let's briefly introduce two ways that apply to different problems .

1） Classification problem

For the classification problem , It is generally used Majority voting system , in other words ,K In a neighborhood , The tags with the largest proportion are the tags of test data .

2） The return question

For the return question , May adopt Weighted average , in other words ,K In a neighborhood , Different weights are given according to the distance from the test data , Then average .

## 2. Algorithm implementation

To reduce the burden of reading , In this paper, we only give the results of the classification problem K Neighbor algorithm implementation . The realization of the regression problem remains for readers to explore .

In addition, due to the limitation of the author's level , If there is any problem with the code, I hope the readers will give me their correction .

Strictly speaking , According to Mr. Li Hang 《 Statistical learning method 》 The contents of the book , Official application  K The nearest neighbor algorithm should also build kd   Trees , In order to search the nearest neighbor , Improve algorithm efficiency . After all , In real industrial applications , The algorithm faces millions of data and above , If we don't use well-designed data structure to improve retrieval efficiency , It will bring about a substantial increase in costs and a competitive disadvantage .

But here we are just trying to understand K Nearest neighbor algorithm , I have a perceptual understanding of this algorithm , So it doesn't involve kd The implementation of the tree （ In fact, the author thinks this part is a bit troublesome —— Of course, it is also to reduce the reading burden of readers , Otherwise, we need to introduce it again kd The principle of trees , It's a bit lengthy ）.

### 2.1 Divide the data

We use a data set that is often used as a benchmark for machine learning algorithms —— Iris data set , As an example dataset for this article .

Readers can learn from https://www.kaggle.com/uciml/iris download , It can also be obtained from the supporting code of this article .

Let's briefly introduce this iris dataset , All in all  150 This is the size data of iris , It belongs to three different Iris species , namely setosa、versicolor、virginica Each of the three Iris species 50   Copy of the data （ Limited by the author's level , There is no translation for specific species names , To avoid misunderstanding , We all know that there are three different kinds of iris ）.

The size data include calyx length 、 Calyx wide 、 The petals are long 、 The petal width is four , The unit is centimeter , Corresponding to each piece of data , There is a corresponding category label and the data itself in the dataset id. Some data examples are as follows ：

`Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species1,5.1,3.5,1.4,0.2,Iris-setosa2,4.9,3.0,1.4,0.2,Iris-setosa...51,7.0,3.2,4.7,1.4,Iris-versicolor52,6.4,3.2,4.5,1.5,Iris-versicolor...101,6.3,3.3,6.0,2.5,Iris-virginica102,5.8,2.7,5.1,1.9,Iris-virginica...`

What we're going to do is we're going to do it based on a given four item length data , adopt K Nearest neighbor algorithm to determine which kind of iris an instance belongs to .

And now the problem is , We only have one data set , Can't find another real iris data . So we need to divide the original data set , Use part of it as a training set , The other small part is a test set . The ratio is up to the reader , We choose  4:1 The proportion of , That is to say, choose from each iris 40 Data as a training set , Set aside 10 As a test set .

Be careful , Here, the selection of training set and test set should be random , So as to avoid unnecessary human error .

First , Reading data ：

`import pandas as pdiris = pd.read_csv('iris.csv').to_numpy()`

`pd.read_csv()` Read CSV What is returned after the file is a DataFrame Data of type , Use `to_numpy()` Method can be converted to Numpy Array .

Then initialize the container of each data ：

`train_data = [] #  Training data , Press... With the test data  4:1  Proportion Division test_data = [] #  Test data train_target = [] #  The training data corresponds to the label test_target = [] #  The test data corresponds to the label `

Because there are three categories , Namely 0~49,50~99,100~149, So use the loop , And set an offset `offset`, On this basis, slice the array ：

``import numpy as np``for i in range(3):``    offset = 50 * i``    data = iris[offset+0:offset+50, :]``    np.random.shuffle(data) #  Random disruption on the spot ``    train_data.extend(data[0:40, 1:5].tolist())``    train_target.extend(data[0:40, 5].tolist())``    test_data.extend(data[40:, 1:5].tolist())``    test_target.extend(data[40:, 5].tolist())``

For the convenience of loading data into the prepared container , We use `tolist()` Method , Convert sliced data into a list .

And then, for ease of use Numpy Performance of （ It doesn't really matter here ） And its various functions （ Main cause ）, We're going to convert that data into Numpy Array ：

`train_data = np.array(train_data)test_data = np.array(test_data)train_target = np.array(train_target)test_target = np.array(test_target)`

Come here , The data we need is ready .

### 2.2 Calculated distance

In this step, we calculate the distance between the test data and the training data .

Get ready first “ distance ” The container of ：

`distance = []`

Then traverse each data in the test set ; For each test data , We need to traverse every data in the training set , Distance measure we choose Euclidean distance ：

``for i in range(len(test_data)):``    sub_dist = []``    for j in range(len(train_data)):``        dif_array = test_data[i] - train_data[j]``        dist = np.sqrt(np.sum(dif_array * dif_array))``        sub_dist.append(dist)`` ``    distance.append(sub_dist)``distance = np.array(distance)``

among ,`sub_dist` Used to store test data for specific purposes , Corresponding to each training data “ distance ”, Save as a one-dimensional list .`distance` It is made up of several `sub_dist` A two-dimensional list of . Finally, put `distance` It also translates into Numpy Array .

The step of distance calculation is completed .

### 2.3 Solution result

alike , First, we initialize the container for the results ：

`results = []`

This step needs to be used “K a near neighbor ” In this K 了 . To make the program more flexible , We choose to be specified by the user at run time K value ：

`K = int(input(" Please enter  K  value ："))`

And then according to `distance` Size , You can know how many test data there are in total , Use `np.argsort()` Get the first... After ascending sort K An index of distance . Then use the array of indexes to filter the labels of training data . Finally using `Counter` Class `most_common()` Method to get the most frequent element , Add the prepared container ：

`for i in range(len(distance)):    index = np.argsort(distance[i])[:K]    result = train_target[index]     species = Counter(result).most_common(1)     results.append(species)`

here ,`results` That is to say, it has been preserved in China K All the results of the nearest neighbor algorithm .

### 2.4 Evaluate the results

First initialize a positive and negative counter ：

`right = 0`

Take out the expected and actual categories of test data respectively , And print ; Determine whether the expected category is consistent with the actual category , Depending on the match, decide whether to increase the value of the calculator ：

`for i in range(len(results)):    print('Right Species = ', test_target[i], \        ', \tReal Species = ', results[i])    if results[i] == test_target[i]:        right += 1`

such , We can also get K The accuracy of the nearest neighbor algorithm , And print ：

`right_rate = right / len(results)print("Right Rate: ", right_rate)`

To this step , The implementation of the whole algorithm is completed .

Next we can run it and see .

### 2.5 Run the example

Here is an example of the result of the run ：

`PS D:\justdopython\KNN> python KNN.py-------------------------------------------------------------------------------- Please enter  K  value ：5--------------------------------------------------------------------------------Right Species =  Iris-setosa ,  Real Species =  Iris-setosaRight Species =  Iris-setosa ,  Real Species =  Iris-setosaRight Species =  Iris-setosa ,  Real Species =  Iris-setosaRight Species =  Iris-setosa ,  Real Species =  Iris-setosaRight Species =  Iris-setosa ,  Real Species =  Iris-setosaRight Species =  Iris-setosa ,  Real Species =  Iris-setosaRight Species =  Iris-setosa ,  Real Species =  Iris-setosaRight Species =  Iris-setosa ,  Real Species =  Iris-setosaRight Species =  Iris-setosa ,  Real Species =  Iris-setosaRight Species =  Iris-setosa ,  Real Species =  Iris-setosaRight Species =  Iris-versicolor ,      Real Species =  Iris-versicolorRight Species =  Iris-versicolor ,      Real Species =  Iris-versicolorRight Species =  Iris-versicolor ,      Real Species =  Iris-versicolorRight Species =  Iris-versicolor ,      Real Species =  Iris-virginicaRight Species =  Iris-versicolor ,      Real Species =  Iris-versicolorRight Species =  Iris-versicolor ,      Real Species =  Iris-versicolorRight Species =  Iris-versicolor ,      Real Species =  Iris-versicolorRight Species =  Iris-versicolor ,      Real Species =  Iris-virginicaRight Species =  Iris-versicolor ,      Real Species =  Iris-versicolorRight Species =  Iris-versicolor ,      Real Species =  Iris-versicolorRight Species =  Iris-virginica ,       Real Species =  Iris-virginicaRight Species =  Iris-virginica ,       Real Species =  Iris-virginicaRight Species =  Iris-virginica ,       Real Species =  Iris-virginicaRight Species =  Iris-virginica ,       Real Species =  Iris-virginicaRight Species =  Iris-virginica ,       Real Species =  Iris-virginicaRight Species =  Iris-virginica ,       Real Species =  Iris-virginicaRight Species =  Iris-virginica ,       Real Species =  Iris-virginicaRight Species =  Iris-virginica ,       Real Species =  Iris-virginicaRight Species =  Iris-virginica ,       Real Species =  Iris-virginicaRight Species =  Iris-virginica ,       Real Species =  Iris-virginica--------------------------------------------------------------------------------Right Rate:  0.9333333333333333--------------------------------------------------------------------------------`

## 3. summary

This article gives you a preliminary understanding of the so-called K Nearest neighbor algorithm , And give the concrete algorithm step by step . If some students feel that it's a bit boring to pile up the code slowly , In this paper “ Sample code ” The complete implementation of the above code is also given in , You can go to “ Sample code ” Download the corresponding code directly and run it to see the result , It might be more interesting .

When you get interested , I hope readers can re implement the code described in this article .

in addition , Considering that there may be some differences in the size range of sepals and petals , In fact, there needs to be a “ normalization ” Steps for , This step is omitted in this paper . Readers can do it on their own, and then look at the results .

Sample code ：https://github.com/JustDoPython/python-100-day/tree/master/

## Reference material

https://book.douban.com/subject/26708119/https://book.douban.com/subject/10590856/https://www.kaggle.com/uciml/irishttps://blog.csdn.net/pipisorry/article/details/51822775https://cloud.tencent.com/developer/ask/153164 Series articles