
The K-nearest neighbor (K-NN) algorithm is, as its name suggests, all about the "K nearest neighbors". It is a supervised learning method.

1. Working principle

A brief description of how the K-nearest neighbor algorithm works: first, a training set is given as the algorithm's reference. Then, for a specific test object, i.e. an unlabeled piece of test data, the algorithm finds the K training samples that are closest to it in some sense, and uses the labels of these K training samples to decide the class of the test data (a classification problem) or its value (a regression problem).

As the principle suggests, there is no real "training process" after the training data is obtained. We simply "sit by the tree stump and wait for a rabbit to run into it": we wait for a test sample to arrive, and only then compare it against the training data. This is why the K-nearest neighbor algorithm is called a "lazy learning" algorithm.
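
To make that workflow concrete, here is a minimal sketch of a predict-time-only K-NN classifier. The names (knn_predict, train_data, train_target) are purely illustrative; the step-by-step implementation actually used in this article follows in section 2.

import numpy as np
from collections import Counter

def knn_predict(train_data, train_target, x, k=5):
    # no training step: all the work happens here, at prediction time
    dists = np.sqrt(((train_data - x) ** 2).sum(axis=1))  # distance from x to every training sample
    nearest = np.argsort(dists)[:k]                        # indices of the k closest samples
    return Counter(train_target[nearest]).most_common(1)[0][0]  # majority vote among those k labels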

1.1 Distance metric

The phrase "in some sense" above really refers to a distance metric. A "distance" here can be understood simply as how different two data points are; a distance metric lets us quantify that difference. The most familiar distance metric is the Euclidean distance, i.e. the ordinary "straight-line" distance.

For two points x = (x₁, x₂, …, xₙ) and y = (y₁, y₂, …, yₙ), the Euclidean distance is d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²).

With this formula, we can measure the difference between any two data points as a number.

There are other distance metrics as well, such as the Manhattan distance and the (chess) chessboard distance. The Manhattan distance is the sum of the absolute coordinate differences between two points, like driving through a city laid out in a regular grid of blocks.
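
As a quick sketch of how these metrics differ, here is a small NumPy comparison; the two sample points are made up purely for illustration.

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance: 5.0
manhattan = np.sum(np.abs(a - b))          # sum of coordinate differences: 7.0
chebyshev = np.max(np.abs(a - b))          # chessboard distance: 4.0

print(euclidean, manhattan, chebyshev)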

1.2 An intuitive explanation

Informally, the central idea behind the K-nearest neighbor algorithm is a piece of common-sense everyday experience: birds of a feather flock together.

If we want to get a rough picture of a particular person quickly, we tend to look at their circle of friends (ahem, not the WeChat "Moments" kind); the few people they interact with most can reflect a good part of their own characteristics.

The same goes for other objects in the real world, and their abstractions, data, are no exception. For a given piece of test data, we can look at the training data most similar to it; whatever characteristics those data share, we naturally expect the test data to share them too.

1.3 Choosing the value of K

The choice of K may look arbitrary, but there is some craft to it. K is a hyperparameter of the K-nearest neighbor algorithm, that is, a parameter set manually by the practitioner rather than adjusted automatically by the algorithm.

Choosing K too large is not good. In the most extreme case, we could simply set K = N, the total number of training samples. Then no matter what test data is given, we always reach the same conclusion: the algorithm just takes the most common label in the training samples as the label of the test data. The differences between data points are completely erased, and the method no longer has any value for classification or regression.

Is choosing K too small a problem, then? Yes, it is. Again consider the most extreme case, K = 1, a special case of the K-nearest neighbor algorithm known as the "nearest neighbor algorithm". When the label of the test data depends only on the single closest training sample, something is bound to go wrong eventually: one slightly deviant sample is enough to throw off our judgement of the test data.

From the above analysis we can draw a lesson of sorts: letting a single voice decide everything is bad, and blindly counting every voice equally is no better; a sensible middle ground works best.

Back to the point. Seriously, the analysis above leads to a conclusion: K should be neither too large nor too small. This is awkward, so on the one hand setting K is an empirical job; on the other hand, we can randomly set aside part of the training data as a validation set and test repeatedly, so as to select a suitable K. In practice, this "suitable K" is usually not very large.
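
As an illustration of this kind of repeated hold-out testing, here is a short sketch using scikit-learn (which is not used elsewhere in this article); the 20% hold-out size and the candidate K values are arbitrary choices for demonstration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# hold out 20% of the data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

for k in (1, 3, 5, 7, 9, 15, 25):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, model.score(X_val, y_val))  # accuracy on the held-out data

Whichever K gives the best held-out accuracy (ideally averaged over several random splits) is a reasonable choice.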

1.4 Decision rules

The K-nearest neighbor algorithm can use several decision rules; here we briefly introduce one for each of the two kinds of problems.

1) Classification problems

For classification problems, majority voting is generally used: among the K neighbors, the label with the largest share becomes the label of the test data.
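
A minimal sketch of majority voting with the standard library; the neighbor labels below are made up for illustration.

from collections import Counter

neighbor_labels = ["versicolor", "virginica", "versicolor", "setosa", "versicolor"]
prediction = Counter(neighbor_labels).most_common(1)[0][0]
print(prediction)  # "versicolor" — the label held by most of the K neighbors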

2) Regression problems

For regression problems, a weighted average can be used: among the K neighbors, each is given a weight according to its distance from the test data, and the neighbors' values are then averaged with those weights.
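
One common weighting choice (shown here only as a sketch with made-up numbers, not the article's implementation) is the inverse of the distance, so that closer neighbors count more:

import numpy as np

# distances from the test point to its K nearest neighbors, and those neighbors' target values
neighbor_dists = np.array([0.5, 1.0, 2.0])
neighbor_values = np.array([10.0, 12.0, 20.0])

weights = 1.0 / (neighbor_dists + 1e-8)  # small epsilon avoids division by zero
prediction = np.sum(weights * neighbor_values) / np.sum(weights)
print(prediction)  # 12.0 for these sample numbers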

2. Algorithm implementation

To keep the reading load light, this article only gives a K-nearest neighbor implementation for the classification problem; the regression version is left for readers to explore.

Also, given the limits of the author's ability, corrections from readers are very welcome if there are problems with the code.

Strictly speaking, according to Mr. Li Hang's book Statistical Learning Methods, a proper application of the K-nearest neighbor algorithm should also build a kd-tree to speed up the nearest-neighbor search and improve efficiency. After all, in real industrial applications the algorithm faces millions of data points or more; without a well-designed data structure to speed up retrieval, costs rise substantially and the approach becomes uncompetitive.

Here, however, we only want to understand the K-nearest neighbor algorithm and build some intuition for it, so the kd-tree implementation is not covered (the author finds that part a bit fiddly, and leaving it out also reduces the reading burden; otherwise we would have to introduce how kd-trees work, which would get lengthy).
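
For readers who are curious anyway, here is a brief sketch of how an off-the-shelf kd-tree could handle the neighbor search, using SciPy's cKDTree. This is an alternative to building one by hand and is not used in the implementation below; the arrays are random made-up data for illustration.

import numpy as np
from scipy.spatial import cKDTree

train_data = np.random.rand(1000, 4)  # 1000 made-up training samples with 4 features
test_point = np.random.rand(4)

tree = cKDTree(train_data)                    # build the kd-tree once over the training data
dists, indices = tree.query(test_point, k=5)  # the 5 nearest neighbors and their distances
print(indices, dists)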

2.1 Splitting the data

As the example dataset for this article, we use a dataset that often serves as a benchmark for machine learning algorithms: the Iris dataset.

Readers can download it from https://www.kaggle.com/uciml/iris, or get it from the article's companion code.

A quick introduction to the Iris dataset: it contains 150 records of iris measurements covering three different iris species, setosa, versicolor and virginica, with 50 records each (the species names are left untranslated to avoid confusion; it is enough to know there are three distinct species).

Each record has four measurements, sepal length, sepal width, petal length and petal width, all in centimeters, plus the corresponding species label and a record Id. Some example rows look like this:

Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
...
51,7.0,3.2,4.7,1.4,Iris-versicolor
52,6.4,3.2,4.5,1.5,Iris-versicolor
...
101,6.3,3.3,6.0,2.5,Iris-virginica
102,5.8,2.7,5.1,1.9,Iris-virginica
...

What we want to do is, given the four measurements of an instance, use the K-nearest neighbor algorithm to decide which iris species it belongs to.

The problem is that we only have this one dataset and no separate, real iris data to test on. So we need to split the original dataset, using most of it as a training set and a smaller part as a test set. The ratio is up to the reader; we choose 4:1, i.e. for each species we take 40 records as training data and set 10 aside as test data.

Note that the training and test records should be chosen at random, to avoid unnecessary human bias.

First, read the data:

import pandas as pd

iris = pd.read_csv('iris.csv').to_numpy()

pd.read_csv() returns a DataFrame after reading the CSV file; the to_numpy() method converts it to a NumPy array.
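
As a quick sanity check (an optional step, not part of the original script), the resulting array has one row per record and one column per CSV field; because the Species column holds strings, the array's dtype is object:

print(iris.shape)  # (150, 6): Id, four measurements, Species
print(iris.dtype)  # object, since the Species column contains strings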

Then initialize the containers for each kind of data:

train_data = []    # training data, split 4:1 against the test data
test_data = []     # test data
train_target = []  # labels corresponding to the training data
test_target = []   # labels corresponding to the test data

Since there are three classes, occupying rows 0~49, 50~99 and 100~149, we loop over them with an offset and slice the array accordingly:

import numpy as np

for i in range(3):
    offset = 50 * i
    data = iris[offset:offset + 50, :]
    np.random.shuffle(data)  # shuffle this class's 50 rows in place
    train_data.extend(data[0:40, 1:5].tolist())
    train_target.extend(data[0:40, 5].tolist())
    test_data.extend(data[40:, 1:5].tolist())
    test_target.extend(data[40:, 5].tolist())

To make it easy to put the data into the containers prepared above, we use the tolist() method to convert each slice into a list.

Then, to take advantage of NumPy's performance (not really important here) and its many convenient functions (the main reason), we convert the data back into NumPy arrays:

train_data = np.array(train_data)
test_data = np.array(test_data)
train_target = np.array(train_target)
test_target = np.array(test_target)

At this point, the data we need is ready.
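
If you want to double-check the split (again an optional check, not in the original script), the shapes should come out as 3 × 40 = 120 training samples and 3 × 10 = 30 test samples, each with four features:

print(train_data.shape, test_data.shape)      # (120, 4) (30, 4)
print(train_target.shape, test_target.shape)  # (120,) (30,)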

2.2 Computing distances

In this step we calculate the distance between each test sample and each training sample.

First, prepare a container for the "distances":

distance = []

Then we traverse each sample in the test set; for each test sample, we traverse every sample in the training set. As the distance metric we choose the Euclidean distance:

for i in range(len(test_data)):
    sub_dist = []
    for j in range(len(train_data)):
        dif_array = test_data[i] - train_data[j]
        dist = np.sqrt(np.sum(dif_array * dif_array))
        sub_dist.append(dist)
    distance.append(sub_dist)
distance = np.array(distance)

Here, sub_dist stores, for one particular test sample, its "distance" to every training sample, as a one-dimensional list. distance is the two-dimensional list made up of these sub_dist lists. Finally, distance is also converted into a NumPy array.

That completes the distance calculation.
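
As an aside, the same distance matrix could also be computed without explicit Python loops by using NumPy broadcasting; this is just an equivalent alternative, not used in the rest of the article.

# diff[i, j, :] is test_data[i] - train_data[j]
diff = test_data[:, None, :] - train_data[None, :, :]
distance = np.sqrt((diff ** 2).sum(axis=2))  # shape (len(test_data), len(train_data))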

2.3 Computing the results

Likewise, we first initialize a container for the results:

results = []

This step is where the K in "K-nearest neighbors" finally comes in. To make the program more flexible, we let the user specify K at run time:

K = int(input("Please enter K value: "))

Then, from the size of distance, we know how many test samples there are in total. We use np.argsort() to get the indices of the K smallest distances after sorting in ascending order, use that index array to pick out the corresponding training labels, and finally use the Counter class's most_common() method to get the most frequent label and append it to the prepared container (note the extra import of Counter from the standard library's collections module):

from collections import Counter

for i in range(len(distance)):
    index = np.argsort(distance[i])[:K]
    result = train_target[index]
    species = Counter(result).most_common(1)[0][0]
    results.append(species)

At this point, results holds the K-nearest neighbor prediction for every test sample.

2.4 Evaluating the results

First, initialize a counter for the number of correct predictions:

right = 0

Take out the expected and predicted species for each test sample and print them; then check whether they match and, if so, increment the counter:

for i in range(len(results)):
    print('Right Species = ', test_target[i], \
        ', \tReal Species = ', results[i])
    if results[i] == test_target[i]:
        right += 1

This also gives us the accuracy of the K-nearest neighbor algorithm, which we print as well:

right_rate = right / len(results)
print("Right Rate: ", right_rate)

With that, the implementation of the whole algorithm is complete.

Next we can run it and see .

2.5 Running the example

Here is an example of one run's output:

PS D:\justdopython\KNN> python KNN.py
--------------------------------------------------------------------------------
Please enter K value: 5
--------------------------------------------------------------------------------
Right Species =  Iris-setosa ,  Real Species =  Iris-setosa
Right Species =  Iris-setosa ,  Real Species =  Iris-setosa
Right Species =  Iris-setosa ,  Real Species =  Iris-setosa
Right Species =  Iris-setosa ,  Real Species =  Iris-setosa
Right Species =  Iris-setosa ,  Real Species =  Iris-setosa
Right Species =  Iris-setosa ,  Real Species =  Iris-setosa
Right Species =  Iris-setosa ,  Real Species =  Iris-setosa
Right Species =  Iris-setosa ,  Real Species =  Iris-setosa
Right Species =  Iris-setosa ,  Real Species =  Iris-setosa
Right Species =  Iris-setosa ,  Real Species =  Iris-setosa
Right Species =  Iris-versicolor ,      Real Species =  Iris-versicolor
Right Species =  Iris-versicolor ,      Real Species =  Iris-versicolor
Right Species =  Iris-versicolor ,      Real Species =  Iris-versicolor
Right Species =  Iris-versicolor ,      Real Species =  Iris-virginica
Right Species =  Iris-versicolor ,      Real Species =  Iris-versicolor
Right Species =  Iris-versicolor ,      Real Species =  Iris-versicolor
Right Species =  Iris-versicolor ,      Real Species =  Iris-versicolor
Right Species =  Iris-versicolor ,      Real Species =  Iris-virginica
Right Species =  Iris-versicolor ,      Real Species =  Iris-versicolor
Right Species =  Iris-versicolor ,      Real Species =  Iris-versicolor
Right Species =  Iris-virginica ,       Real Species =  Iris-virginica
Right Species =  Iris-virginica ,       Real Species =  Iris-virginica
Right Species =  Iris-virginica ,       Real Species =  Iris-virginica
Right Species =  Iris-virginica ,       Real Species =  Iris-virginica
Right Species =  Iris-virginica ,       Real Species =  Iris-virginica
Right Species =  Iris-virginica ,       Real Species =  Iris-virginica
Right Species =  Iris-virginica ,       Real Species =  Iris-virginica
Right Species =  Iris-virginica ,       Real Species =  Iris-virginica
Right Species =  Iris-virginica ,       Real Species =  Iris-virginica
Right Species =  Iris-virginica ,       Real Species =  Iris-virginica
--------------------------------------------------------------------------------
Right Rate:  0.9333333333333333
--------------------------------------------------------------------------------

3. Summary

This article has given a first look at the K-nearest neighbor algorithm and walked through a concrete implementation step by step. If piecing the code together bit by bit feels a little tedious, the complete implementation is also available under "Sample code" below; you can download it, run it and look at the result first, which may be more fun.

Once your interest is piqued, I hope you will re-implement the code described in this article yourself.

In addition, since the value ranges of the sepal and petal measurements differ somewhat, there should really be a normalization step, which this article omits. Readers can add it themselves and compare the results.
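
For reference, here is a minimal sketch of one possible normalization, min-max scaling of the feature columns using the training set's statistics for both sets; this is just one option and is not part of the article's script.

col_min = train_data.min(axis=0)
col_max = train_data.max(axis=0)

# scale every feature into [0, 1] using the training set's range,
# so that no single measurement dominates the distance calculation
train_data = (train_data - col_min) / (col_max - col_min)
test_data = (test_data - col_min) / (col_max - col_min)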

Sample code: https://github.com/JustDoPython/python-100-day/tree/master/

Reference material

https://book.douban.com/subject/26708119/
https://book.douban.com/subject/10590856/
https://www.kaggle.com/uciml/iris
https://blog.csdn.net/pipisorry/article/details/51822775
https://cloud.tencent.com/developer/ask/153164
