In this article, let's learn about another machine learning algorithm we often hear mentioned: K-means clustering.

The name does sound a bit like "K-nearest neighbors", but let's be clear about the difference: the "K" in "K-nearest neighbors" refers to the K data points closest to the input, while the "K" in "K-means clustering" means dividing a pile of unlabeled data into K categories. Each "category" here is usually called a "cluster", like a cluster of flowers.

The "means" part is more straightforward: it is simply the average value of each cluster of data. This average can serve as the center of the cluster, and it is used to judge how far other data points are from the cluster.
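As a tiny illustration of this idea (with made-up numbers, not from the iris data), the center of a cluster is just the per-feature average of its points:

```python
import numpy as np

# A toy "cluster" of three 2-feature points (hypothetical values)
cluster = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])

# The cluster center is the per-feature mean
center = np.mean(cluster, axis=0)
print(center)  # [3. 4.]
```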

The principle of K-means clustering is easy to understand. "Clustering" just means "grouping data together by category", embodying the simple philosophy that "birds of a feather flock together". The key question for the algorithm is: by what criterion do we decide that one data point belongs to the same category, that is, the same cluster, as other data points?

We introduced the concept of "distance metrics" earlier; refer back to it if needed. In K-means clustering, the way we tell data points apart is by looking at the "distance" between them.
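For two points, the Euclidean distance we rely on below can be sketched like this (the feature values are made up for illustration):

```python
import numpy as np

a = np.array([5.1, 3.5, 1.4, 0.2])  # hypothetical 4-feature point
b = np.array([6.2, 2.9, 4.3, 1.3])  # another hypothetical point

# Euclidean distance: square root of the sum of squared differences
distance = np.sqrt(np.sum((a - b) ** 2))
print(distance)  # ≈ 3.345
```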

## 1. Algorithm implementation

### 1.1 Initialization

This time we use a class to implement the K-means clustering algorithm.

To use K-means clustering, you first need to initialize the model: given the number of clusters K, randomly select K data points from the given data to serve as the initial cluster centers.

One more thing worth mentioning: when implementing machine learning algorithms, we usually also need to "preprocess" the training data, and one such operation is "data normalization". Its goal is to bring all features into the same numerical range so that they are comparable; otherwise, for two (length, weight) records such as (20 m, 10000 g) and (1 m, 10500 g), the difference in length is easily drowned out by the difference in weight. Fortunately, the features we use from the `iris` data set are all roughly the same order of magnitude. Skipping normalization may leave some imperfection, but it does not affect our demonstration of the algorithm, so we leave this step out.
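If you do want that preprocessing step, a minimal min-max normalization sketch (my own addition, not part of the class below) could look like this:

```python
import numpy as np

def min_max_normalize(data):
    '''Scale each column of `data` into the range [0, 1].'''
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    return (data - col_min) / (col_max - col_min)

# The (length, weight) example from above: the weight difference
# dominates the raw numbers, but not the normalized ones
samples = np.array([[20.0, 10000.0],
                    [1.0, 10500.0]])
print(min_max_normalize(samples))
# [[1. 0.]
#  [0. 1.]]
```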

```python
import numpy as np
import pandas as pd
import random


class Clusters():
    def __init__(self, train_data, K):
        '''
        :params train_data: ndarray. The training data.
        :params K: int. The number of clusters to divide into.
        '''
        super().__init__()
        self.train_data = train_data
        self.K = K

        # Marks whether clustering is complete; it is True or False depending
        # on whether any data still needs to be moved between clusters
        self.finished = False
        # Randomly select K data points as the initial cluster centers
        index = random.sample(range(len(self.train_data)), self.K)
        self.centroid = train_data[index, 1:5]

        # Distribute the training data evenly across the clusters, so every cluster has the same form for the assignment step
        self.clusters = []
        offset = len(train_data) // self.K
        for i in range(self.K):
            start = offset * i
            if i < self.K - 1:
                self.clusters.append(train_data[start:start + offset, :])
            else:
                # The last cluster takes all the remaining data
                self.clusters.append(train_data[start:, :])

    # Load the data set to be used
    @staticmethod
    def getData():
        '''
        Get the data to be clustered.
        The return value type is ndarray.
        '''
        train_data = pd.read_csv('iris.csv').to_numpy()
        return train_data
```

The initialization method accepts two parameters from outside: `train_data` and `K`. `train_data` is the data set to be clustered, of type `ndarray`; `K` is the number of clusters to divide into, of type `int`.

Lines 19 ~ 21 use the random sampling function from the `random` module to pick K integers from the 150 integers 0 ~ 149 as indices; the corresponding data points become our randomly initialized cluster centers.

Lines 23 ~ 32 distribute the training data evenly among the K clusters. The only purpose of this step is to give the `clusters` attribute the same format that the assignment (`assign`) step below expects.

Lines 34 ~ 42 define a static method used to load the training data.

### 1.2 Assignment

In this step we need to compute the distance between each data point and each cluster center, select the cluster whose center is closest as the target cluster, move the data point into that cluster, and delete it from its original cluster.

In other words, the requirement of this step is: given the current cluster centers, every data point should find its proper place and belong to the most appropriate cluster.

Here is the implementation of this method:

```python
    # Assign the data to each cluster
    def assign(self):
        self.finished = True
        # data_index_list and target_index_list record, respectively, "the index of the data to move within its current cluster" and "the index of the target cluster to move it to"
        target_index_list = []
        data_index_list = []
        for i in range(self.K):
            target_index_list.append([])
            data_index_list.append([])

        for cluster_index in range(len(self.clusters)):
            for data_index in range(len(self.clusters[cluster_index])):
                diff = self.clusters[cluster_index][data_index, 1:5] - self.centroid
                distance_square = np.sum(diff * diff, axis=1)
                target_index = np.argmin(distance_square)

                if cluster_index != target_index:
                    self.finished = False
                    target_index_list[cluster_index].append(target_index)
                    data_index_list[cluster_index].append(data_index)

        for cluster_index in range(self.K):
            for index in range(len(target_index_list[cluster_index])):
                target_index = target_index_list[cluster_index][index]
                data_index = data_index_list[cluster_index][index]
                self.clusters[target_index] = np.append(self.clusters[target_index],
                                                        self.clusters[cluster_index][data_index, :]).reshape(-1, 6)

        for cluster_index in range(self.K):
            data_index = data_index_list[cluster_index]
            self.clusters[cluster_index] = np.delete(self.clusters[cluster_index],
                                                     data_index, axis=0)
```

To show the underlying logic of K-means clustering, the implementation deliberately avoids ready-made routines and proceeds step by step. Given my limited skill, the code above may not be as compact as it could be; if you have a better implementation, feedback is welcome. Thanks in advance.

On lines 5 and 6 of the code, we create two empty lists that record "the index, within its original cluster, of each data point to be moved" (`data_index_list`) and "the target cluster each such data point should be moved to" (`target_index_list`).

On lines 7 ~ 9, we initialize the two lists as nested lists containing K list elements. This saves us a third record for "the original cluster of the data to be moved": the first-level index of these two lists alone determines which cluster the data is moved (and deleted) from.

Lines 11 ~ 15 compute the distance between a data point and each cluster center; they need little elaboration. One point worth mentioning: since taking the square root does not change the relative order of values, what we actually compare is the "squared distance" rather than the true "distance". This slightly reduces the amount of computation when the data set is large.
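A quick check, with toy numbers of my own, that skipping the square root does not change which center is closest:

```python
import numpy as np

point = np.array([1.0, 1.0])              # hypothetical data point
centers = np.array([[0.0, 0.0],
                    [5.0, 5.0],
                    [2.0, 1.0]])          # hypothetical cluster centers

diff = point - centers                    # broadcasts over the centers
dist_sq = np.sum(diff * diff, axis=1)     # squared distances
dist = np.sqrt(dist_sq)                   # true distances

# argmin is the same either way, so the sqrt can be skipped
print(np.argmin(dist_sq), np.argmin(dist))  # 2 2
```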

Lines 17 ~ 20 determine whether a data point is in the cluster where it belongs; if not, it should be moved to another cluster. Because the loop over all the data has not yet finished at this point, we cannot add data to or delete data from `clusters` directly; we only record the target cluster and the data index of each move, and modify the data after the traversal ends.

It is easy to see that we must add first and delete afterwards; otherwise the recorded indices are likely to become invalid and cause errors. Lines 22 ~ 32 do exactly this: first append each data point to the target cluster it moves to, then delete it from its original cluster. The deletion itself must either proceed from back to front, or select all the data at once and delete it in one go. We use the latter approach, passing a list of indices so that all the data to be deleted is selected together.
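A small demonstration, with made-up values, of why deleting by a whole list of indices is safe while deleting one index at a time invalidates the later indices:

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])
to_delete = [1, 3]  # we want to drop 20 and 40

# Safe: one call with all indices selected at once
print(np.delete(data, to_delete))  # [10 30 50]

# Unsafe: deleting one at a time shifts the remaining elements,
# so the second recorded index no longer points at 40
step1 = np.delete(data, 1)   # [10 30 40 50]
step2 = np.delete(step1, 3)  # index 3 is now 50, not 40!
print(step2)                 # [10 30 40]
```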

### 1.3 Update

This step is also easy to understand, and very easy to implement.

In short, we need to recompute the center of each cluster from the cluster's new contents. We use the mean function provided by `NumPy`. The code is as follows:

```python
    # Update the centroid of each cluster
    def update(self):
        for cluster_index in range(len(self.clusters)):
            self.centroid[cluster_index] = np.mean(self.clusters[cluster_index][:, 1:5], axis=0)
```

With that, all the steps of the K-means clustering algorithm are implemented.
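For reference, the same assign/update loop can be condensed into a few lines of plain NumPy. This is a sketch on made-up 2-D data, independent of the `Clusters` class above; the starting centers are fixed (one per obvious group) just to keep it reproducible:

```python
import numpy as np

# Six made-up 2-D points forming two obvious groups
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])

K = 2
centers = data[[0, 3]].copy()  # fixed starting centers for reproducibility
for _ in range(5):
    # Assign: index of the nearest center for every point (squared distances)
    labels = np.argmin(((data[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
    # Update: each center becomes the mean of its assigned points
    centers = np.array([data[labels == k].mean(axis=0) for k in range(K)])

print(centers)
# [[0.33333333 0.33333333]
#  [9.33333333 9.33333333]]
```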

## 2. Improving usability

### 2.1 Training

For convenience, we also define a `train` method in the `Clusters` class, used to run the whole training process directly.

This method simply wraps the `assign` and `update` methods together; it adds no other functionality.

```python
    def train(self):
        '''
        Clustering training
        '''
        while not self.finished:
            self.assign()
            self.update()
        print('Training done!!!')
```

It is worth mentioning that we use a Boolean variable named `finished` to mark when training ends.

At the start, `finished` is initialized to `False`. On entering the assignment (`assign`) step, it is first set to `True`, that is, we assume the training process has already finished. As soon as any data point needs to move from its current cluster to another during assignment, `finished` is immediately set back to `False`, meaning the training process is still going on and has not ended.
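The flag pattern can be seen in isolation in the sketch below. The `FakeModel` stand-in and the iteration cap are my own additions, not part of the class above; a cap like this is a common safeguard against a run that never settles:

```python
class FakeModel:
    '''A hypothetical stand-in model that "converges" after 3 assigns.'''
    def __init__(self):
        self.calls = 0
        self.finished = False

    def assign(self):
        self.calls += 1
        # Real assign(): set True first, flip to False if anything moved
        self.finished = self.calls >= 3

    def update(self):
        pass


def train_with_cap(model, max_iters=100):
    '''Run assign/update until the model reports finished, or stop at a cap.'''
    for i in range(max_iters):
        model.assign()
        model.update()
        if model.finished:
            return i + 1  # number of iterations actually used
    return max_iters

print(train_with_cap(FakeModel()))  # 3
```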

### 2.2 Print the results

```python
    def printResult(self):
        '''
        Print clustering results
        '''
        print('-' * 80)
        print('*' * 80)
        print('-' * 80)
        print('*' * 30, ' Clustering results ', '*' * 30)
        print('-' * 30, ' Cluster centers ', '-' * 30)
        for i in range(self.K):
            print('Cluster', str(i), 'center:', self.centroid[i])
        print('-' * 80)
        print('-' * 30, ' Cluster contents ', '-' * 30)
        for i in range(self.K):
            print('-' * 20, 'Cluster', str(i), 'results', '-' * 20)
            for d in self.clusters[i]:
                print(d[5])
        print('-' * 80)
        print('*' * 80)
        print('-' * 80)
```

This method simply gathers the final results and prints them as formatted, readable text.

### 2.3 Calling Clusters

```python
print('-' * 80)
K = int(input('Please enter the number of clusters (a positive integer): '))
data = Clusters.getData()
clusters = Clusters(data, K)
clusters.train()
clusters.printResult()
```

## 3. Running results

```text
--------------------------------------------------------------------------------
Please enter the number of clusters (a positive integer): 3
Training done!!!
--------------------------------------------------------------------------------
********************************************************************************
--------------------------------------------------------------------------------
******************************  Clustering results  ******************************
------------------------------  Cluster centers  ------------------------------
Cluster 0 center: [5.005999999999999 3.4180000000000006 1.464 0.2439999999999999]
Cluster 1 center: [6.853846153846153 3.0769230769230766 5.715384615384615 2.053846153846153]
Cluster 2 center: [5.883606557377049 2.740983606557377 4.388524590163935 1.4344262295081964]
--------------------------------------------------------------------------------
------------------------------  Cluster contents  ------------------------------
-------------------- Cluster 0 results --------------------
Iris-setosa
Iris-setosa
Iris-setosa
... (remaining lines omitted: every entry in this cluster is Iris-setosa)
-------------------- Cluster 1 results --------------------
Iris-versicolor
Iris-versicolor
Iris-versicolor
Iris-virginica
Iris-virginica
... (remaining lines omitted: the rest of this cluster is Iris-virginica)
-------------------- Cluster 2 results --------------------
Iris-virginica
Iris-versicolor
Iris-versicolor
... (remaining lines omitted: mostly Iris-versicolor, with some Iris-virginica mixed in)
--------------------------------------------------------------------------------
********************************************************************************
--------------------------------------------------------------------------------
```

## 4. Summary

To avoid piling up formulas, this article walks through working code to explain, step by step, how a classic machine learning algorithm, K-means clustering, is implemented. The code in this article is a personal implementation, so problems are inevitable; please do point them out.

Thank you for growing together with us.

Sample code: https://github.com/JustDoPython/python-100-day/tree/master/

## Reference material

- https://book.douban.com/subject/26708119/
- https://book.douban.com/subject/33437381/
- https://www.kaggle.com/uciml/iris
- https://blog.csdn.net/pipisorry/article/details/51822775

## Series articles

- Day 119: Scraping the Douban Top 250 movies with Python
- Day 118: Copying and comparing objects in Python
- Day 117: Machine learning algorithms: K-nearest neighbors
- Day 116: Machine learning algorithms: Naive Bayes
- Day 115: Python: pass by value or pass by reference?
- Learning Python from 0: a grand summary of days 0 ~ 110