

In this article, let's learn about another machine learning algorithm we often hear about: K-means clustering.

The name sounds a bit like "K-nearest neighbors", but the two are different: the "K" in "K-nearest neighbors" refers to "the K data points closest to the input data", while the "K" in "K-means clustering" means "divide a pile of unlabeled data into K categories". Each such "category" is usually called a "cluster", like a cluster of flowers.

The "means" part is more straightforward: it is simply the average of the data in each cluster. This average serves as the center of the cluster and is used to judge how far other data points are from that cluster.
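
As a tiny illustration (not part of the article's code), the center of a cluster is just the column-wise mean of its points; the values below are made up:

import numpy as np

# A hypothetical cluster of three 2-D points
cluster = np.array([[1.0, 2.0],
                    [2.0, 4.0],
                    [3.0, 6.0]])
center = np.mean(cluster, axis=0)   # column-wise mean -> array([2., 4.])
print(center)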

The principle of K-means clustering is easy to understand: "clustering" just means "grouping data together by category", which reflects the simple idea that "birds of a feather flock together". The key question for the algorithm is: by what criterion do we decide that one data point belongs to the same category, i.e. the same cluster, as another?

We introduced the concept of a "distance metric" earlier in this series; refer back to it if necessary. In K-means clustering, the way we measure the difference between two data points is to look at the "distance" between them.
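
To make this concrete, here is a minimal sketch (not from the original article) of the squared Euclidean distance from one data point to several candidate cluster centers, computed with NumPy; this is the same comparison the assignment step below performs. The point and center values are made up:

import numpy as np

# A hypothetical data point and two candidate cluster centers (4 iris-like features each)
point = np.array([5.1, 3.5, 1.4, 0.2])
centers = np.array([[5.0, 3.4, 1.5, 0.2],
                    [6.9, 3.1, 5.7, 2.1]])

diff = point - centers                 # broadcast: one row of differences per center
dist_sq = np.sum(diff * diff, axis=1)  # squared Euclidean distance to each center
nearest = np.argmin(dist_sq)           # index of the closest center
print(dist_sq, nearest)                # the point clearly belongs to center 0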

1. Algorithm implementation

1.1 Initialization

This time we will use a class to implement the K-means clustering algorithm.

To use K-means clustering, you first need to initialize the model: according to the number of clusters K, randomly select K data points from the given data as the initial cluster centers.

Another thing worth mentioning: when implementing machine learning algorithms we usually also "preprocess" the training data, and one such operation is "data normalization". The goal of this step is to bring all features into the same numerical range so that different features are comparable; otherwise, for two (length, weight) records such as (20 m, 10000 g) and (1 m, 10500 g), the difference in length is easily masked by the difference in weight. Fortunately, the features in the iris dataset we use here are all roughly of the same order of magnitude. Skipping normalization is a small flaw, but it does not affect the demonstration of the algorithm, so this step is omitted.
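
For completeness, here is a minimal sketch of what such a min-max normalization step could look like; it is not part of the article's implementation, and the column slice 1:5 is only an assumption matching how the iris features are laid out in the code below:

import numpy as np

def min_max_normalize(features):
    '''Scale each feature column into the range [0, 1].'''
    col_min = features.min(axis=0)
    col_max = features.max(axis=0)
    return (features - col_min) / (col_max - col_min)

# Hypothetical usage: normalize the four iris feature columns of train_data
# normalized = min_max_normalize(train_data[:, 1:5].astype(float))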

import numpy as np
import pandas as pd
import random

class Clusters():
    def __init__(self, train_data, K):
        '''
        :params train_data: ndarray. Training data.
        :params K: int. The number of clusters to be divided.
        '''
        super().__init__()
        self.train_data = train_data
        self.K = K
        # Marks whether clustering is complete. Its value depends on whether any
        # data still needs to be moved from one cluster to another.
        self.finished = False

        # Randomly select K data points as the initial cluster centers
        index = random.sample(range(len(self.train_data)), self.K)
        self.centroid = train_data[index, 1:5]

        # Evenly distribute the training data across the clusters, so that the
        # "assignment" step below can handle all data in the same way
        self.clusters = []
        offset = len(train_data) // self.K
        for i in range(self.K):
            start = offset * i
            if i < self.K - 1:
                self.clusters.append(train_data[start:start + offset, :])
            else:
                # The last cluster contains all the remaining data
                self.clusters.append(train_data[start:, :])

    # Load the dataset to be used
    @staticmethod
    def getData():
        '''
        Get the data; the return value type is ndarray
        '''
        train_data = pd.read_csv('iris.csv').to_numpy()
        return train_data

The initialization method accepts two parameters from the outside: train_data and K. train_data is the dataset to be clustered, of type ndarray; K is the number of clusters to divide the data into, of type int.

In the constructor, we use the random.sample function from the random module to pick K integers from the 150 integers 0 ~ 149 as indices; the corresponding data points become our randomly initialized cluster centers.

Next, the training data is split evenly among the K clusters. This step only exists so that the clusters attribute has the format that the "assignment" step below expects.

Finally, a static method is defined to load the training data to be used.

1.2 Assignment

In this step we need to compute the distance between each data point and each cluster center, choose the cluster whose center is closest to the data point as its target cluster, move the data point into the target cluster, and delete the corresponding data from the original cluster.

In other words, the requirement of this step is: given the current cluster centers, every data point should be in its proper place, belonging to the most appropriate cluster.

Here is the implementation of this method:

    # Assign the data to each cluster
    def assign(self):
        self.finished = True
        # data_index_list and target_index_list record, respectively, the
        # "index of the data to be moved within its current cluster" and the
        # "index of the target cluster it should move to"
        target_index_list = []
        data_index_list = []
        for i in range(self.K):
            target_index_list.append([])
            data_index_list.append([])

        for cluster_index in range(len(self.clusters)):
            for data_index in range(len(self.clusters[cluster_index])):
                diff = self.clusters[cluster_index][data_index, 1:5] - self.centroid
                distance_square = np.sum(diff * diff, axis=1)
                target_index = np.argmin(distance_square)

                if cluster_index != target_index:
                    self.finished = False
                    target_index_list[cluster_index].append(target_index)
                    data_index_list[cluster_index].append(data_index)

        for cluster_index in range(self.K):
            for index in range(len(target_index_list[cluster_index])):
                target_index = target_index_list[cluster_index][index]
                data_index = data_index_list[cluster_index][index]
                self.clusters[target_index] = np.append(self.clusters[target_index],
                                                        self.clusters[cluster_index][data_index, :]).reshape(-1, 6)

        for cluster_index in range(self.K):
            data_index = data_index_list[cluster_index]
            self.clusters[cluster_index] = np.delete(self.clusters[cluster_index], data_index, axis=0)

To show the underlying logic of K-means clustering, the implementation deliberately avoids ready-made routines and spells everything out step by step. Because of my limited skill, the code above may not be as compact as it could be; if you have a better implementation, feedback is welcome. Thanks in advance.

At the start of the assign method, after setting finished to True, we create two empty lists, used to record the "index of the data to be moved within its original cluster" (data_index_list) and the "target cluster that the data should be moved to" (target_index_list).

Right after that, both of them are initialized as nested lists containing K empty lists, one per cluster. This spares us a separate record of "the original cluster of the data to be moved": the first-level index of these two lists alone tells us which cluster the data should be moved from / deleted from.
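
As a concrete, made-up illustration of this bookkeeping, suppose K = 3, the 2nd and 5th points of cluster 0 should move to cluster 1, and the 0th point of cluster 2 should move to cluster 0; the two lists would then look like this:

# Hypothetical contents after the scan, for K = 3
data_index_list   = [[2, 5], [], [0]]   # positions to remove, per source cluster
target_index_list = [[1, 1], [], [0]]   # destination cluster for each of those points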

Inside the double loop, the distance between each data point and every cluster center is computed; there is nothing special to explain there. One thing is worth mentioning: since taking the square root does not change the relative order of the values, what we actually compare is the "squared distance" rather than the real "distance". This slightly reduces the amount of computation when the dataset is large.
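
As a quick sanity check of that claim, a minimal sketch: because the square root is monotonically increasing, the index of the smallest squared distance is the same as the index of the smallest distance.

import numpy as np

dist_sq = np.array([4.0, 0.25, 9.0])                        # squared distances to three centers
assert np.argmin(dist_sq) == np.argmin(np.sqrt(dist_sq))    # both pick center 1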

The if block then checks whether the data point is already in the cluster it should belong to; if not, it has to be moved to another cluster. Because the loop over all the data has not finished yet, we cannot add to and delete from the original clusters directly here; we only record the target cluster and the data index of every point that needs to move, and modify the actual data after the traversal.

To keep the recorded indices valid, we must add first and delete afterwards; otherwise the indices are likely to get scrambled and cause errors. The last two loops do exactly that: first each data point is appended to the target cluster it should move to, then it is deleted from its original cluster. The deletion itself must either go from back to front, or select everything to be deleted at once and remove it in a single operation. We use the latter approach, passing a list of indices so that all the data to be deleted is selected and removed at the same time.
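
Here is a minimal sketch (not from the article) of why passing a list of indices to np.delete in a single call avoids the index-shifting problem that deleting rows one by one would cause:

import numpy as np

cluster = np.arange(12).reshape(6, 2)   # a toy "cluster" of 6 rows
to_remove = [1, 3]                      # row positions recorded during the scan

# One call removes both rows; the recorded indices still refer to the
# original row positions, so nothing shifts underneath us.
cluster = np.delete(cluster, to_remove, axis=0)
print(cluster.shape)                    # (4, 2)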

1.3 Update

This step is also easy to understand, and very easy to implement.

In short, we need to recalculate the center of each cluster based on the cluster's new contents. We use the mean function provided by NumPy. The code is as follows:

    # Update the centroid of each cluster
    def update(self):
        for cluster_index in range(len(self.clusters)):
            self.centroid[cluster_index] = np.mean(self.clusters[cluster_index][:, 1:5], axis=0)

At this point, all the essential steps of the K-means clustering algorithm have been implemented.

2. Improving usability

2.1 Training

For convenience, we also define a train method on the Clusters class, which runs the whole training process in a single call.

This method simply wraps the assign and update methods; it adds no other functionality.

    def train(self):
        '''
        Clustering training
        '''
        while not self.finished:
            self.assign()
            self.update()
        print('Training done!!!')

It is worth mentioning how the Boolean attribute named finished is used to decide when training ends.

At the very beginning, finished is initialized to False. On entering the assignment (assign) step, it is first set to True, i.e. we assume the training process has already finished. As soon as any data point needs to be moved from its current cluster to another cluster during assignment, finished is immediately set back to False, meaning the training process is still going on and has not yet ended.

2.2 Print the results

    def printResult(self):
        '''
        Print clustering results
        '''
        print('-' * 80)
        print('*' * 80)
        print('-' * 80)
        print('*' * 30, 'Clustering results', '*' * 30)
        print('-' * 30, 'Each cluster center', '-' * 30)
        for i in range(self.K):
            print('Cluster', str(i), 'center:', self.centroid[i])
        print('-' * 80)
        print('-' * 30, 'Each cluster result', '-' * 30)
        for i in range(self.K):
            print('-' * 20, 'Cluster', str(i), 'results', '-' * 20)
            for d in self.clusters[i]:
                print(d[5])
        print('-' * 80)
        print('*' * 80)
        print('-' * 80)

This method simply gathers the final results and prints them as formatted, readable text.

2.3 Calling Clusters

print('-' * 80)
K = int(input('Please enter the number of clusters to be divided (a positive integer): '))
data = Clusters.getData()
clusters = Clusters(data, K)
clusters.train()
clusters.printResult()

3. Running results

--------------------------------------------------------------------------------
Please enter the number of clusters to be divided (a positive integer): 3
Training done!!!
--------------------------------------------------------------------------------
********************************************************************************
--------------------------------------------------------------------------------
****************************** Clustering results ******************************
------------------------------ Each cluster center ------------------------------
Cluster 0 center: [5.005999999999999 3.4180000000000006 1.464 0.2439999999999999]
Cluster 1 center: [6.853846153846153 3.0769230769230766 5.715384615384615 2.053846153846153]
Cluster 2 center: [5.883606557377049 2.740983606557377 4.388524590163935 1.4344262295081964]
--------------------------------------------------------------------------------
------------------------------ Each cluster result ------------------------------
-------------------- Cluster 0 results --------------------
Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa
Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa
Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa
Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa
Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa
-------------------- Cluster 1 results --------------------
Iris-versicolor Iris-versicolor Iris-versicolor Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica
Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica
Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica
Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica
-------------------- Cluster 2 results --------------------
Iris-virginica Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor
Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor
Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor
Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-virginica
Iris-virginica Iris-versicolor Iris-versicolor Iris-versicolor Iris-versicolor Iris-virginica Iris-versicolor Iris-versicolor Iris-virginica Iris-virginica
Iris-versicolor Iris-virginica Iris-virginica Iris-virginica Iris-versicolor Iris-virginica Iris-virginica Iris-versicolor Iris-virginica Iris-virginica
Iris-virginica
--------------------------------------------------------------------------------
********************************************************************************
--------------------------------------------------------------------------------

4. Summary

To avoid piling up formulas, this article walked through the implementation of K-means clustering, a classic machine learning algorithm, step by step with working code. The code in this article is my own implementation, so problems are inevitable; please point them out.

Thank you for reading and growing together.

Sample code :https://github.com/JustDoPython/python-100-day/tree/master/

Reference material

https://book.douban.com/subject/26708119/
https://book.douban.com/subject/33437381/
https://www.kaggle.com/uciml/iris
https://blog.csdn.net/pipisorry/article/details/51822775

Series articles


Day 119: Python crawls Douban Movie Top 250
Day 118: Object copying and comparison in Python
Day 117: Machine learning algorithms: K-nearest neighbors
Day 116: Machine learning algorithms: naive Bayes
Day 115: Is Python pass-by-value or pass-by-reference?
Day 114: Three board model algorithm project in practice
Day 113: Python XGBoost algorithm project in practice
Day 112: Machine learning algorithms: Monte Carlo
Day 111: Python's garbage collection mechanism

Learning Python from 0: Days 0 - 110 grand collection