That day, I heard the sound of seeds breaking through the soil, subtle yet firm.
Machine Learning | Decision Trees: How the C4.5 Algorithm Handles Missing Values
How can a decision tree be learned from a training set whose samples have missing attribute values? Two problems must be solved:

1. When attribute values are missing, how do we select the partition attribute?
2. Given a partition attribute, if a sample's value on that attribute is missing, how do we partition the sample (i.e., which branch should it go down)?
1 Choosing the partition attribute (Problem 1)
In a dataset without missing values, the algorithm evaluates an attribute $a$ using all of the samples. Naturally, in a dataset with missing values, the quality of attribute $a$ can only be judged on the subset of samples that have no missing value on $a$ (denoted $\tilde{D}$ below). The question, then, is: should the proportion of missing samples affect attribute selection?
For a dataset $D$: if no sample has a missing value, the attribute with the largest information gain is chosen as the optimal partition attribute, as usual. If some samples have missing values, the information gain is redefined as the proportion of non-missing samples multiplied by the information gain computed on the non-missing subset, i.e. $\mathrm{Gain}(D, a) = \rho \times \mathrm{Gain}(\tilde{D}, a)$. Note what this implies: for an attribute $a$, the larger its missing ratio, the more its information gain is discounted, and the less likely $a$ is to be selected.
Suppose $\tilde{D}$ denotes the subset of samples in dataset $D$ that have no missing value on attribute $a$; $\tilde{D}^v$ denotes the subset of $\tilde{D}$ whose value on $a$ is $a^v$; and $\tilde{D}_k$ denotes the subset of $\tilde{D}$ belonging to class $k$. Assume each sample $x$ carries a weight $w_x$ (initially $w_x = 1$). The following quantities can then be defined:

$$\rho = \frac{\sum_{x \in \tilde{D}} w_x}{\sum_{x \in D} w_x}, \qquad \tilde{p}_k = \frac{\sum_{x \in \tilde{D}_k} w_x}{\sum_{x \in \tilde{D}} w_x}, \qquad \tilde{r}_v = \frac{\sum_{x \in \tilde{D}^v} w_x}{\sum_{x \in \tilde{D}} w_x},$$

where $\rho$ is the proportion of samples with no missing value, $\tilde{p}_k$ is the proportion of class $k$ among the non-missing samples, and $\tilde{r}_v$ is the proportion of non-missing samples whose value on $a$ is $a^v$.
The information gain formula for a sample set with missing values is therefore updated to

$$\mathrm{Gain}(D, a) = \rho \times \mathrm{Gain}(\tilde{D}, a) = \rho \times \left( \mathrm{Ent}(\tilde{D}) - \sum_{v=1}^{V} \tilde{r}_v \, \mathrm{Ent}(\tilde{D}^v) \right),$$

where $\mathrm{Ent}(\tilde{D}) = -\sum_{k} \tilde{p}_k \log_2 \tilde{p}_k$.
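The weighted gain above can be sketched in code. This is a minimal illustration, not C4.5 itself: the sample representation (feature dict, label, weight triples, with `None` marking a missing value) is an assumption made for the example.

```python
import math
from collections import defaultdict

def weighted_gain(samples, attr):
    """Information gain of `attr` over weighted samples with missing values.

    `samples` is a list of (features: dict, label, weight) triples; a missing
    value on `attr` is represented as None. Implements
    Gain(D, a) = rho * (Ent(D~) - sum_v r_v * Ent(D~_v)),
    with all proportions computed by summing weights, not counting samples.
    """
    def entropy(weight_by_label, total):
        return -sum(w / total * math.log2(w / total)
                    for w in weight_by_label.values() if w > 0)

    total_w = sum(w for _, _, w in samples)
    present = [(f, y, w) for f, y, w in samples if f.get(attr) is not None]
    present_w = sum(w for _, _, w in present)
    if present_w == 0:
        return 0.0
    rho = present_w / total_w  # proportion of non-missing weight

    # Ent(D~): entropy over samples with a known value on attr
    by_label = defaultdict(float)
    for _, y, w in present:
        by_label[y] += w
    ent = entropy(by_label, present_w)

    # sum_v r_v * Ent(D~_v): weighted entropy of each value branch
    by_value = defaultdict(lambda: defaultdict(float))
    for f, y, w in present:
        by_value[f[attr]][y] += w
    cond = 0.0
    for labels in by_value.values():
        v_w = sum(labels.values())
        cond += (v_w / present_w) * entropy(labels, v_w)

    return rho * (ent - cond)
```

For example, with three samples whose `texture` is known (two "good", one "bad", perfectly separated by the attribute) and one sample whose `texture` is missing, $\rho = 3/4$ and the branch entropies are zero, so the gain is $\frac{3}{4}\,\mathrm{Ent}(\tilde{D})$.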
It is easy to see that, during the recursion, if none of the samples at a node have a missing value on an attribute (so that $\rho = 1$), the standard information gain formula is recovered when selecting the partition attribute. One caveat: "all samples" here should account for the fractional copies of samples that were routed down this branch because of a missing value on an earlier attribute; such samples carry fractional weights, so $\rho$, $\tilde{p}_k$ and $\tilde{r}_v$ must be computed by summing weights rather than by counting frequencies.
2 Partitioning the samples (Problem 2)
The first section solved Problem 1: how to select the partition attribute when attribute values are missing. Now, once the partition attribute of a node has been determined, if a sample's value on that attribute is missing, how should the sample be partitioned (i.e., which branch should it go down)?
Because we do not know what the missing value is, the sample is sent down *all* of the attribute's branches. However, after being divided into the different branches, the sample's weight is no longer the same: in the branch corresponding to value $a^v$, its weight is updated to $\tilde{r}_v \cdot w_x$.
In effect, a "prior" (the observed proportion $\tilde{r}_v$ of each value among the non-missing samples) determines the sample's importance in each branch. In the next round of partition-attribute selection, this sample participates in the gain computation with its updated weight. Similarly, a sample with no missing value on attribute $a$ keeps its weight unchanged when it is routed to its child node.
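The weight-update rule above can be sketched as follows. The `(features, label, weight)` triple representation is an assumption made for this illustration; the point is that a missing-value sample appears in *every* branch, each time with weight $\tilde{r}_v \cdot w_x$.

```python
from collections import defaultdict

def split_on(samples, attr):
    """Partition weighted (features, label, weight) samples on `attr`.

    A sample whose value on `attr` is known goes into its branch with its
    weight unchanged. A sample whose value is missing (None) is copied into
    every branch with weight r_v * w, where r_v is the weight fraction of
    branch v among the known-value samples.
    """
    branches = defaultdict(list)
    missing = []
    for f, y, w in samples:
        v = f.get(attr)
        if v is None:
            missing.append((f, y, w))
        else:
            branches[v].append((f, y, w))

    present_w = sum(w for b in branches.values() for _, _, w in b)
    ratios = {v: sum(w for _, _, w in b) / present_w
              for v, b in branches.items()}

    # Fractional copies of each missing-value sample in every branch
    for f, y, w in missing:
        for v, r in ratios.items():
            branches[v].append((f, y, w * r))
    return dict(branches)
```

Note that the total weight across all branches equals the total weight of the input, so no probability mass is lost or invented by the split.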
To sum up: Problem 1 settles the choice of partition attribute, and Problem 2 settles how samples with missing values are partitioned. With these two pieces, each branch can recursively select the optimal partition attribute and partition its samples, completing the construction of a decision tree over a dataset with missing values.
3 Classifying samples with missing values
At test time, how does the decision tree classify a test sample that has missing attribute values?
If, during the decision process, the sample reaches an attribute node whose value is unknown for the test sample, the algorithm explores all of the node's branches, combines the resulting class distributions (weighted by each branch's proportion), and chooses the most probable class as the label.
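The test-time procedure can be sketched as below. The dict-based tree representation (a `"split"` node storing its children and the branch ratios $\tilde{r}_v$, a `"leaf"` node storing a label) is an assumption made for this example, not a structure prescribed by C4.5.

```python
from collections import defaultdict

def class_distribution(node, x, weight=1.0):
    """Walk the tree accumulating {label: weight}.

    At a split node whose test attribute is missing in x (or whose value is
    unseen), descend every branch with the weight scaled by that branch's
    ratio r_v; at a known value, follow the single matching branch.
    """
    dist = defaultdict(float)
    if node["type"] == "leaf":
        dist[node["label"]] += weight
        return dist
    v = x.get(node["attr"])
    if v is not None and v in node["children"]:
        followed = {v: 1.0}          # known value: one branch, full weight
    else:
        followed = node["ratios"]    # missing value: all branches, scaled
    for bv, r in followed.items():
        for lbl, w in class_distribution(node["children"][bv], x,
                                         weight * r).items():
            dist[lbl] += w
    return dist

def classify(tree, x):
    """Label = the class with the highest combined probability mass."""
    dist = class_distribution(tree, x)
    return max(dist, key=dist.get)
```

For instance, with a single split on `texture` whose branch ratios are $2/3$ ("clear" leading to a "good" leaf) and $1/3$ ("blurry" leading to a "bad" leaf), a test sample with `texture` missing collects mass $2/3$ for "good" and $1/3$ for "bad", so it is labeled "good".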
References: Zhou Zhihua, *Machine Learning*; Zhao Ying: https://blog.csdn.net/leaf_zizi