
### Alleviating sample imbalance in multi-class classification

This article was first published on: Walker AI

Multi-class classification with deep learning is a common task in both industry and research. In a research setting, whether the task belongs to the NLP, CV, or TTS families, the data tends to be abundant and clean. In a real industrial setting, however, data problems often become a major obstacle for practitioners. Common data problems include:

• The sample size is small
• Labels are missing or incomplete
• The data is noisy and contains many disturbances
• The sample counts are distributed unevenly across classes

There are other problems besides these, which this article will not enumerate one by one. To address the fourth problem above, Google's July 2020 paper 《 Long-Tail Learning via Logit Adjustment 》 starts from BER ( Balanced Error Rate ) and derives a modified cross-entropy function that, built on top of the ordinary cross entropy, yields higher average per-class accuracy. This article briefly interprets the paper's core inference, implements it in the Keras deep learning framework, and finally presents the results of a simple MNIST handwritten-digit classification experiment. The interpretation covers four aspects:

• Basic concepts
• Core inference
• Code implementation
• Experimental results

### 1. Basic concepts

In deep-learning-based multi-class classification, obtaining good results usually requires tuning the data, the network architecture, the loss function, and the training hyperparameters; this is especially true when the classes are imbalanced. In 《 Long-Tail Learning via Logit Adjustment 》, to mitigate the loss of accuracy caused by class imbalance, the authors add the label prior to the loss function and obtain SOTA results. Before walking through the paper's core inference, this section briefly reviews four basic concepts: (1) the long-tail distribution, (2) softmax, (3) cross entropy, (4) BER.

#### 1.1 Long tail distribution

If the classes in the training data are sorted from largest to smallest by sample count and the sorted counts are plotted, class-imbalanced training data takes on a characteristic shape with a "head" and a "tail": the classes with many samples form the "head", the classes with few samples form the "tail", and the class imbalance problem is pronounced, as the sketch below illustrates.
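
As a quick numeric illustration (the counts below are made up, not from any dataset):

```python
import numpy as np

# Hypothetical per-class sample counts for a 10-class dataset
counts = np.array([5200, 4800, 3900, 2100, 900, 400, 150, 80, 40, 20])

# Sort from largest to smallest: the cumulative share shows a few "head"
# classes holding almost all the data while the "tail" barely registers
sorted_counts = np.sort(counts)[::-1]
cum_share = np.cumsum(sorted_counts) / sorted_counts.sum()
for rank, (n, share) in enumerate(zip(sorted_counts, cum_share), 1):
    print(f"rank {rank:2d}: {n:5d} samples, cumulative share {share:.2f}")
```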

#### 1.2 softmax

Because it normalizes its inputs and is easy to differentiate, softmax is commonly used as the activation function of the last layer of a neural network in binary and multi-class classification, turning the network's raw outputs into a predictive distribution. This article does not derive softmax; it only gives the general formula:

$q\left(c_{j}\right)=\frac{e^{z_{j}}}{\sum_{i=1}^{n} e^{z_{i}}}$

In the neural network, $z_{j}$ is the output of the previous layer (the logit for class $j$); $q\left(c_{j}\right)$ is the resulting output distribution; $\sum_{i=1}^{n} e^{z_{i}}$ is the normalizing sum of $e^{z_{i}}$ over all $n$ classes.
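
A minimal NumPy sketch of this formula (the max subtraction is a standard numerical-stability trick, not part of the formula itself):

```python
import numpy as np

def softmax(z):
    """q(c_j) = e^{z_j} / sum_i e^{z_i}, for a vector of logits z."""
    z = z - np.max(z)   # stability shift; cancels out in the ratio
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # sums to 1; largest logit gets most mass
```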

#### 1.3 Cross entropy

This article does not derive the cross-entropy function; see the information-theory literature for details. In binary and multi-class classification, the cross-entropy function and its variants are usually the loss being optimized. The basic formula is:

$H(p, q)=-\sum_{i} p\left(c_{i}\right) \log q\left(c_{i}\right)$

In the neural network, $p\left(c_{i}\right)$ is the target sample distribution, usually a one-hot encoded label; $q\left(c_{i}\right)$ is the network's output, which can be read as the network's prediction for the sample.
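
A direct translation of $H(p, q)$ into NumPy, assuming a one-hot target (a sketch, not the internal Keras implementation):

```python
import numpy as np

def cross_entropy(p_one_hot, q_pred, eps=1e-8):
    """H(p, q) = -sum_i p(c_i) * log q(c_i); eps guards against log(0)."""
    return -np.sum(p_one_hot * np.log(q_pred + eps))

p = np.array([0.0, 1.0, 0.0])   # one-hot label: the sample belongs to class 1
q = np.array([0.1, 0.7, 0.2])   # network output after softmax
print(cross_entropy(p, q))      # = -log(0.7) ≈ 0.357
```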

#### 1.4 BER

In binary classification, BER is the mean of the error rates on the positive and negative samples; in multi-class classification, it is the average of the per-class error rates, each class weighted equally. It can be written in the following form (from the paper):

$\operatorname{BER}(f) \doteq \frac{1}{L} \sum_{y \in[L]} \mathbb{P}_{x \mid y}\left(y \notin \operatorname{argmax}_{y^{\prime} \in[L]} f_{y^{\prime}}(x)\right)$

Here, $f$ is the whole neural network; $f_{y^{\prime}}(x)$ is the network's score for class $y^{\prime}$ given input $x$; the event $y \notin \operatorname{argmax}_{y^{\prime} \in[L]} f_{y^{\prime}}(x)$ means the true label $y$ is not the network's top prediction, i.e. a misclassification; $\mathbb{P}_{x \mid y}$ takes the probability over the samples of class $y$, yielding the per-class error rate; and $\frac{1}{L}$ is the equal weight given to each class.
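
A minimal sketch of the same quantity: compute each class's error rate on its own samples, then average with equal weight $\frac{1}{L}$ (hypothetical labels, not from the paper):

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """Mean over classes of the per-class error rate P_{x|y}(prediction != y)."""
    classes = np.unique(y_true)
    per_class_err = [np.mean(y_pred[y_true == c] != c) for c in classes]
    return np.mean(per_class_err)

y_true = np.array([0, 0, 0, 0, 1, 1])       # class 0 dominates the sample
y_pred = np.array([0, 0, 0, 0, 0, 1])       # one of the two class-1 samples is missed
print(balanced_error_rate(y_true, y_pred))  # (0/4 + 1/2) / 2 = 0.25
```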

### 2. Core inference

Following the paper's line of reasoning, first fix the neural network model that minimizes BER:

$f^{*} \in \operatorname{argmin}_{f: \mathcal{X} \rightarrow \mathbb{R}^{L}} \operatorname{BER}(f)$

That is, $f^{*}$ is a network satisfying the BER condition. Optimizing its prediction $\operatorname{argmax}_{y \in[L]} f_{y}^{*}(x)$ is then equivalent to optimizing $\operatorname{argmax}_{y \in[L]} \mathbb{P}^{\mathrm{bal}}(y \mid x)$, i.e. predicting the label $y$ for a given input $x$ under a balanced view of the classes (each class rescaled by its own weight). In shorthand:

$\operatorname{argmax}_{y \in[L]} f_{y}^{*}(x)=\operatorname{argmax}_{y \in[L]} \mathbb{P}^{\mathrm{bal}}(y \mid x)=\operatorname{argmax}_{y \in[L]} \mathbb{P}(x \mid y)$

For $\mathbb{P}^{\text {bal }}(y \mid x)$, clearly $\mathbb{P}^{\text {bal }}(y \mid x) \propto \mathbb{P}(y \mid x) / \mathbb{P}(y)$, where $\mathbb{P}(y)$ is the label prior and $\mathbb{P}(y \mid x)$ is the conditional probability of the label given the input $x$. Now combine this with what a multi-class network actually learns during training:

Following the process above, write the network's output logits as $s^{*}: \mathcal{X} \rightarrow \mathbb{R}^{L}$. Since $s^{*}$ passes through a softmax activation layer, i.e. $q\left(c_{y}\right)=\frac{e^{s_{y}^{*}(x)}}{\sum_{y^{\prime} \in[L]} e^{s_{y^{\prime}}^{*}(x)}}$, it is easy to see that $\mathbb{P}(y \mid x) \propto \exp \left(s_{y}^{*}(x)\right)$. Combining this with $\mathbb{P}^{\text {bal }}(y \mid x) \propto \mathbb{P}(y \mid x) / \mathbb{P}(y)$, $\mathbb{P}^{\text {bal }}(y \mid x)$ can be expressed as:

$\operatorname{argmax}_{y \in[L]} \mathbb{P}^{\text {bal }}(y \mid x)=\operatorname{argmax}_{y \in[L]} \exp \left(s_{y}^{*}(x)\right) / \mathbb{P}(y)=\operatorname{argmax}_{y \in[L]} s_{y}^{*}(x)-\ln \mathbb{P}(y)$

Based on the formula above, the paper gives two ways to implement the optimization of $\mathbb{P}^{\text {bal }}(y \mid x)$:

（1） Via $\operatorname{argmax}_{y \in[L]} \exp \left(s_{y}^{*}(x)\right) / \mathbb{P}(y)$: after the input $x$ has passed through all the network layers to produce a prediction, divide the prediction by the prior $\mathbb{P}(y)$. This post-hoc method predates the paper and has achieved some success (see the sketch after this list).

（2） Via $\operatorname{argmax}_{y \in[L]} s_{y}^{*}(x)-\ln \mathbb{P}(y)$: after the input $x$ has passed through the network layers to produce the logits, subtract $\ln \mathbb{P}(y)$ from them. The paper takes this approach.
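
A minimal sketch of variant (1), with made-up numbers for `probs` and `prior` (this is the post-hoc adjustment, not the loss the paper trains with):

```python
import numpy as np

# probs: softmax outputs of a conventionally trained network, shape (batch, L)
probs = np.array([[0.60, 0.30, 0.10]])
# prior: empirical class frequencies P(y) from the training data
prior = np.array([0.70, 0.20, 0.10])

# Variant (1): rescale predictions by 1/P(y) at inference time
adjusted = probs / prior
print(probs.argmax(axis=-1))     # [0]  -- the head class wins before adjustment
print(adjusted.argmax(axis=-1))  # [1]  -- a rarer class wins after adjustment
```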

Following the second idea, the paper directly gives a general loss, called the logit adjustment loss:

$\ell(y, f(x))=-\log \frac{e^{f_{y}(x)+\tau \cdot \log \pi_{y}}}{\sum_{y^{\prime} \in[L]} e^{f_{y^{\prime}}(x)+\tau \cdot \log \pi_{y^{\prime}}}}=\log \left[1+\sum_{y^{\prime} \neq y}\left(\frac{\pi_{y^{\prime}}}{\pi_{y}}\right)^{\tau} \cdot e^{\left(f_{y^{\prime}}(x)-f_{y}(x)\right)}\right]$

Compare this with the regular softmax cross entropy:

$\ell(y, f(x))=\log \left[\sum_{y^{\prime} \in[L]} e^{f_{y^{\prime}}(x)}\right]-f_{y}(x)=\log \left[1+\sum_{y^{\prime} \neq y} e^{f_{y^{\prime}}(x)-f_{y}(x)}\right]$

In essence, each logit (the output before the softmax activation) receives an offset tied to the label prior.
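
As a quick sanity check on this claim (a sketch with random numbers, not from the paper), adding $\tau \cdot \log \pi_{y^{\prime}}$ to each logit before a standard softmax cross entropy reproduces the logit adjustment loss exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=5)                      # logits f_{y'}(x) for L = 5 classes
pi = np.array([0.4, 0.3, 0.15, 0.1, 0.05])  # label prior pi_y
tau, y = 1.0, 3                             # regulating factor and true label

# Left form: standard softmax cross entropy on the prior-shifted logits
shifted = f + tau * np.log(pi)
lhs = -np.log(np.exp(shifted[y]) / np.exp(shifted).sum())

# Right form: the pairwise expression of the logit adjustment loss
rhs = np.log(1 + sum((pi[k] / pi[y]) ** tau * np.exp(f[k] - f[y])
                     for k in range(5) if k != y))
print(np.isclose(lhs, rhs))  # True
```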

### 3. Code implementation

The implementation idea is to add a prior-based offset $\tau \cdot \log \pi_{y}$ to each of the network's output logits. In practice, to keep things as simple and effective as possible, take the regulating factor $\tau=1$; the logit adjustment loss then simplifies to:

$\ell(y, f(x))=-\log \frac{e^{f_{y}(x)+\log \pi_{y}}}{\sum_{y^{\prime} \in[L]} e^{f_{y^{\prime}}(x)+\log \pi_{y^{\prime}}}}=\log \left[1+\sum_{y^{\prime} \neq y} \frac{\pi_{y^{\prime}}}{\pi_{y}} \cdot e^{f_{y^{\prime}}(x)-f_{y}(x)}\right]$

In Keras, this can be implemented as follows:

```python
import numpy as np
import keras.backend as K

def CE_with_prior(one_hot_label, logits, prior, tau=1.0):
    '''
    param: one_hot_label: one-hot encoded ground-truth labels
    param: logits: network outputs before the softmax activation
    param: prior: real data distribution obtained by statistics
    param: tau: regulating factor, default is 1
    return: loss
    '''
    # log of the label prior; the small epsilon guards against log(0)
    log_prior = K.constant(np.log(prior + 1e-8))

    # align dimensions so the prior broadcasts over the batch
    for _ in range(K.ndim(logits) - 1):
        log_prior = K.expand_dims(log_prior, 0)

    # logit adjustment: add tau * log(pi_y) to each class's logit
    logits = logits + tau * log_prior
    loss = K.categorical_crossentropy(one_hot_label, logits, from_logits=True)

    return loss
```
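
To train with this loss, Keras expects a function of (y_true, y_pred), so the extra arguments must be bound first. A minimal usage sketch, assuming the `CE_with_prior` above; the tiny model and the prior values here are placeholders, not the article's actual network:

```python
import numpy as np
from functools import partial
from keras.models import Sequential
from keras.layers import Dense

# Hypothetical prior: per-class frequencies estimated from the training labels
prior = np.array([0.2, 0.2, 0.2, 0.2, 0.1, 0.02, 0.02, 0.02, 0.02, 0.02])

# Bind prior and tau so the loss matches Keras's (y_true, y_pred) signature
loss_fn = partial(CE_with_prior, prior=prior, tau=1.0)

# The last layer outputs raw logits (no softmax), since the loss uses from_logits=True
model = Sequential([Dense(64, activation='relu', input_shape=(784,)),
                    Dense(10)])
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
```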

### 4. Experimental results

《 Long-Tail Learning via Logit Adjustment 》 compares several methods for improving classification accuracy under long-tailed distributions, testing on different datasets, and the proposed method outperforms the existing ones; see the paper itself for the detailed results. To quickly verify the correctness of the implementation above and the effectiveness of the method, a simple MNIST handwritten-digit classification experiment was run. The experimental setup is as follows:

|  | Details |
| --- | --- |
| Training samples | digits 0–4: 5000 per class; digits 5–9: 500 per class |
| Test samples | digits 0–9: 500 per class |
| Running environment | local CPU |
| Network structure | convolution + max pooling + fully connected |
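
The article does not include the data-preparation code; the following is a sketch of how such an imbalanced training set and its label prior could be built (the exact pipeline of the original experiment is an assumption):

```python
import numpy as np
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Keep 5000 samples per class for digits 0-4 and 500 per class for digits 5-9
per_class = {c: (5000 if c < 5 else 500) for c in range(10)}
keep = np.concatenate([np.where(y_train == c)[0][:n] for c, n in per_class.items()])
x_imb, y_imb = x_train[keep], y_train[keep]

# The label prior fed to CE_with_prior is the empirical class frequency
prior = np.bincount(y_imb, minlength=10) / len(y_imb)
print(prior)  # roughly 0.182 for digits 0-4 and 0.018 for digits 5-9
```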

Under this setup, a comparative experiment was run: the standard multi-class cross entropy and the cross entropy with prior were each used as the loss function for otherwise identical classification networks. With the same epoch = 60, the results are as follows:

|  | Standard multi-class cross entropy | Cross entropy with prior |
| --- | --- | --- |
| Accuracy | 0.9578 | 0.9720 |
| Training curves | (figure omitted) | (figure omitted) |

PS: For more technical content, follow the official account 【 xingzhe_ai 】 and join the discussion!

