
Alleviating sample imbalance in multi-class classification

This article was first published by Walker AI.

Using deep learning for multi-class classification is a common task in both industry and research. In a research environment, whether the task belongs to the NLP, CV, or TTS family, the data is usually rich and clean. In a real industrial environment, however, data problems often become a major headache for practitioners. Common data problems include:

  • The sample size is small
  • Labels are missing
  • The data is not clean and contains a lot of noise
  • The number of samples is distributed unevenly across classes, and so on

Besides these, there are other problems that this article will not list one by one. For the fourth problem above, Google's July 2020 paper 《Long-Tail Learning via Logit Adjustment》 uses reasoning based on the BER (Balanced Error Rate) to modify the cross-entropy function, so that, building on the original cross entropy, the average per-class classification accuracy becomes higher. This article briefly interprets the paper's core inference, implements it with the Keras deep learning framework, and finally demonstrates it with a simple MNIST handwritten digit classification experiment. The article is organized into the following four parts:

  • Basic concepts
  • Core inference
  • Code implementation
  • Experimental results

1. Basic concepts

In multi-class classification based on deep learning, obtaining good results often requires adjusting the data, the structure and parameters of the neural network, the loss function, and the training hyperparameters; this is especially true when the classes are imbalanced. In 《Long-Tail Learning via Logit Adjustment》, to alleviate the low classification accuracy caused by class imbalance, the prior knowledge of the labels is added to the loss function, achieving a SOTA effect. Before walking through its core inference, this article therefore briefly reviews four basic concepts: (1) long-tail distribution, (2) softmax, (3) cross entropy, (4) BER.

1.1 Long-tail distribution

If the categories in the training data are sorted from high to low by sample count and the result is plotted, class-imbalanced training data shows a distribution with a "head" and a "tail", as in the figure below:

Categories with many samples form the "head", and categories with few samples form the "tail"; the class imbalance problem is significant.

1.2 softmax

Because softmax normalizes its inputs and is easy to differentiate, it is often used as the activation function of the last layer of a neural network in binary and multi-class classification problems, expressing the network's predicted distribution. This article does not derive softmax in detail and only gives the general formula:

$$q\left(c_{j}\right)=\frac{e^{z_{j}}}{\sum_{i=1}^{n} e^{z_{i}}}$$

In the neural network, $z_{j}$ is the output of the previous layer for class $j$; $q\left(c_{j}\right)$ is the normalized output distribution of this layer; $\sum_{i=1}^{n} e^{z_{i}}$ is the sum of $e^{z_{i}}$ over all $n$ classes, which normalizes the outputs.
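
A minimal numerical sketch of this formula (toy logits, for illustration only; the function name is ours, not from the original post):

import numpy as np

def softmax(z):
    z = z - np.max(z)               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # z_j: outputs of the previous layer
print(softmax(logits))              # q(c_j): non-negative values that sum to 1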

1.3 Cross entropy

This article does not derive the cross-entropy function either; see the information-theory literature for details. In binary and multi-class classification problems, the cross-entropy function and its variants are usually used as the loss function to be optimized. The basic formula is:

$$H(p, q)=-\sum_{i} p\left(c_{i}\right) \log q\left(c_{i}\right)$$

In the neural network, $p\left(c_{i}\right)$ is the expected sample distribution, usually the one-hot encoded label; $q\left(c_{i}\right)$ is the output of the neural network, which can be regarded as the network's prediction for the sample.
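
A matching sketch with toy numbers (illustration only): for a one-hot label, the cross entropy reduces to the negative log-probability assigned to the true class.

import numpy as np

p = np.array([0.0, 1.0, 0.0])        # p(c_i): one-hot label
q = np.array([0.2, 0.7, 0.1])        # q(c_i): softmax output of the network

H = -np.sum(p * np.log(q + 1e-12))   # small epsilon guards against log(0)
print(H)                             # ~0.357, i.e. -log(0.7)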

1.4 BER

In binary classification, BER is the mean of the prediction error rates on the positive and the negative samples; in multi-class classification, it is the equally weighted sum of the per-class error rates. It can be written in the following form (from the paper):

$$\operatorname{BER}(f) \doteq \frac{1}{L} \sum_{y \in[L]} \mathbb{P}_{x \mid y}\left(y \notin \operatorname{argmax}_{y^{\prime} \in [L]} f_{y^{\prime}}(x)\right)$$

Here, $f$ is the whole neural network; $f_{y^{\prime}}(x)$ is the network's output (logit) for class $y^{\prime}$ given the input $x$; $y \notin \operatorname{argmax}_{y^{\prime}} f_{y^{\prime}}(x)$ means the label $y$ is misclassified by the network; $\mathbb{P}_{x \mid y}$ is the per-class error probability; $\frac{1}{L}$ is the weight given to each class.
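
A toy sketch (not from the paper; the function and variable names are ours) showing why BER is a better yardstick under imbalance than the plain error rate:

import numpy as np

def balanced_error_rate(y_true, y_pred, num_classes):
    # Mean over classes of the per-class error rate P_{x|y}(prediction != y).
    errs = [np.mean(y_pred[y_true == c] != c) for c in range(num_classes)]
    return np.mean(errs)

# Class 1 is rare and always misclassified as class 0.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
print(balanced_error_rate(y_true, y_pred, 2))   # 0.5, while the plain error rate is only 0.1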

2. Core inference

Following the paper's line of reasoning, first fix a neural network model:

$$f^{*} \in \operatorname{argmin}_{f: x \rightarrow \mathbb{R}^{L}} \operatorname{BER}(f)$$

That is, $f^{*}$ is a neural network that minimizes the BER. Optimizing $\operatorname{argmax}_{y \in[L]} f_{y}^{*}(x)$ for this network is then equivalent to optimizing $\operatorname{argmax}_{y \in[L]} \mathbb{P}^{\mathrm{bal}}(y \mid x)$, i.e. predicting the label $y$ from the training data $x$ with each class rebalanced (multiplied by its own weight). In shorthand:

$$\operatorname{argmax}_{y \in[L]} f_{y}^{*}(x)=\operatorname{argmax}_{y \in[L]} \mathbb{P}^{\mathrm{bal}}(y \mid x)=\operatorname{argmax}_{y \in[L]} \mathbb{P}(x \mid y)$$

For $\mathbb{P}^{\text{bal}}(y \mid x)$, clearly $\mathbb{P}^{\text{bal}}(y \mid x) \propto \mathbb{P}(y \mid x) / \mathbb{P}(y)$, where $\mathbb{P}(y)$ is the label prior and $\mathbb{P}(y \mid x)$ is the conditional probability of the predicted label given the training data $x$. Now combine this with what a multi-class neural network actually learns during training:

Following the process above, denote the logits output by the network as $s^{*}: x \rightarrow \mathbb{R}^{L}$. Because $s^{*}$ passes through the softmax activation layer, i.e. $q\left(c_{i}\right)=\frac{e^{s_{i}^{*}}}{\sum_{i=1}^{n} e^{s_{i}^{*}}}$, it is not hard to see that $\mathbb{P}(y \mid x) \propto \exp \left(s_{y}^{*}(x)\right)$. Combining this with $\mathbb{P}^{\text{bal}}(y \mid x) \propto \mathbb{P}(y \mid x) / \mathbb{P}(y)$, $\mathbb{P}^{\text{bal}}(y \mid x)$ can be expressed as:

$$\operatorname{argmax}_{y \in[L]} \mathbb{P}^{\text{bal}}(y \mid x)=\operatorname{argmax}_{y \in[L]} \exp \left(s_{y}^{*}(x)\right) / \mathbb{P}(y)=\operatorname{argmax}_{y \in[L]} s_{y}^{*}(x)-\ln \mathbb{P}(y)$$
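
A quick numeric check of this equivalence (toy numbers, illustration only): dividing the softmax probabilities by the prior and subtracting $\ln \mathbb{P}(y)$ from the logits pick the same class.

import numpy as np

logits = np.array([2.0, 1.5, 0.3])    # s*_y(x) for three classes
prior  = np.array([0.7, 0.2, 0.1])    # P(y): class frequencies of an imbalanced set

probs = np.exp(logits) / np.exp(logits).sum()   # softmax, proportional to P(y|x)

adjusted_probs  = probs / prior                 # route 1: divide predictions by the prior
adjusted_logits = logits - np.log(prior)        # route 2: shift the logits
print(np.argmax(adjusted_probs), np.argmax(adjusted_logits))   # same argmax (here: 1)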

Referring to the formula above, the paper gives two ways to implement the optimization of $\mathbb{P}^{\text{bal}}(y \mid x)$:

(1) Via $\operatorname{argmax}_{y \in[L]} \exp \left(s_{y}^{*}(x)\right) / \mathbb{P}(y)$: after the input $x$ has passed through all layers of the network and produced a prediction, divide the prediction by the prior $\mathbb{P}(y)$. This post-hoc method had been used before and achieved certain results.

(2) Via $\operatorname{argmax}_{y \in[L]} s_{y}^{*}(x)-\ln \mathbb{P}(y)$: after the input $x$ has passed through the network layers and produced the logits, subtract $\ln \mathbb{P}(y)$ from them. The paper adopts this idea.

Following the second idea, the paper directly gives a general formula, called the logit adjustment loss:

$$\ell(y, f(x))=-\log \frac{e^{f_{y}(x)+\tau \cdot \log \pi_{y}}}{\sum_{y^{\prime} \in[L]} e^{f_{y^{\prime}}(x)+\tau \cdot \log \pi_{y^{\prime}}}}=\log \left[1+\sum_{y^{\prime} \neq y}\left(\frac{\pi_{y^{\prime}}}{\pi_{y}}\right)^{\tau} \cdot e^{f_{y^{\prime}}(x)-f_{y}(x)}\right]$$

Compare this with the regular softmax cross entropy:

$$\ell(y, f(x))=\log \left[\sum_{y^{\prime} \in[L]} e^{f_{y^{\prime}}(x)}\right]-f_{y}(x)=\log \left[1+\sum_{y^{\prime} \neq y} e^{f_{y^{\prime}}(x)-f_{y}(x)}\right]$$

Essentially, an offset related to the label prior is applied to each logit (i.e. the output before the softmax activation).

3. Code implementation

The implementation idea is to add a prior-based offset $\log \left(\frac{\pi_{y^{\prime}}}{\pi_{y}}\right)^{\tau}$ to the logits output by the network. In practice, to keep things as simple and effective as possible, the regulating factor is set to $\tau = 1$ and $\pi_{y^{\prime}} = 1$. The logit adjustment loss then simplifies to:

$$\ell(y, f(x))=-\log \frac{e^{f_{y}(x)+\tau \cdot \log \pi_{y}}}{\sum_{y^{\prime} \in[L]} e^{f_{y^{\prime}}(x)+\tau \cdot \log \pi_{y^{\prime}}}}=\log \left[1+\sum_{y^{\prime} \neq y} e^{f_{y^{\prime}}(x)-f_{y}(x)-\log \pi_{y}}\right]$$

The Keras implementation is as follows:

import numpy as np
import keras.backend as K

def CE_with_prior(one_hot_label, logits, prior, tau=1.0):
    '''
    param: one_hot_label
    param: logits
    param: prior: real data distribution obtained by statistics
    param: tau: regulator, default is 1
    return: loss
    '''   
    log_prior = K.constant(np.log(prior + 1e-8))

    # align dim 
    for _ in range(K.ndim(logits) - 1):     
        log_prior = K.expand_dims(log_prior, 0)

    logits = logits + tau * log_prior
    loss = K.categorical_crossentropy(one_hot_label, logits, from_logits=True)

    return loss
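
A minimal usage sketch (our illustration, not the original post's code): Keras expects a loss of the form loss(y_true, y_pred), so the prior is bound with functools.partial; the last Dense layer must have no activation so that it outputs raw logits. At inference time the raw logits are used directly (argmax), since the prior adjustment lives only in the training loss.

from functools import partial

import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Label prior estimated from the imbalanced training-set class counts (5 x 5000, 5 x 500).
counts = np.array([5000] * 5 + [500] * 5)
prior = counts / counts.sum()

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(10)                      # no softmax here: the loss expects raw logits
])

model.compile(optimizer='adam',
              loss=partial(CE_with_prior, prior=prior, tau=1.0),
              metrics=['accuracy'])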

4. Experimental results

The paper 《Long-Tail Learning via Logit Adjustment》 compares several methods for improving classification accuracy on long-tailed distributions and tests them on different datasets, outperforming existing methods; refer to the paper itself for the detailed results. To quickly verify the correctness of the implementation and the effectiveness of the method, a simple handwritten digit classification experiment was run on MNIST. The experimental setup is as follows:

  • Training samples: digits 0–4, 5000 images per class; digits 5–9, 500 images per class (see the data-preparation sketch after this list)
  • Test samples: digits 0–9, 500 images per class
  • Running environment: local CPU
  • Network structure: convolution + max pooling + fully connected layers
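
A sketch of how such an imbalanced training set can be built from MNIST (a plausible reconstruction of the setup above, not the post's original code):

import numpy as np
from keras.datasets import mnist
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Keep 5000 images per class for digits 0-4 and 500 per class for digits 5-9.
per_class = {d: (5000 if d <= 4 else 500) for d in range(10)}
idx = np.concatenate([np.where(y_train == d)[0][:n] for d, n in per_class.items()])
np.random.shuffle(idx)

x_train = x_train[idx].reshape(-1, 28, 28, 1).astype('float32') / 255.0
y_train = to_categorical(y_train[idx], 10)

# Label prior estimated from the imbalanced training set.
prior = y_train.sum(axis=0) / y_train.sum()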

With this setup, a comparative experiment was run: the standard multi-class cross entropy and the cross entropy with the label prior were each used as the loss function of the same classification network, and their performance was compared. With the same epoch = 60, the results are as follows:

  • Accuracy with standard multi-class cross entropy: 0.9578
  • Accuracy with cross entropy with label prior: 0.9720

(The training curves for both runs are shown as figures in the original post.)

PS: For more technical content, follow the official account 【xingzhe_ai】 and discuss with Walker AI!

Copyright notice
This article was created by [Walker AI]. Please include a link to the original when reposting. Thank you.
https://cdmana.com/2021/01/20210128041721946t.html
