The "Everything can be Embedding" series combines papers with practical experience. The early posts focus mainly on papers; practical experience and case studies will be added later. Published so far:

- Everything can be Embedding: language models, from N-Gram to NNLM and RNNLM
- Everything can be Embedding: Word2vec — 2 models, 2 optimizations, and practical use
- Item2vec: 8 classic tricks and reflections worth savoring
- Doc2vec: algorithm principles, code implementation, and application inspiration
- Sentence2Vec & GloVe: algorithm principles, derivation, and implementation

More Embedding-related articles will follow; welcome to keep following 「Search and Recommendation Wiki」.

# Sentence2vec

Sentence2vec comes from a paper published at ICLR 2017 (International Conference on Learning Representations), titled **A Simple but Tough-to-Beat Baseline for Sentence Embeddings**.

Let's walk through the content of the paper (it is fairly dense; if anything here is unclear or wrong, corrections in the comments are welcome, thanks!).

## 1、Overview

The paper proposes an unsupervised method for computing sentence embeddings from word embeddings, called Smooth Inverse Frequency (SIF): word embeddings are combined by a weighted average into a sentence embedding, which can then be used directly for similarity computation — the idea being to use Sentence2vec in place of heavier models.

The weighting method proposed in the paper works better than directly averaging word embeddings, and on some tasks it even outperforms RNN and LSTM models.

Ideas similar to this paper include:

- directly averaging the word embeddings to represent a sentence, i.e. an unweighted average
- using TF-IDF values as weights in a weighted average (this presumes the sentence has few repeated words, and using TF-IDF values as weights has no theoretical justification)
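
As a concrete illustration, here is a minimal sketch of these two baselines (the 2-d vectors and TF-IDF values are made up purely for demonstration):

```python
import numpy as np

def average_embedding(words, vectors):
    """Unweighted average of word vectors (the simplest baseline)."""
    return np.mean([vectors[w] for w in words], axis=0)

def weighted_embedding(words, vectors, weights):
    """Weighted average, e.g. with TF-IDF values as weights."""
    ws = np.array([weights[w] for w in words])
    vs = np.array([vectors[w] for w in words])
    return (ws[:, None] * vs).sum(axis=0) / ws.sum()

# Toy 2-d word vectors and TF-IDF-style weights (illustrative values).
vectors = {"the": np.array([1.0, 0.0]), "cat": np.array([0.0, 1.0])}
tfidf = {"the": 0.1, "cat": 0.9}

print(average_embedding(["the", "cat"], vectors))          # [0.5 0.5]
print(weighted_embedding(["the", "cat"], vectors, tfidf))  # [0.1 0.9]
```

Note how the TF-IDF weighting pulls the sentence vector toward the informative word and away from the stop word.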

By introducing SIF and weighting the word embeddings into a sentence embedding, the method is not only more effective but also robust. The paper makes three points:

- word embeddings trained on corpora from different domains all work well, showing the algorithm is friendly to various kinds of corpora
- word frequencies computed from different corpora, used as word weights, have little effect on the final result
- over a wide range of values for the method's hyperparameters, the results are essentially the same, i.e. the choice of hyperparameters has little effect

## 2、Theory

### a）The latent variable generative model

Before introducing Sentence2vec, let's first look at the latent variable generative model. It treats corpus generation as a dynamic process: the $t$-th word is produced at step $t$, and each word $w$ corresponds to a real vector in $R^d$. The process is driven by the random walk of a discourse vector $c_t \in R^d$; the discourse vector represents what is being talked about.

The inner product of the discourse vector $c_t$ and the word vector $v_w$ expresses the correlation between the discourse and the word, and the model assumes that the probability of observing word $w$ at time $t$ is log-linear in this inner product:

$$Pr[w \text{ emitted at time } t \mid c_t] \propto \exp(\langle c_t, v_w \rangle)$$

Because $c_t$ performs a slow random walk, $c_t$ and $c_{t+1}$ differ only by a small random displacement vector, so adjacent words are generated by similar discourse vectors. The model also allows occasional large jumps in $c_t$, since calculation shows they have negligible effect. Word vectors produced by this model are similar to those of word2vec (CBOW) and GloVe.
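
The log-linear emission assumption above is just a softmax over the vocabulary; a small sketch with random toy vectors (not from the paper):

```python
import numpy as np

def emission_probs(c_t, word_vectors):
    """P(w | c_t) proportional to exp(<c_t, v_w>): a softmax over the vocab."""
    scores = word_vectors @ c_t
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()

rng = np.random.default_rng(0)
V, d = 5, 3                              # toy vocabulary size and dimension
word_vectors = rng.normal(size=(V, d))
c_t = rng.normal(size=d)

p = emission_probs(c_t, word_vectors)
print(p.sum())  # sums to 1: a valid distribution over the vocabulary
```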

### b）Sentence2vec's improvements on the random walk

Given a sentence $s$, we perform a maximum likelihood estimation of the discourse vector that governs it. We observe that while a sentence's words are being generated, the discourse vector $c_t$ changes very little, so for simplicity it is treated as fixed and denoted $c_s$. It can be shown that the maximum likelihood estimate of $c_s$ is the average of all word vectors in the sentence.

Sentence2vec improves this model by adding two smoothing terms, because: some words appear out of context, which the true discourse vector cannot account for; and some common stop words have almost nothing to do with the discourse vector.

The two smoothing terms are:

- 1、an additive term $\alpha p(w)$ in the log-linear model, where $p(w)$ is the probability of word $w$ in the whole corpus and $\alpha$ is a hyperparameter; this way, even a word whose inner product with $c_s$ is very small still has some probability of appearing
- 2、a correction term $c_0 \in R^d$ (a common discourse vector), representing the most common component of sentences, which is often related to syntax. For a word whose vector has a larger component along the $c_0$ direction (i.e. a longer projection), this correction increases the probability of the word appearing.

The corrected probability that word $w$ appears in sentence $s$ is:

$$Pr[w \text{ emitted in sentence } s \mid c_s] = \alpha p(w) + (1-\alpha)\frac{\exp(\langle \tilde{c}_s, v_w \rangle)}{Z_{\tilde{c}_s}}$$

where:

- $\tilde{c}_s = \beta c_0 + (1-\beta) c_s$, with $c_0 \perp c_s$
- $\alpha, \beta$ are hyperparameters
- $Z_{\tilde{c}_s} = \sum_{w \in V} \exp(\langle \tilde{c}_s, v_w \rangle)$ is the normalization constant
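
A small numeric sketch of this smoothed probability (toy vectors; the value of `alpha` here is an arbitrary choice). One sanity check: because the unigram probabilities sum to 1, the smoothed probabilities over the whole vocabulary still sum to 1:

```python
import numpy as np

def smoothed_prob(v_w, p_w, c_tilde, word_vectors, alpha=1e-3):
    """alpha * p(w) + (1 - alpha) * exp(<c~_s, v_w>) / Z_{c~_s}."""
    Z = np.exp(word_vectors @ c_tilde).sum()
    return alpha * p_w + (1 - alpha) * np.exp(c_tilde @ v_w) / Z

rng = np.random.default_rng(1)
word_vectors = rng.normal(size=(6, 4))   # toy vocabulary of 6 words
c_tilde = rng.normal(size=4)
unigram = np.full(6, 1 / 6)              # uniform corpus frequencies

probs = [smoothed_prob(word_vectors[i], unigram[i], c_tilde, word_vectors)
         for i in range(6)]
# every word keeps at least alpha * p(w) probability,
# however unrelated it is to c~_s
```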

The formula above also shows that a word $w$ unrelated to $c_s$ can still appear in the sentence, because of:

- the constant term $\alpha p(w)$
- its correlation with the common discourse vector $c_0$

### c）Computing sentence similarity

The sentence vector, the $c_s$ above, is obtained by maximum likelihood estimation. Assume the word vectors $v_w$ are roughly uniformly dispersed in space, so that the normalization constant $Z_c$ takes roughly the same value for every sentence; that is, $Z$ is the same for any $\tilde{c}_s$. Under this assumption, the likelihood is:

$$p[s \mid c_s] = \prod_{w \in s} p(w \mid c_s) = \prod_{w \in s}\left[\alpha p(w) + (1-\alpha)\frac{\exp(\langle v_w, \tilde{c}_s \rangle)}{Z}\right]$$

Taking the logarithm of each factor gives:

$$f_w(\tilde{c}_s) = \log\left[\alpha p(w) + (1-\alpha)\frac{\exp(\langle v_w, \tilde{c}_s \rangle)}{Z}\right]$$

After a series of derivations, the final objective is:

$$\arg\max \sum_{w \in s} f_w(\tilde{c}_s)$$

whose solution is proportional to:

$$\sum_{w \in s} \frac{a}{p(w) + a}\, v_w$$

where $a = \frac{1-\alpha}{\alpha Z}$.

Thus we obtain:

- the optimal solution is a weighted average of all the word vectors in the sentence
- a word $w$ with higher frequency gets a smaller weight, so this method is also equivalent to down-sampling frequent words
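
The down-weighting is easy to see numerically; the default `a = 1e-3` below is just an illustrative choice:

```python
def sif_weight(p_w, a=1e-3):
    """SIF weight a / (a + p(w)): the more frequent the word, the smaller."""
    return a / (a + p_w)

print(sif_weight(1e-1))  # a frequent (stop) word: ~0.0099
print(sif_weight(1e-5))  # a rare word: ~0.99
```

A stop word occurring 10% of the time is weighted roughly 100x less than a rare word.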

Finally, to obtain the final sentence vector, we still need to estimate $c_0$: compute the first principal component (via PCA) of the vectors $\tilde{c}_s$, treat it as $c_0$, and set the final sentence vector to $\tilde{c}_s$ minus its projection onto the principal component $c_0$.

### d） The whole algorithm flow
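
The algorithm figure from the paper does not survive in this copy. Its two steps — a weighted average with weights $a/(a+p(w))$, followed by removing the projection onto the first principal component — might be sketched as follows (toy data, not the authors' released code):

```python
import numpy as np

def sif_embeddings(sentences, vectors, word_probs, a=1e-3):
    """SIF: a/(a + p(w))-weighted average of word vectors per sentence,
    then remove each sentence's projection onto the first principal
    component of the embedding matrix (the common discourse vector c0)."""
    # Step 1: weighted average of word vectors for each sentence.
    emb = np.array([
        np.mean([a / (a + word_probs[w]) * vectors[w] for w in s], axis=0)
        for s in sentences
    ])
    # Step 2: first right singular vector approximates c0; subtract
    # every sentence's projection onto it.
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    c0 = vt[0]
    return emb - np.outer(emb @ c0, c0)

rng = np.random.default_rng(0)
vocab = ["a", "b", "c", "d"]
vectors = {w: rng.normal(size=5) for w in vocab}   # toy word vectors
probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}   # toy unigram frequencies
sents = [["a", "b"], ["c", "d"], ["a", "c", "d"]]

emb = sif_embeddings(sents, vectors, probs)
```

After step 2, every sentence vector is orthogonal to the estimated $c_0$, so the common component shared by all sentences no longer dominates similarity scores.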

## 3、Implementation

The authors have open-sourced the code at: https://github.com/PrincetonML/SIF

# GloVe

## 1、Overview

In the paper, the authors summarize the two main existing families of methods for generating embeddings, each with its own drawbacks:

- matrix-factorization methods: they build on global statistics of the corpus, but during optimization high-frequency words carry disproportionately large weight
- sliding-window methods: they only model local context and cannot directly exploit the global co-occurrence statistics of the corpus

The authors therefore propose GloVe, an algorithm that generates embeddings from global statistics of the corpus. Let's look at the corresponding algorithm principles.

## 2、Algorithm derivation

Notation:

- $X$ is the word co-occurrence matrix; $X_{ij}$ is the number of times word $j$ appears in the context of word $i$, i.e. the number of co-occurrences of words $i$ and $j$ within the specified window
- $X_i = \sum_{k} X_{ik}$ is the total number of co-occurrences between word $i$ and any word $k$
- $P_{ij} = P(j \mid i) = X_{ij} / X_i$ is the fraction of word $i$'s co-occurrences that are with word $j$
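
These definitions can be made concrete on a toy corpus; the sketch below builds a symmetric co-occurrence matrix with a window of 2 (the corpus and window size are made up for illustration):

```python
import numpy as np

def cooccurrence(corpus, vocab, window=2):
    """Symmetric co-occurrence counts: X[i, j] is how often words i and j
    appear within `window` positions of each other."""
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for pos, w in enumerate(sent):
            for ctx in sent[max(0, pos - window):pos]:   # left context only,
                X[idx[w], idx[ctx]] += 1                 # then symmetrized
                X[idx[ctx], idx[w]] += 1
    return X

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"]]
vocab = ["i", "like", "deep", "learning", "nlp"]

X = cooccurrence(corpus, vocab)
X_i = X.sum(axis=1)          # X_i = sum_k X_ik
P = X / X_i[:, None]         # P_ij = X_ij / X_i
```

Each row of `P` is a proper conditional distribution: it sums to 1 over the vocabulary.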

The paper gives a simple example of how meaningful information can be extracted from the co-occurrence matrix, using a table of co-occurrence probability ratios (the figure does not survive in this copy).

The point of that example is: if word $k$ is more correlated with word $i$ than with word $j$, then $P_{ik}/P_{jk}$ will be large — the bigger the gap in correlation, the larger the ratio. Conversely, if $k$ is less correlated with $i$ than with $j$, $P_{ik}/P_{jk}$ will be small — the bigger the gap, the smaller the ratio. And if $k$ is unrelated to both $i$ and $j$, $P_{ik}/P_{jk}$ will be close to 1.

The discussion above shows that word vectors should be learned from ratios of co-occurrence probabilities rather than from the probabilities themselves. So suppose the word vectors are learned through some function $F$, which can be abstracted as:

$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$

where the $w$'s are word vectors and $\frac{P_{ik}}{P_{jk}}$ can be computed from the corpus. The function $F$ depends on parameters that are not yet specified, but several necessary conditions allow $F$ to be pinned down further.

Because vector spaces have a linear structure, we can take differences of word vectors, and $F$ can be converted to:

$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$

Although $F$ could still have a complicated structure, such as a neural network, that would destroy the linear structure we are trying to capture. To avoid this, we first take an inner product of the vectors, converting the relation to:

$$F\left((w_i - w_j)^T \tilde{w}_k\right) = \frac{P_{ik}}{P_{jk}}$$

The left-hand side involves a subtraction while the right-hand side is a division, which naturally suggests exponentiation; so restrict $F$ to be the exponential function, giving:

$$\exp(w_i^T \tilde{w}_k - w_j^T \tilde{w}_k) = \frac{\exp(w_i^T \tilde{w}_k)}{\exp(w_j^T \tilde{w}_k)} = \frac{P_{ik}}{P_{jk}}$$

Now it suffices to make the numerators and the denominators on the two sides equal, namely:

$$\exp(w_i^T \tilde{w}_k) = P_{ik}, \qquad \exp(w_j^T \tilde{w}_k) = P_{jk}$$

Further, extending this to all the words in the corpus, consider $\exp(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}$, i.e.:

$$w_i^T \tilde{w}_k = \log\left(\frac{X_{ik}}{X_i}\right) = \log X_{ik} - \log X_i$$

Since the left-hand side $w_i^T \tilde{w}_k$ is unchanged when $i$ and $k$ are exchanged — it is symmetric — two bias terms are introduced to make the right-hand side symmetric as well:

$$w_i^T \tilde{w}_k = \log X_{ik} - b_i - b_k$$

Here $\log X_i$ has been absorbed into $b_i$. The model's goal is now to learn word vector representations that bring the two sides of this equation as close as possible, so we take the squared difference as the objective function:

$$J = \sum_{i,k=1}^{V} \left(w_i^T \tilde{w}_k + b_i + b_k - \log X_{ik}\right)^2$$

But this objective function has a drawback: it gives all co-occurring word pairs the same weight. The author therefore further modifies the objective, using the co-occurrence statistics of word pairs in the corpus to set their weights, as follows:

$$J = \sum_{i,k=1}^{V} f(X_{ik}) \left(w_i^T \tilde{w}_k + b_i + b_k - \log X_{ik}\right)^2$$

Here $V$ is the vocabulary size, and the weight function $f$ must have the following properties:

- $f(0) = 0$: when the co-occurrence count of a pair is 0, its weight should be 0
- $f(x)$ must be non-decreasing, so that a larger co-occurrence count never receives a smaller weight
- for overly frequent pairs, $f(x)$ should give a relatively small value, so that they are not over-weighted

Combining these three properties, the author proposes the following weight function:

$$f(x) = \begin{cases} (x / x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$$

In the experiments the author sets $x_{max} = 100$ and finds that $\alpha = 3/4$ works better. (The plot of the function does not survive in this copy; $f$ rises from 0 and saturates at 1 once $x$ reaches $x_{max}$.)
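
The weight function and the weighted objective $J$ can be sketched together in a few lines (a minimal numpy version, not the authors' implementation; `x_max` and `alpha` follow the values above):

```python
import numpy as np

def f(x, x_max=100, alpha=0.75):
    """GloVe weight: (x / x_max)^alpha below x_max, capped at 1 above it."""
    return np.minimum((np.asarray(x, dtype=float) / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100, alpha=0.75):
    """Weighted least-squares objective J over nonzero co-occurrences only."""
    loss = 0.0
    for i, k in zip(*np.nonzero(X)):
        diff = W[i] @ W_tilde[k] + b[i] + b_tilde[k] - np.log(X[i, k])
        loss += f(X[i, k], x_max, alpha) * diff ** 2
    return loss

print(f([0, 1, 10, 100, 1000]))  # rises from 0 and saturates at 1
```

Note the loop skips zero entries of $X$, matching the property $f(0)=0$: pairs that never co-occur contribute nothing to the objective.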

## 3、Summary

This concludes the introduction to the principles of GloVe. The author starts from an initial conjecture and step by step simplifies it into the model's computational objective; in hindsight GloVe's objective function is not hard to compute, but it would be hard to come up with such an objective from scratch. Two points to summarize:

- GloVe combines the advantages of global co-occurrence statistics and local context-window methods, and can be seen as a synthesis of the two mainstream approaches; compared with global matrix factorization, GloVe does not need to consider word pairs whose co-occurrence count is 0, which greatly reduces the amount of data and storage required
- however, GloVe takes corpus co-occurrence counts as the target of word vector learning; when the corpus is small, some word pairs may co-occur rarely, which the author believes may mislead the training direction of the word vectors



Copyright notice

This article was written by [osc_ boe54hqe]; when reposting, please include a link to the original. Thanks.

https://cdmana.com/2020/12/20201224095702761r.html