
Stanford NLP Course | Lecture 11 - Convolutional Neural Networks in NLP

ShowMeAI research center


ShowMeAI has translated and annotated all of the slides for Stanford's CS224n course, Natural Language Processing with Deep Learning, and turned them into animated GIFs!

This lecture's content also has an in-depth summary tutorial, which can be viewed here. See the end of the article for how to obtain the videos, slides, and other materials.


Introduction

Convolutional Neural Networks in NLP

Lecture plan

Lecture plan

  • Announcements
  • Intro to CNNs / introduction to convolutional neural networks
  • Simple CNN for Sentence Classification: Yoon Kim (2014) / using CNNs for text classification
  • CNN potpourri / CNN details
  • Deep CNN for Sentence Classification: Conneau et al. (2017) / deep CNNs for text classification
  • Quasi-recurrent Neural Networks / the Q-RNN model

Welcome to the second half of the course !

 Welcome to the second half of the course !

  • We are now preparing you to be DL+NLP researchers / practitioners

  • The lectures won't always have all the details

    • It's up to you to search online / read papers to learn more
    • This is an active research area, and sometimes there is no clear answer
    • The course staff are happy to discuss things with you, but you need to think for yourself
  • Assignments are designed to prepare you for the real difficulties of the final project

    • Each assignment deliberately has less helper material than the previous one
    • For the project, no autograder or sanity checks are provided
    • Debugging DL code is hard, but you need to learn how to do it!

1. Introduction to Convolutional Neural Networks

(For convolutional neural networks you can also refer to ShowMeAI's summary of Andrew Ng's course: Deep Learning Course | Convolutional Neural Networks.)

1.1 From RNNs to CNNs

From RNNs to CNNs

  • A recurrent neural network cannot capture a phrase without its preceding context
  • The final vector often captures too much of the last words' content
  • For example, the softmax is often computed only at the last time step

From RNNs to CNNs

  • Main idea of CNNs / ConvNets:
    • What if we computed a vector for every word subsequence of a certain length?
  • Example: tentative deal reached to keep government open
  • The vectors computed would be for:
    • tentative deal reached, deal reached to, reached to keep, to keep government, keep government open
  • Regardless of whether each phrase is grammatical
  • Not very plausible linguistically or cognitively
  • Then group them (we'll see how soon)

1.2 CNN Convolutional neural networks

CNN Convolutional neural networks

1.3 What is a convolution?

What is a convolution?

  • A one-dimensional discrete convolution is generally defined as $(f \ast g)[n]=\sum_{m=-M}^{M} f[n-m] g[m]$ (see the NumPy sketch after this list)
  • Convolution is classically used to extract features from images
    • It lets a model detect patterns in a position-invariant way
    • You can refer to Stanford's deep learning for computer vision course, CS231n (ShowMeAI also has a series of CS231n notes)
  • Two-dimensional example:
    • The yellow and red numbers show the filter (= kernel) weights
    • Green shows the input
    • Pink shows the output
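To make the formula concrete, here is a minimal NumPy sketch; the signal and filter values are made up for illustration.

import numpy as np

# A toy signal and a length-3 filter (kernel); values are arbitrary.
f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
g = np.array([0.5, 1.0, 0.5])

# np.convolve flips the kernel and slides it over the signal,
# matching (f * g)[n] = sum_m f[n - m] g[m].
out = np.convolve(f, g, mode="valid")
print(out)  # [4. 6. 8.]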

1.4 One-dimensional convolution for text

One-dimensional convolution for text

  • Apply a 1D convolution to the text

1.5 One-dimensional convolution for text, with padding

One-dimensional convolution for text, with padding

  • The input is a word sequence of length $L$
    • Suppose the word embedding dimension is 4, i.e. 4 input channels
    • Convolving with a single filter gives 1 output channel
  • With multiple filters we get multiple output channels, each attending to different latent features of the text (see the sketch below)
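Here is a minimal PyTorch sketch of the setup above; the sentence length of 7 is an assumption for illustration.

import torch
import torch.nn as nn

# A sentence of 7 words, each a 4-dimensional embedding -> 4 input channels.
x = torch.randn(1, 4, 7)  # (batch, channels = word dim, sequence length)

conv = nn.Conv1d(in_channels=4, out_channels=1, kernel_size=3)
conv_pad = nn.Conv1d(in_channels=4, out_channels=3, kernel_size=3, padding=1)

print(conv(x).shape)      # torch.Size([1, 1, 5]): no padding, length shrinks to 7 - 3 + 1
print(conv_pad(x).shape)  # torch.Size([1, 3, 7]): padding=1 keeps length 7; 3 filters -> 3 channels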

1.6 conv1d, padded, with max pooling over time

conv1d, padded, with max pooling over time

  • Average pooling takes the mean over the feature map instead of the max

1.7 PyTorch implementation

PyTorch implementation

  • In the PyTorch implementation, the parameters correspond directly to the details described above:
import torch
from torch.nn import Conv1d

batch_size = 16
word_embed_size = 4
seq_len = 7
input = torch.randn(batch_size, word_embed_size, seq_len)
conv1 = Conv1d(in_channels=word_embed_size, out_channels=3, kernel_size=3)  # can add: padding=1
hidden1 = conv1(input)                  # (16, 3, 5)
hidden2, _ = torch.max(hidden1, dim=2)  # max pool over time -> (16, 3)

1.8 Stride (here 2)

CNN stride

  • Striding reduces the amount of computation (see the sketch below)
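A minimal sketch of a strided convolution in PyTorch; the shapes are the same toy dimensions as above.

import torch
import torch.nn as nn

x = torch.randn(1, 4, 7)  # (batch, word dim, sentence length)

# stride=2 moves the window two positions at a time, roughly halving the output length
conv_stride = nn.Conv1d(in_channels=4, out_channels=3, kernel_size=3, stride=2)
print(conv_stride(x).shape)  # torch.Size([1, 3, 3])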

1.9 Local max pooling

Other concepts: local max pooling, stride = 2

  • Taking the max over every two time steps is called local max pooling with stride 2 (see the sketch below)
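A minimal sketch of local max pooling with stride 2 in PyTorch; the tensor shape is an arbitrary example.

import torch
import torch.nn as nn

hidden = torch.randn(1, 3, 6)    # (batch, channels, time), e.g. a conv1d output
local_pool = nn.MaxPool1d(kernel_size=2, stride=2)
print(local_pool(hidden).shape)  # torch.Size([1, 3, 3]): max over every 2 time steps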

1.10 conv1d with k-max pooling over time

conv1d, k-max pooling over time, k = 2

  • Keep the top-k activation values over all time steps for each channel, preserving their original order (e.g. -0.2 and 0.3 in the example above); a sketch follows below
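PyTorch has no built-in k-max pooling layer, so a common sketch is to take the top-k values along the time dimension and re-sort their indices to preserve the original order (the helper name kmax_pooling is just for this example).

import torch

def kmax_pooling(x, k, dim=2):
    # Take the top-k activations along the time dimension, then re-sort
    # the selected positions so they keep their original temporal order.
    idx = x.topk(k, dim=dim).indices.sort(dim=dim).values
    return x.gather(dim, idx)

hidden = torch.randn(1, 3, 6)           # (batch, channels, time)
print(kmax_pooling(hidden, k=2).shape)  # torch.Size([1, 3, 2])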

1.11 Dilated convolution: dilation = 2

Other concepts: dilation = 2

Dilated convolution (also called atrous convolution)

  • In the example above, rows 1, 3, and 5 are convolved together; two filters give two channels of activations
  • Spreading a size-3 kernel over a span of 5 positions achieves this: the weight matrix stays small, while a single convolution sees a wider stretch of the sentence (see the sketch below)
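A minimal sketch of a dilated convolution in PyTorch, matching the example above (kernel size 3, dilation 2, two filters).

import torch
import torch.nn as nn

x = torch.randn(1, 4, 7)  # (batch, word dim, sentence length)

# dilation=2 skips every other position, so a size-3 kernel spans 5 words
conv_dilated = nn.Conv1d(in_channels=4, out_channels=2, kernel_size=3, dilation=2)
print(conv_dilated(x).shape)  # torch.Size([1, 2, 3])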

Supplementary notes / summary

  • In CNNs, how much of the sentence each unit can see at once is a very important concept
  • You can enlarge this view with bigger filters, dilated convolutions, or a deeper stack of convolutions (more layers)

2. Applying CNNs to Text Classification

2.1 A single-layer CNN for sentence classification

A single-layer CNN for sentence classification

  • Goal: sentence classification
    • Mainly judging whether a sentence expresses positive or negative sentiment
    • Other tasks:
      • Judging whether a sentence is subjective or objective
      • Question classification: what kind of entity is the question about? A person, a place, a number, ...

A single-layer CNN for sentence classification

  • A simple use of one convolutional layer and one pooling layer
  • Word vectors: $\mathbf{x}_{i} \in \mathbb{R}^{k}$
  • Sentence: $\mathbf{x}_{1:n}=\mathbf{x}_{1} \oplus \mathbf{x}_{2} \oplus \cdots \oplus \mathbf{x}_{n}$ (vectors concatenated)
  • Concatenation of the words in a range: $\mathbf{x}_{i:i+j}$ (symmetric ranges are more common)
  • Convolutional filter: $\mathbf{w} \in \mathbb{R}^{hk}$ (it acts over a window of $h$ words)
  • Note that the filter is a vector; its size can be 2, 3, or 4

2.2 Single-layer CNN

Single-layer CNN

  • The filter $\mathbf{w}$ is applied to all possible windows (concatenated vectors)
  • The CNN layer computes a feature (one channel) as
$c_{i}=f\left(\mathbf{w}^{T} \mathbf{x}_{i:i+h-1}+b\right)$
  • Sentence: $\mathbf{x}_{1:n}=\mathbf{x}_{1} \oplus \mathbf{x}_{2} \oplus \ldots \oplus \mathbf{x}_{n}$

  • All possible windows of length $h$: $\left\{\mathbf{x}_{1:h}, \mathbf{x}_{2:h+1}, \dots, \mathbf{x}_{n-h+1:n}\right\}$

  • The result is a feature map $\mathbf{c}=\left[c_{1}, c_{2}, \dots, c_{n-h+1}\right] \in \mathbb{R}^{n-h+1}$

2.3 Pooling and number of channels

 Pooling and number of channels

  • Pooling: a max-over-time pooling layer
  • Idea: capture the most important activation (the maximum over time)
  • Take the feature map $\mathbf{c}=\left[c_{1}, c_{2}, \dots, c_{n-h+1}\right] \in \mathbb{R}^{n-h+1}$
  • and pool it down to a single number $\hat{c}=\max \{\mathbf{c}\}$
  • Use multiple filter weights $\mathbf{w}$
  • Different window sizes $h$ are useful
  • Because of max pooling $\hat{c}=\max \{\mathbf{c}\}$, the length of $\mathbf{c}=\left[c_{1}, c_{2}, \dots, c_{n-h+1}\right] \in \mathbb{R}^{n-h+1}$ is irrelevant
  • So we can have some filters that look at unigrams, bigrams, trigrams, 4-grams, etc.

2.4 Multi-channel input data

Multi-channel input data

  • Initialize with pre-trained word vectors (word2vec or GloVe)
  • Start with two copies
  • Backpropagate into only one copy, keeping the other static
  • Both channel sets are added to $c_i$ before max pooling

2.5 Classification after one CNN layer

Classification after one CNN layer

  • First one convolution, followed by one max pooling
  • To obtain the final feature vector $\mathbf{z}=\left[\hat{c}_{1}, \dots, \hat{c}_{m}\right]$
    • assuming we have $m$ filters $\mathbf{w}$
    • e.g. 100 feature maps each for sizes 3, 4, and 5
  • Finally, a simple softmax layer: $y=\operatorname{softmax}\left(W^{(S)} z+b\right)$

Supplementary explanation

  • arxiv.org/pdf/1510.03…
  • The input is 7 words, each of dimension 5, i.e. a $7 \times 5$ input matrix
  • Filter sizes (2, 3, 4) are used, with two filters per size, giving two feature channels per size and 6 filters in total
  • Each filter's feature map goes through 1-max pooling; the results are concatenated into a 6-dimensional vector and passed through a softmax to get the binary classification result (a PyTorch sketch of this architecture follows below)
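Below is a minimal PyTorch sketch of this single-layer CNN classifier, following the figure's settings (filter sizes 2, 3, 4 with two filters each, two classes); the class name TextCNN and all dimensions are illustrative, not the paper's exact code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    # One convolution per filter size, 1-max pooling over time, then a linear layer.
    def __init__(self, embed_dim=5, filter_sizes=(2, 3, 4), n_filters=2, n_classes=2):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, kernel_size=h) for h in filter_sizes
        )
        self.dropout = nn.Dropout(p=0.5)
        self.fc = nn.Linear(n_filters * len(filter_sizes), n_classes)

    def forward(self, x):                                   # x: (batch, embed_dim, seq_len)
        feats = [F.relu(conv(x)) for conv in self.convs]    # feature maps per filter size
        pooled = [f.max(dim=2).values for f in feats]       # 1-max pooling over time
        z = torch.cat(pooled, dim=1)                        # (batch, 6)
        return self.fc(self.dropout(z))                     # logits; softmax applied in the loss

x = torch.randn(8, 5, 7)   # 8 sentences, word dim 5, length 7
print(TextCNN()(x).shape)  # torch.Size([8, 2])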

2.6 Regularization

Regularization

  • Use dropout: create a mask vector $r$ of Bernoulli random variables (each entry is 0 or 1, and is 1 with probability $p$, a hyperparameter); see the sketch after this list
  • Delete features during training:
$y=\operatorname{softmax}\left(W^{(S)}(r \circ z)+b\right)$
  • Reasoning: this prevents co-adaptation (overfitting to particular constellations of features)
  • At test time no dropout is applied; instead, scale the final weights by $p$:
$\hat{W}^{(S)}=p W^{(S)}$
  • In addition: constrain the L2 norm of each class's weight vector (each row of the softmax weights $W^{(S)}$) to a fixed value $s$ (also a hyperparameter)
  • If $\left\|W_{c}^{(S)}\right\|>s$, rescale so that $\left\|W_{c}^{(S)}\right\|=s$
    • This is not very common
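A minimal sketch of these two tricks, assuming a 6-dimensional feature vector, 2 classes, and no bias term; note that PyTorch's nn.Dropout uses an inverted-scaling variant rather than the test-time weight scaling shown here.

import torch

p, s = 0.5, 3.0
W = torch.randn(2, 6)   # softmax weights W^(S): (n_classes, feature_dim)
z = torch.randn(6)      # feature vector z from the CNN

# Training: mask features with a Bernoulli vector r (each entry is 1 with probability p).
r = torch.bernoulli(torch.full_like(z, p))
y_train = torch.softmax(W @ (r * z), dim=0)

# Test: no mask; scale the weights by p instead.
y_test = torch.softmax((p * W) @ z, dim=0)

# Max-norm constraint: rescale any row of W whose L2 norm exceeds s.
norms = W.norm(dim=1, keepdim=True)
W = torch.where(norms > s, W * (s / norms), W)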

3. CNN Details

3.1 Discussion of CNN hyperparameters

All hyperparameters in Kim (2014)

  • Hyperparameters tuned on the validation (dev) set
  • Activation function: ReLU
  • Window filter sizes $h = 3, 4, 5$
  • 100 feature maps per filter size
  • Dropout $p = 0.5$
    • Kim (2014) reports that dropout improved accuracy by 2-4%
  • L2 constraint on the softmax rows: $s = 3$
  • Mini-batch size for SGD training: 50
  • Word vectors: pre-trained with word2vec, $k = 300$
  • During training, keep checking validation-set performance and select the weights with the highest accuracy for the final evaluation

3.2 Experimental results

Experiments

  • Experimental results under different parameter settings

3.3 Comparing CNNs and RNNs

Problem with comparison?

  • Dropout gives a 2-4% accuracy improvement
  • But several of the comparison systems did not use dropout and might gain just as much from it
  • Still regarded as a remarkable result for such a simple architecture
  • Differences from the window and RNN architectures described in earlier lectures: pooling, many filters, and dropout
  • Some of these ideas can also be used in RNNs

3.4 Model comparison

Model comparison: Our growing toolkit

  • Bag of Vectors: a surprisingly good baseline for simple classification problems, especially when followed by a few ReLU layers (see: Deep Averaging Networks)
  • Window Model: good for single-word classification problems that do not need wide context (i.e. local problems), e.g. POS tagging and NER
  • CNNs: good at classification; shorter phrases need zero padding; harder to interpret; easy to parallelize on GPUs
  • RNNs: cognitively more plausible (left-to-right reading); not best for classification (if you use only the last state); much slower than CNNs; good for sequence tagging, classification, and language modeling; excellent when combined with attention

Supplementary explanation

  • RNNs work well for sequence tagging and classification, and for next-word prediction in language models; combined with attention they do very well, but for capturing an overall representation of a sentence, CNNs do better

3.5 Skip connections and vertical gating

Gated units used vertically

  • The gating / skipping idea we saw in LSTMs and GRUs is a general concept that is now used in many places
  • You can also use gating vertically (across layers)
  • In fact, the key concept, summing the candidate update with a shortcut connection, is what makes very deep networks work
  • Note: when adding them, pad $x$ so it has the same dimensions as the conv output before summing (a sketch follows below)
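A minimal sketch of such a vertical skip connection in PyTorch; the channel count and sequence length are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualConvBlock(nn.Module):
    # padding=1 keeps the sequence length, so x and conv(x) have matching shapes.
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                 # x: (batch, channels, seq_len)
        return F.relu(self.conv(x)) + x   # candidate update + shortcut connection

x = torch.randn(16, 8, 10)
print(ResidualConvBlock(8)(x).shape)      # torch.Size([16, 8, 10])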

3.6 Batch Normalization (BatchNorm)

Batch Normalization (BatchNorm)

  • Commonly used in CNNs
  • Transforms the convolution output of a mini-batch by scaling the activations to zero mean and unit variance
    • This is the familiar Z-transform from statistics
    • But it is updated for every mini-batch, so fluctuations have little effect
  • Using BatchNorm makes the model much less sensitive to parameter initialization, since outputs are automatically rescaled
    • It also makes tuning the learning rate easier and training more stable
  • PyTorch: nn.BatchNorm1d (see the sketch below)
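A minimal usage sketch (all shapes are arbitrary): nn.BatchNorm1d is placed right after the convolution, before the nonlinearity.

import torch
import torch.nn as nn

x = torch.randn(16, 4, 7)  # (batch, channels, seq len)
block = nn.Sequential(
    nn.Conv1d(4, 3, kernel_size=3, padding=1),
    nn.BatchNorm1d(3),     # normalizes each of the 3 output channels over the mini-batch
    nn.ReLU(),
)
print(block(x).shape)      # torch.Size([16, 3, 7])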

3.7 1x1 Convolution

1 x 1 Convolutions

  • Do 1x1 convolutions make sense? Yes.
  • A 1x1 convolution, i.e. a Network-in-Network (NiN) connection, is a convolutional kernel with kernel size 1
  • A 1x1 convolution gives you a fully connected linear layer across channels
  • It can be used to map from many channels down to fewer channels
  • 1x1 convolutions add extra neural network layers with very few additional parameters
    • unlike fully connected (FC) layers, which add a large number of parameters (see the sketch below)
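A minimal sketch showing the parameter count; the channel numbers are made up for illustration.

import torch
import torch.nn as nn

x = torch.randn(16, 100, 7)  # 100 channels, sequence length 7

# kernel_size=1: a position-wise fully connected layer across channels,
# mapping 100 channels down to 25 with only 100*25 + 25 parameters.
conv1x1 = nn.Conv1d(in_channels=100, out_channels=25, kernel_size=1)
print(conv1x1(x).shape)                               # torch.Size([16, 25, 7])
print(sum(p.numel() for p in conv1x1.parameters()))   # 2525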

3.8 CNN application: machine translation

CNN application: machine translation

  • One of the earliest successful uses of neural networks for machine translation
  • Uses a CNN for encoding and an RNN for decoding
  • Kalchbrenner and Blunsom (2013), Recurrent Continuous Translation Models

3.9 Paper notes: Learning Character-level Representations for Part-of-Speech Tagging

Paper notes: Learning Character-level Representations for Part-of-Speech Tagging

  • Convolution over characters to generate word embeddings
  • A fixed window of word embeddings is then used for POS tagging

3.10 Paper notes: Character-Aware Neural Language Models

Paper notes: Character-Aware Neural Language Models

  • Character-based word embeddings
  • Uses convolutions, a highway network, and an LSTM

4. Deep CNNs for Text Classification

4.1 Deep convolutional networks for text classification

Deep convolutional networks for text classification

  • Starting point: sequence models (LSTMs) dominate NLP, along with CNNs, attention, and so on, but none of these models are very deep, unlike the deep models in computer vision
  • What happens when we build a vision-like deep system for NLP?
  • It works from the character level

4.2 VD-CNN structure

VD-CNN structure

  • The overall system looks a bit like the vision models VGG and ResNet
  • It does not look much like a typical deep-learning NLP system
  • The input is of fixed size, because the text is truncated or padded to a uniform length
  • Each stage has a local pooling operation, and the number of feature maps doubles

4.3 The VD-CNN convolutional block

Convolutional block in VD-CNN

  • Each convolutional block is two convolutional layers, each followed by BatchNorm and a ReLU (see the sketch below)
  • Kernel size 3
  • Padding preserves the dimension (or halves it, when local pooling is applied)
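A sketch of one such block in PyTorch, following the description above; the channel count and sequence length are placeholders.

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # Two (Conv1d, BatchNorm, ReLU) stacks; kernel size 3, padding=1 preserves the length.
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
        )

    def forward(self, x):          # x: (batch, channels, seq_len)
        return self.block(x)

x = torch.randn(16, 64, 128)
print(ConvBlock(64)(x).shape)      # torch.Size([16, 64, 128])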

4.4 Experimental results

Experimental results

  • Uses large text classification datasets
    • Much larger than the small datasets often used in NLP, e.g. in Yoon Kim (2014)

Supplementary explanation

  • The numbers above are error rates, so lower is better
  • Deeper networks get better results, and residual layers help, but making the network ever deeper eventually stops paying off
  • Experiments show that max pooling works better than the other two downsampling methods, k-max pooling and strided convolutions
  • ConvNets can give us a good text classification system


4.5 RNNs are slow

RNNs are slow

  • RNNs are a very standard building block for deep NLP
  • But they parallelize poorly, and so they are slow
  • Idea: take the best, parallelizable parts of RNNs and CNNs

5. The Q-RNN Model

5.1 Quasi-Recurrent Neural Network

Quasi-Recurrent Neural Network

  • Tries to combine the best of both model families
  • Convolutions, parallel across time, compute the candidate update, the forget gate, and the output gate:

$$\begin{aligned} \mathbf{z}_{t} &=\tanh \left(\mathbf{W}_{z}^{1} \mathbf{x}_{t-1}+\mathbf{W}_{z}^{2} \mathbf{x}_{t}\right) \\ \mathbf{f}_{t} &=\sigma\left(\mathbf{W}_{f}^{1} \mathbf{x}_{t-1}+\mathbf{W}_{f}^{2} \mathbf{x}_{t}\right) \\ \mathbf{o}_{t} &=\sigma\left(\mathbf{W}_{o}^{1} \mathbf{x}_{t-1}+\mathbf{W}_{o}^{2} \mathbf{x}_{t}\right) \end{aligned}$$

$$\begin{aligned} \mathbf{Z} &=\tanh \left(\mathbf{W}_{z} * \mathbf{X}\right) \\ \mathbf{F} &=\sigma\left(\mathbf{W}_{f} * \mathbf{X}\right) \\ \mathbf{O} &=\sigma\left(\mathbf{W}_{o} * \mathbf{X}\right) \end{aligned}$$

  • An element-wise gated pseudo-recurrence, parallel across channels, is computed in the pooling layer (a PyTorch sketch follows):

$$\mathbf{h}_{t}=\mathbf{f}_{t} \odot \mathbf{h}_{t-1}+\left(1-\mathbf{f}_{t}\right) \odot \mathbf{z}_{t}$$
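A minimal sketch of a QRNN layer with f-pooling, following the equations above; the output gate o is omitted here, and the layer sizes are placeholders.

import torch
import torch.nn as nn

class QRNNLayer(nn.Module):
    # Convolutions over time compute z and f for every step in parallel;
    # the "pooling" step is a cheap element-wise recurrence.
    def __init__(self, input_dim, hidden_dim, kernel_size=2):
        super().__init__()
        self.conv = nn.Conv1d(input_dim, 2 * hidden_dim, kernel_size,
                              padding=kernel_size - 1)
        self.hidden_dim = hidden_dim

    def forward(self, x):                        # x: (batch, input_dim, seq_len)
        seq_len = x.size(2)
        gates = self.conv(x)[:, :, :seq_len]     # trim the right side so the conv stays causal
        z, f = gates.chunk(2, dim=1)
        z, f = torch.tanh(z), torch.sigmoid(f)
        h = torch.zeros(x.size(0), self.hidden_dim, device=x.device)
        outputs = []
        for t in range(seq_len):                 # element-wise only, so each step is cheap
            h = f[:, :, t] * h + (1 - f[:, :, t]) * z[:, :, t]
            outputs.append(h)
        return torch.stack(outputs, dim=2)       # (batch, hidden_dim, seq_len)

x = torch.randn(8, 4, 10)
print(QRNNLayer(4, 16)(x).shape)                 # torch.Size([8, 16, 10])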

5.2 Q-RNN experiments: language modeling

Q-RNN experiments: language modeling

5.3 Q-RNNs: Sentiment analysis

Q-RNNs: Sentiment analysis

  • Often better and faster than LSTMs
  • More interpretable

5.4 Limitations of QRNNs

Limitations of QRNNs

  • For character-level LMs, they do not work as well as LSTMs
    • They struggle with the longer dependencies involved in that kind of modeling
  • They usually need a deeper network to match the performance of an LSTM
    • Even when made deeper, they are still faster
    • They effectively use depth as a substitute for true recurrence

5.5 Drawbacks of RNNs & motivation for the Transformer

Drawbacks of RNNs & motivation for the Transformer

  • We want parallelization to speed things up, but RNNs are inherently sequential
  • Even with GRUs and LSTMs, RNNs need the attention mechanism to capture long-range dependencies, and the computation path still grows with the sequence length
  • But if attention itself gives us access to information at any position, maybe we don't need the RNN at all?

6. Video tutorial

You can click through to Bilibili to watch the video version with bilingual subtitles.

7. References

Recommended ShowMeAI tutorial series

ShowMeAI: accelerating every step of technical growth with knowledge
