# Stanford NLP Course | Lecture 11 - Convolutional Neural Networks in NLP

ShowMeAI has produced Chinese translations and annotations for all the slides of the Stanford CS224n course "Natural Language Processing with Deep Learning", and turned them into animated GIFs!

This lecture's in-depth summary tutorial can be viewed here. See the end of the article for how to obtain the videos, slides, and other materials.

# introduction

## Lecture plan

• Announcements
• Intro to CNNs / Introduction to convolutional neural networks
• Simple CNN for Sentence Classification: Yoon Kim (2014) / Applying CNNs to text classification
• CNN potpourri / CNN details
• Deep CNN for Sentence Classification: Conneau et al. (2017) / Deep CNNs for text classification
• Quasi-recurrent Neural Networks / The Q-RNN model

## Welcome to the second half of the course!

• From now on, we are preparing you to be DL+NLP researchers / practitioners

• The course won't always have all the details

• It's up to you to search online / read the literature to learn more
• This is an active research area; sometimes there is no clear-cut answer
• The course staff are happy to discuss things with you, but you need to think for yourself
• The assignments are designed to prepare you for the real difficulties of a project

• Each assignment deliberately provides less scaffolding than the previous one
• For the project, no autograder or sanity check is provided
• DL debugging is hard, but you need to learn how to do it!

# 1. Introduction to Convolutional Neural Networks

(For convolutional neural networks, you can also refer to ShowMeAI's summary of Andrew Ng's course: Deep Learning Course | Convolutional Neural Networks)

## 1.1 From RNNs to CNNs

• Recurrent neural networks cannot capture phrases without their preceding context
• The final vector often captures too much of its information from the last few words
• For example, the softmax is usually computed only at the last time step

• Main idea of CNNs / ConvNets
• What if we computed a vector for every word subsequence of a certain length?
• For example: "tentative deal reached to keep government open"
• The computed vectors would be for
• "tentative deal reached", "deal reached to", "reached to keep", "to keep government", "keep government open"
• Regardless of whether the phrase is grammatical
• Not very plausible linguistically or cognitively
• Then group them (more on that soon)

## 1.3 What is a convolution?

• A one-dimensional discrete convolution is generally defined as: $(f \ast g)[n]=\sum_{m=-M}^{M} f[n-m] g[m]$
• Convolution is classically used to extract features from images
• The model recognizes features in a position-invariant way
• You can refer to the Stanford deep learning and computer vision course CS231n (you can also look up ShowMeAI's series of CS231n notes)
• Two-dimensional example:
• The yellow and red numbers show the filter (= kernel) weights
• Green shows the input
• Pink shows the output

## 1.4 1D convolution for text

• Apply a 1-dimensional convolution to the text

## 1.5 1D convolution for text, with padding

• The input is a word sequence of length $L$
• Suppose the word embedding dimension is 4, i.e. there are 4 channels
• After the convolution we get 1 output channel
• With multiple filters we get multiple output channels, each attending to different latent features of the text

## 1.6 conv1d, padded, with max-over-time pooling

• Average pooling takes the average over the feature map instead of the max

## 1.7 PyTorch implementation

• In the PyTorch implementation, the parameters correspond nicely to the details described above:
```python
import torch
from torch import nn

batch_size = 16
word_embed_size = 4
seq_len = 7
input = torch.randn(batch_size, word_embed_size, seq_len)  # (batch, channels, time)
conv1 = nn.Conv1d(in_channels=word_embed_size, out_channels=3, kernel_size=3)  # can add: padding=1
hidden1 = conv1(input)                  # (16, 3, 5)
hidden2, _ = torch.max(hidden1, dim=2)  # max pool over time -> (16, 3)
```

## 1.8 Stride (here: 2)

• A stride reduces the amount of computation by skipping positions (see the sketch below)
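
A minimal PyTorch sketch of a strided 1D convolution, reusing the illustrative sizes from section 1.7; with stride 2, the filter is applied only at every other position, so the output is roughly half as long.

```python
import torch
from torch import nn

x = torch.randn(16, 4, 7)   # (batch, word_embed_size, seq_len), as in section 1.7

# stride=2 applies the size-3 filter at every other position only,
# roughly halving the output length and the amount of computation.
strided_conv = nn.Conv1d(in_channels=4, out_channels=3, kernel_size=3, stride=2, padding=1)
out = strided_conv(x)       # shape: (16, 3, 4)
```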

## 1.9 Local max pooling

• Taking the max over every two time steps is called local max pooling with stride 2 (see the sketch below)
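
A minimal sketch of stride-2 local max pooling in PyTorch, applied to a (batch, channels, time) tensor such as a conv1d output; all sizes are illustrative.

```python
import torch
from torch import nn

hidden = torch.randn(16, 3, 6)                     # (batch, channels, time)

# Local max pooling with stride 2: take the max over every two adjacent time steps.
local_pool = nn.MaxPool1d(kernel_size=2, stride=2)
out = local_pool(hidden)                           # shape: (16, 3, 3)
```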

## 1.10 1D convolution with k-max pooling over time

• For each channel, keep the top-k activation values over all time steps, in their original order (in the previous example, -0.2 and 0.3); a sketch follows below
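
PyTorch has no built-in k-max pooling layer, so here is a minimal sketch (the function name `kmax_pooling` and the sizes are illustrative) that keeps the top-k activations per channel while preserving their original temporal order.

```python
import torch

def kmax_pooling(x, k, dim=2):
    # Keep the k largest activations along the time dimension,
    # then re-sort their indices so the original temporal order is preserved.
    topk_idx = x.topk(k, dim=dim).indices      # indices of the k largest values
    topk_idx = topk_idx.sort(dim=dim).values   # restore original order
    return x.gather(dim, topk_idx)

# e.g. on a conv output of shape (batch, channels, time):
hidden = torch.randn(16, 3, 6)
out = kmax_pooling(hidden, k=2)                # shape: (16, 3, 2)
```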

## 1.11 Dilated convolution: dilation = 2

Dilated (atrous) convolution

• In the example above, rows 1, 3 and 5 are convolved, and two filters yield two channels of activation values
• You could get a similar effect by enlarging the kernel from 3 to 5, but dilation keeps the weight matrix small while still letting a single convolution see a wider span of the sentence (see the sketch below)
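
A minimal PyTorch sketch of a dilated 1D convolution with dilation 2 (illustrative sizes): a size-3 kernel then covers positions t, t+2, t+4, i.e. a span of 5 words with only 3 weights per channel.

```python
import torch
from torch import nn

x = torch.randn(16, 4, 7)  # (batch, word_embed_size, seq_len)

# dilation=2 spaces the 3 kernel taps two positions apart,
# giving a receptive field of 5 words with only 3 weights per channel.
dilated_conv = nn.Conv1d(in_channels=4, out_channels=2, kernel_size=3, dilation=2)
out = dilated_conv(x)      # shape: (16, 2, 3)
```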

Supplementary notes / summary

• In CNNs, how much of a sentence one layer can see at a time is a very important concept
• You can increase it with bigger filters, dilated convolutions, or a deeper network (more layers)

# 2. Applying CNNs to Text Classification

## 2.1 A single-layer CNN for sentence classification

• Goal: sentence classification
• Mainly identifying whether a sentence expresses positive or negative sentiment
• Other tasks
• Judging whether a sentence is subjective or objective
• Question classification: what kind of entity is the question about? A person, a place, a number, ...

• A simple use of one convolution layer and one pooling layer
• Word vectors: $\mathbf{x}_{i} \in \mathbb{R}^{k}$
• Sentence: $\mathbf{x}_{1 : n}=\mathbf{x}_{1} \oplus \mathbf{x}_{2} \oplus \cdots \oplus \mathbf{x}_{n}$ (vector concatenation)
• $\mathbf{x}_{i : i+j}$ denotes the concatenation of the words in that range (a symmetric window is more common)
• Convolution kernel $\mathbf{w} \in \mathbb{R}^{h k}$ (applied over a window of $h$ words)
• Note that the filter is a vector; its size $h$ can be 2, 3 or 4

## 2.2 Single-layer CNN

• The filter $\mathbf{w}$ is applied to all possible windows (concatenated vectors)
• A feature (one channel) of the CNN layer is computed as
$c_{i}=f\left(\mathbf{w}^{T} \mathbf{x}_{i : i+h-1}+b\right)$
• Sentence: $\mathbf{x}_{1 : n}=\mathbf{x}_{1} \oplus \mathbf{x}_{2} \oplus \ldots \oplus \mathbf{x}_{n}$

• All possible windows of length $h$: $\left\{\mathbf{x}_{1 : h}, \mathbf{x}_{2 : h+1}, \dots, \mathbf{x}_{n-h+1 : n}\right\}$

• The result is a feature map $\mathbf{c}=\left[c_{1}, c_{2}, \dots, c_{n-h+1}\right] \in \mathbb{R}^{n-h+1}$

## 2.3 Pooling and number of channels

• Pooling: a max-over-time pooling layer
• Idea: capture the most important activation (maximum over time)
• From the feature map $\mathbf{c}=\left[c_{1}, c_{2}, \dots, c_{n-h+1}\right] \in \mathbb{R}^{n-h+1}$
• pool to a single number $\hat{c}=\max \{\mathbf{c}\}$
• Use multiple filter weights $\mathbf{w}$
• Different window sizes $h$ are useful
• Because of the max pooling $\hat{c}=\max \{\mathbf{c}\}$, the length of $\mathbf{c}$ is irrelevant
$\mathbf{c}=\left[c_{1}, c_{2}, \dots, c_{n-h+1}\right] \in \mathbb{R}^{n-h+1}$
• So we can have some filters that look at unigrams, bigrams, tri-grams, 4-grams, etc.

## 2.4 Multi-channel input

• Initialize with pre-trained word vectors (word2vec or GloVe)
• Start with two copies
• Backpropagate into only one copy and keep the other static
• Both channel sets are added into $c_i$ before max pooling

## 2.5 Classification after a single CNN layer

• First a convolution, then a max pooling
• To obtain the final feature vector $\mathbf{z}=\left[\hat{c}_{1}, \dots, \hat{c}_{m}\right]$
• assuming we have $m$ convolution kernels (filters) $\mathbf{w}$
• Use 100 feature maps for each of the sizes 3, 4 and 5
• The final layer is a simple softmax: $y=\operatorname{softmax}\left(W^{(S)} z+b\right)$

Supplementary notes

• arxiv.org/pdf/1510.03…
• The input is a sentence of 7 words, each of dimension 5, i.e. the input matrix is $7 \times 5$
• Use different filter sizes (2, 3, 4), with two filters per size, giving two feature channels per size and 6 filters in total
• Apply 1-max pooling to each filter's feature map, concatenate to get a 6-dimensional vector, and apply softmax to obtain the binary classification result (a full sketch follows below)
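
Putting sections 2.1-2.5 together, here is a minimal sketch of such a single-layer sentence CNN in PyTorch. The class name `SentenceCNN` and all sizes (vocabulary size, embedding dimension 5, filter sizes 2/3/4 with two filters each) are illustrative and follow the small example above rather than Kim (2014)'s actual settings.

```python
import torch
from torch import nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=5,
                 filter_sizes=(2, 3, 4), num_filters=2, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One conv1d per filter size; each produces `num_filters` feature maps.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, kernel_size=h) for h in filter_sizes
        )
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        x = self.embedding(tokens).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Convolve, apply ReLU, then 1-max pooling over time for each filter size.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        z = torch.cat(pooled, dim=1)                # (batch, 6) with the sizes above
        return self.fc(z)                           # softmax is applied in the loss

model = SentenceCNN()
logits = model(torch.randint(0, 10000, (16, 7)))    # 16 sentences of 7 words each
```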

## 2.6 Regularization

• Use dropout: create a mask vector $r$ of Bernoulli random variables (each is 0 or 1, equal to $1$ with probability $p$, a hyperparameter)
• Drop features during training
$y=\operatorname{softmax}\left(W^{(S)}(r \circ z)+b\right)$
• Interpretation: prevents co-adaptation (overfitting to particular features)
• Do not apply dropout at test time; instead scale the final weights by the probability $p$
$\hat{W}^{(S)}=p W^{(S)}$
• In addition: constrain the L2 norm of each class's weight vector (each row of the softmax weights $W^{(S)}$) to be no larger than a fixed value $s$ (also a hyperparameter)
• If $\left\|W_{c}^{(S)}\right\|>s$, rescale so that $\left\|W_{c}^{(S)}\right\|=s$
• This is not very common (a sketch of both regularizers follows below)
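
A minimal PyTorch sketch of both regularizers, with illustrative layer sizes. Note that `nn.Dropout` uses inverted dropout (it scales kept units by 1/(1-p) during training and does nothing at test time), which is equivalent to the p-scaling of the weights described above; the max-norm constraint can be applied with `torch.renorm` after each optimizer step.

```python
import torch
from torch import nn

dropout = nn.Dropout(p=0.5)      # inverted dropout: scales by 1/(1-p) in training, identity at eval
classifier = nn.Linear(300, 2)   # the softmax layer with weights W^(S) (illustrative sizes)

z = torch.randn(16, 300)         # final feature vectors z for a mini-batch
logits = classifier(dropout(z))  # the softmax itself is applied inside the loss function

# Max-norm constraint: after each gradient step, rescale any row of W^(S)
# whose L2 norm exceeds s so that its norm is exactly s.
s = 3.0
with torch.no_grad():
    classifier.weight.copy_(torch.renorm(classifier.weight, p=2, dim=0, maxnorm=s))
```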

# 3. CNN Details

## 3.1 Discussion of CNN hyperparameters

• Tune hyperparameters on the validation (dev) set
• Activation function: ReLU
• Window filter sizes $h=3,4,5$
• 100 feature maps for each filter size
• Dropout $p=0.5$
• Kim (2014) reports a $2-4\%$ accuracy improvement from dropout
• L2 constraint on the softmax weights, $s=3$
• SGD training with mini-batch size $50$
• Word vectors: pre-trained with word2vec, $k=300$
• During training, keep checking performance on the validation set and pick the weights with the highest accuracy for the final evaluation

## 3.2 Experimental results

• Experimental results under different parameter settings

## 3.3 Comparing CNNs and RNNs

• Dropout provides a $2-4\%$ accuracy improvement
• But several of the comparison systems do not use dropout and might gain the same benefit from it
• It is still seen as a remarkable result for such a simple architecture
• Differences from the window and RNN architectures described in earlier lectures: pooling, many filters, and dropout
• Some of these ideas can also be used in RNNs

## 3.4 Model comparison

• Bag of words / Bag of Vectors: a surprisingly good baseline for simple classification problems, especially when followed by a few ReLU layers (see paper: Deep Averaging Networks)
• Word window classification / Window Model: good for single-word classification on problems that do not need wide context (i.e. local problems), e.g. POS tagging and NER
• Convolutional neural networks / CNN: good for classification; shorter phrases need zero padding; hard to interpret; easy to parallelize on GPUs
• Recurrent neural networks / RNN: cognitively plausible (reading left to right); not best for classification (if you only use the last state); much slower than CNNs; good for sequence tagging and classification and for language models; can be excellent when combined with attention

Supplementary notes

• RNNs work well for things like sequence tagging and classification and for language models predicting the next word, and they achieve good results when combined with attention, but for making a judgment about a whole sentence, CNNs do better

## 3.5 Applying skip (shortcut) connections

• The gating / skipping we saw in LSTMs and GRUs is a general idea that is now used in many places
• You can also use gating **vertically**
• In fact, the key concept, summing a candidate update with a shortcut connection, is what makes very deep networks work
• Note: when adding them, pad $x$ in the conv so that the output has the same dimensions, then sum (see the sketch below)
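
A minimal PyTorch sketch of this idea with illustrative sizes: `padding=1` keeps the conv output the same length as $x$, so the candidate update and the shortcut can simply be summed.

```python
import torch
from torch import nn
import torch.nn.functional as F

# Residual (skip) connection around a conv layer.
# padding=1 preserves the sequence length so x and conv(x) can be added.
conv = nn.Conv1d(in_channels=4, out_channels=4, kernel_size=3, padding=1)

x = torch.randn(16, 4, 7)   # (batch, channels, seq_len)
out = x + F.relu(conv(x))   # shortcut: candidate update summed with x
```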

## 3.6 Batch normalization (BatchNorm)

• Commonly used in CNNs
• Transforms the convolution output of a mini-batch by scaling the activations to zero mean and unit variance
• This is the familiar Z-transform from statistics
• But it is recomputed for every mini-batch, so fluctuations have little effect
• Using BatchNorm makes the model much less sensitive to parameter initialization, since the outputs are automatically rescaled
• It also makes tuning the learning rate easier, and model training more stable
• PyTorch: nn.BatchNorm1d (see the sketch below)
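
A minimal sketch of `nn.BatchNorm1d` after a 1D convolution (illustrative sizes): it normalizes each output channel to zero mean and unit variance over the mini-batch, then applies a learned scale and shift.

```python
import torch
from torch import nn

conv = nn.Conv1d(in_channels=4, out_channels=3, kernel_size=3, padding=1)
bn = nn.BatchNorm1d(num_features=3)  # one mean/variance estimate per output channel

x = torch.randn(16, 4, 7)
out = torch.relu(bn(conv(x)))        # (16, 3, 7), normalized per channel over the batch
```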

## 3.7 1x1 convolutions

• Do 1x1 convolutions make sense? Yes.
• A 1x1 convolution, i.e. a Network-in-Network (NiN) connection, is a convolution with kernel size 1
• A 1x1 convolution gives a fully connected linear layer across channels
• It can be used to map from many channels to fewer channels
• 1x1 convolutions add additional neural network layers with very few extra parameters
• unlike fully connected (FC) layers, which add a large number of parameters (see the sketch below)
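
A minimal sketch of a 1x1 convolution in PyTorch (illustrative sizes): it acts like a per-position fully connected layer across channels, here mapping 100 channels down to 25 with only 100 x 25 + 25 parameters.

```python
import torch
from torch import nn

# kernel_size=1: a fully connected linear map across channels at each position.
pointwise = nn.Conv1d(in_channels=100, out_channels=25, kernel_size=1)

x = torch.randn(16, 100, 7)
out = pointwise(x)          # (16, 25, 7); only 100*25 + 25 = 2525 parameters
```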

## 3.8 CNN application: machine translation

• One of the earliest successful uses of neural networks for machine translation
• It uses a CNN to encode and an RNN to decode
• Kalchbrenner and Blunsom (2013), Recurrent Continuous Translation Models

## 3.9 Paper interpretation: Learning Character-level Representations for Part-of-Speech Tagging

• Convolve over characters to generate word embeddings
• A fixed window of word embeddings is used for POS tagging

## 3.10 Paper interpretation: Character-Aware Neural Language Models

• Character-based word embeddings
• Uses convolutions, a highway network, and an LSTM

# 4. Deep CNNs for Text Classification

## 4.1 Deep convolutional networks for text classification

• Starting point: sequence models (LSTMs) dominate NLP, along with CNNs, attention, and so on, but none of these models are very deep, unlike the deep models in computer vision
• What happens when we build a vision-like deep system for NLP?
• Work from the character level

## 4.2 VD-CNN structure

• The overall system looks somewhat like the vision models VGG and ResNet
• It does not look much like a typical deep-learning NLP system
• The result has a fixed size, because the text is truncated or padded to a uniform length
• Each stage has a local pooling operation, and the number of feature maps doubles

## 4.3 The VD-CNN convolutional block

• Each convolutional block consists of two convolutional layers, each followed by BatchNorm and a ReLU (see the sketch below)
• The convolution kernel size is 3
• Padding keeps (or, with local pooling, halves) the dimension
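
A minimal sketch of such a convolutional block in PyTorch (the class name `ConvBlock` and the channel sizes are illustrative): two kernel-size-3 convolutions with padding 1, each followed by BatchNorm and ReLU, so the sequence length is preserved.

```python
import torch
from torch import nn

class ConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Two conv layers, each followed by BatchNorm and ReLU; padding=1 keeps the length.
        self.block = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_channels),
            nn.ReLU(),
            nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_channels),
            nn.ReLU(),
        )

    def forward(self, x):          # x: (batch, in_channels, seq_len)
        return self.block(x)       # (batch, out_channels, seq_len)

out = ConvBlock(64, 128)(torch.randn(16, 64, 32))  # illustrative sizes
```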

## 4.4 Experimental results

• Uses large text classification datasets
• Much larger than the small datasets often used in NLP, such as in Yoon Kim (2014)'s paper

Supplementary notes

• The figures above are error rates, so lower is better
• Deeper networks achieve better results and residual layers help, but going ever deeper yields diminishing returns
• The experiments show that MaxPooling works better than the other two pooling methods, KMaxPooling and strided convolutions
• ConvNets can help us build good text classification systems

## 4.5 RNNs are slow

• RNNs are a very standard building block for deep NLP
• But they parallelize poorly, and are therefore very slow
• Idea: take the best, parallelizable parts of RNNs and CNNs

# 5. The Q-RNN Model

## 5.1 Quasi-Recurrent Neural Networks

• They try to combine the advantages of the two model families
• Convolutions, parallel across time, compute the candidate, the forget gate, and the output gate:
$$\begin{aligned} \mathbf{z}_{t} &=\tanh \left(\mathbf{W}_{z}^{1} \mathbf{x}_{t-1}+\mathbf{W}_{z}^{2} \mathbf{x}_{t}\right) \\ \mathbf{f}_{t} &=\sigma\left(\mathbf{W}_{f}^{1} \mathbf{x}_{t-1}+\mathbf{W}_{f}^{2} \mathbf{x}_{t}\right) \\ \mathbf{o}_{t} &=\sigma\left(\mathbf{W}_{o}^{1} \mathbf{x}_{t-1}+\mathbf{W}_{o}^{2} \mathbf{x}_{t}\right) \end{aligned}$$
$$\begin{aligned} \mathbf{Z} &=\tanh \left(\mathbf{W}_{z} * \mathbf{X}\right) \\ \mathbf{F} &=\sigma\left(\mathbf{W}_{f} * \mathbf{X}\right) \\ \mathbf{O} &=\sigma\left(\mathbf{W}_{o} * \mathbf{X}\right) \end{aligned}$$
• The element-wise gated pseudo-recurrence, parallel across channels, is done in the pooling layer (see the sketch below):
$\mathbf{h}_{t}=\mathbf{f}_{t} \odot \mathbf{h}_{t-1}+\left(1-\mathbf{f}_{t}\right) \odot \mathbf{z}_{t}$
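
A minimal sketch of this "f-pooling" recurrence (the function name `f_pooling` and all sizes are illustrative). In a QRNN the gates $\mathbf{Z}$ and $\mathbf{F}$ come from the parallel convolutions above; only this cheap element-wise recurrence has to run sequentially over time.

```python
import torch

def f_pooling(Z, F):
    # h_t = f_t * h_{t-1} + (1 - f_t) * z_t, applied element-wise per channel.
    # Z, F: (batch, hidden, time); only this loop over time is sequential.
    B, H, T = Z.shape
    h = torch.zeros(B, H)
    outputs = []
    for t in range(T):
        h = F[:, :, t] * h + (1 - F[:, :, t]) * Z[:, :, t]
        outputs.append(h)
    return torch.stack(outputs, dim=2)

Z = torch.tanh(torch.randn(16, 32, 7))     # candidate values from the convolution
F = torch.sigmoid(torch.randn(16, 32, 7))  # forget gates from the convolution
H = f_pooling(Z, F)                        # (16, 32, 7)
```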

## 5.3 Q-RNNs: sentiment analysis

• Often better and faster than LSTMs
• More interpretable

## 5.4 Limitations of Q-RNNs

• They don't work as well as LSTMs for character-level LMs
• They have trouble with the longer dependencies that arise in such modeling
• A deeper network is usually needed to get performance as good as an LSTM's
• They are still faster even when made deeper
• They effectively use depth as a substitute for true recurrence

## 5.5 Shortcomings of RNNs & motivation for the Transformer

• We want parallelism for speed, but RNNs are inherently sequential
• Despite GRUs and LSTMs, RNNs still need an attention mechanism to capture long-range dependencies, and the computation path between distant states grows with the sequence length
• But if attention itself gives us access to information at any position, maybe we don't need the RNN at all?

# 6. Video tutorial

You can watch the [bilingual subtitles] version of the video on Bilibili.

# Recommended ShowMeAI Tutorial Series

https://cdmana.com/2022/134/202205141312284819.html
