- Author: Han Xinzi @ShowMeAI, Lu Yao @ShowMeAI, Kiwi Fruit @ShowMeAI
- Tutorial address: www.showmeai.tech/tutorials/3…
- Article address: www.showmeai.tech/article-det…
- Statement: all rights reserved; for reprints, please contact the platform and the author and indicate the source

ShowMeAI has produced a **Chinese translation and annotation** of all the courseware for the **Stanford CS224n** course 《Natural Language Processing with Deep Learning》, and turned it into GIF animations!

This lecture's content is an **in-depth summary tutorial**, which can be viewed **here**. For how to obtain the videos, courseware, and other materials, see **the end of the article**.

# Introduction

## Lecture plan

- Announcements
- Intro to CNNs / **Introduction to convolutional neural networks**
- Simple CNN for Sentence Classification: Yoon (2014) / **Applying CNNs to text classification**
- CNN potpourri / **CNN details**
- Deep CNN for Sentence Classification: Conneau et al. (2017) / **Deep CNNs for text classification**
- Quasi-recurrent Neural Networks / **The Q-RNN model**

## Welcome to the second half of the course!

Now we are preparing you to be DL+NLP researchers/practitioners!

**Lectures won't always have all the details**

- It's up to you to search online / do reading to learn more
- This is an active research area, and sometimes there is no clear answer
- The course staff are happy to discuss things with you, but you need to think for yourself

**Assignments are designed to expose the real difficulties of projects**

- Each assignment deliberately provides less helper material than the previous one
- For the final project, no autograder or sanity checks are provided
- DL debugging is hard, but you need to learn how to debug!

# 1. Introduction to convolutional neural networks

(For convolutional neural networks, you can also refer to the ShowMeAI summary of Andrew Ng's Deep Learning Course | **Convolutional Neural Networks**)

## 1.1 From RNNs to CNNs

- Recurrent neural networks cannot capture phrases without their prefix context
- The final vector often captures too much information from the last words

- For example: the softmax is usually computed only at the last step

- **Main idea of CNNs/ConvNets**: what if we compute a vector for every word subsequence of a given length?

- For example:
`tentative deal reached to keep government open`

- The computed vectors would be for:
  - tentative deal reached, deal reached to, reached to keep, to keep government, keep government open

- Regardless of whether each phrase is grammatical
- Not very linguistically or cognitively plausible
- Then group them (coming up soon)

## 1.2 CNNs / Convolutional neural networks

## 1.3 What is a convolution?

- The one-dimensional discrete convolution is defined as $(f \ast g)[n]=\sum_{m=-M}^{M} f[n-m] g[m]$ (see the worked sketch after this list)
- Convolution is classically used to extract features from images
- The model recognizes patterns in a position-invariant way
- You can refer to Stanford's deep learning and computer vision course cs231n (you can also look up the cs231n notes series on ShowMeAI)
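
To make the formula concrete, here is a minimal sketch in PyTorch (the input and kernel values are illustrative assumptions). Note that `conv1d` in deep learning frameworks actually computes cross-correlation, so the kernel is flipped to recover the textbook convolution:

```
import torch
import torch.nn.functional as F

f = torch.tensor([1., 2., 3., 4., 5.])  # input signal
g = torch.tensor([1., 0., -1.])         # convolution kernel

# conv1d is cross-correlation; flipping g yields the true convolution
out = F.conv1d(f.view(1, 1, -1), g.flip(0).view(1, 1, -1))
print(out)  # tensor([[[2., 2., 2.]]]), e.g. position 0: 1*(-1) + 2*0 + 3*1 = 2
```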

- A two-dimensional example:
- Yellow and red numbers show the filter (= kernel) weights
- Green shows the input
- Pink shows the output

## 1.4 One-dimensional convolution for text

- Apply a 1-dimensional convolution to the text

## 1.5 One-dimensional convolution for text, with padding

- The input is a word sequence of length $L$
- Suppose the word dimension is 4, i.e. there are 4 channels
- After convolving with one filter we get 1 output channel

- With multiple filters we get multiple output channels, each attending to different potential features of the text

## 1.6 conv1d, padded, with max pooling over time

- Average pooling, by contrast, averages over the feature map

## 1.7 PyTorch implementation

- In the PyTorch implementation, the parameters correspond directly to the details described above

```
import torch
from torch import nn

batch_size = 16
word_embed_size = 4
seq_len = 7
input = torch.randn(batch_size, word_embed_size, seq_len)
conv1 = nn.Conv1d(in_channels=word_embed_size, out_channels=3, kernel_size=3)  # can add: padding=1
hidden1 = conv1(input)                  # shape: (16, 3, 5)
hidden2, _ = torch.max(hidden1, dim=2)  # max pool over time -> shape: (16, 3)
```

## 1.8 Stride (here: 2)

- Striding reduces the amount of computation

## 1.9 Local max pooling

- Max pooling over every two rows is called local max pooling with stride 2

## 1.10 1-dimensional convolution with k-max pooling

- Keep the top $k$ activation values over time for each channel, preserving their original order (in the previous example: -0.2, 0.3)

## 1.11 Dilated convolution: dilation = 2

**Dilated convolution / atrous convolution**

- In the example above, rows 1, 3, and 5 are convolved, and two filters produce activation values on two channels
- You could instead enlarge the kernel from 3 to 5 to get a similar receptive field; dilation, however, keeps the kernel matrix small while still letting a single convolution see a wider span of the sentence

**Additional notes / summary**

- In CNNs, how much of the sentence one computation can see at a time is a very important concept
- You can use bigger filters, dilated convolutions, or increase the depth of the network (the number of layers); a sketch of these options follows
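
Here is a minimal sketch of the options from sections 1.8-1.11 (the shapes and hyperparameters are illustrative; the `kmax_pooling` helper is a hypothetical name written for this sketch, not a built-in):

```
import torch
from torch import nn

x = torch.randn(16, 4, 7)  # (batch, channels = word dimension, seq_len)

# Strided convolution (section 1.8): stride=2 halves the output positions
conv_strided = nn.Conv1d(4, 3, kernel_size=3, stride=2)

# Dilated convolution (section 1.11): dilation=2 skips every other input
# position, widening the receptive field with no extra parameters
conv_dilated = nn.Conv1d(4, 3, kernel_size=3, dilation=2)

def kmax_pooling(t, k, dim=2):
    # k-max pooling (section 1.10): keep the top-k activations per channel,
    # restored to their original time order
    idx = t.topk(k, dim=dim).indices.sort(dim=dim).values
    return t.gather(dim, idx)

h = conv_dilated(x)                # shape: (16, 3, 3)
print(kmax_pooling(h, k=2).shape)  # torch.Size([16, 3, 2])
```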

# 2. Applying CNNs to text classification

## 2.1 A single-layer CNN for sentence classification

- Goal: **sentence classification**, mainly identifying whether a sentence expresses **positive or negative sentiment**
- Other tasks:
  - Judging whether a sentence is **subjective or objective**
  - Question classification: what entity is the question about? A person, a place, a number, ...

- A simple model: one convolutional layer plus one **pooling layer**
- Word vectors: $\mathbf{x}_{i} \in \mathbb{R}^{k}$
- Sentence: $\mathbf{x}_{1 : n}=\mathbf{x}_{1} \oplus \mathbf{x}_{2} \oplus \cdots \oplus \mathbf{x}_{n}$ (vector concatenation)
- $\mathbf{x}_{i : i+j}$ denotes the concatenation of the words in that range (symmetric windows are more common)
- Convolutional filter: $\mathbf{w} \in \mathbb{R}^{h k}$ (it acts over a window of $h$ words)
- Note that the filter is a vector; its size can be 2, 3, or 4

## 2.2 Single-layer CNN

- The filter $w$ is applied to all possible windows (concatenated word vectors)
- Compute a feature (one channel) for the CNN layer: $c_{i}=f\left(\mathbf{w}^{T} \mathbf{x}_{i : i+h-1}+b\right)$

The sentence: $\mathbf{x}_{1 : n}=\mathbf{x}_{1} \oplus \mathbf{x}_{2} \oplus \ldots \oplus \mathbf{x}_{n}$

All possible windows of length $h$: $\left\{\mathbf{x}_{1 : h}, \mathbf{x}_{2 : h+1}, \dots, \mathbf{x}_{n-h+1 : n}\right\}$

The result is a feature map: $\mathbf{c}=\left[c_{1}, c_{2}, \dots, c_{n-h+1}\right] \in \mathbb{R}^{n-h+1}$

## 2.3 Pooling and channels

- Pooling: a max-over-time pooling layer
- **Idea**: capture the most important activation (the maximum over time)
- From the feature map $\mathbf{c}=\left[c_{1}, c_{2}, \dots, c_{n-h+1}\right] \in \mathbb{R}^{n-h+1}$
- Pool down to a single number: $\hat{c}=\max \{\mathbf{c}\}$

- Use multiple filter weights $w$
- Different window sizes $h$ are useful
- Because of max pooling $\hat{c}=\max \{\mathbf{c}\}$, the length of $\mathbf{c}$ is irrelevant

- So we can have some filters that look at unigrams, bigrams, tri-grams, 4-grams, etc.

## 2.4 Multi-channel input idea

- Initialize with pre-trained word vectors (word2vec or GloVe)
- Start with two copies
- Backpropagate into only one copy, keeping the other `static`

- Both channel sets are added to $c_i$ before max pooling

## 2.5 Classification after one CNN layer

- First one convolution, then one max-pooling

- To obtain the final feature vector $\mathbf{z}=\left[\hat{c}_{1}, \dots, \hat{c}_{m}\right]$
  - assuming we have $m$ convolutional kernels (filters) $w$
  - e.g. using 100 feature maps each of sizes 3, 4, and 5

- The final layer is a simple softmax: $y=\operatorname{softmax}\left(W^{(S)} z+b\right)$

**Additional notes**

- arxiv.org/pdf/1510.03…
- The input is a sentence of 7 words, each of dimension 5, i.e. the input matrix is $7 \times 5$
- Use different `filter_size : (2,3,4)`, with two filters per size, giving two channels of features per size and 6 filters in total
- Apply 1-max pooling to each filter's feature map, concatenate the results into a 6-dimensional vector, and apply softmax to get the binary classification result (see the sketch below)
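
Below is a minimal PyTorch sketch of this architecture, a Kim (2014)-style classifier with the hyperparameters just described (filter sizes 2/3/4, two filters per size, binary output). The vocabulary size is a placeholder assumption, and the softmax is left to the loss function:

```
import torch
from torch import nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=5, num_classes=2,
                 filter_sizes=(2, 3, 4), n_filters=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, kernel_size=h) for h in filter_sizes
        )
        self.fc = nn.Linear(n_filters * len(filter_sizes), num_classes)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # 1-max pooling over time for each filter, then concatenate
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        z = torch.cat(pooled, dim=1)            # (batch, 6) feature vector
        return self.fc(z)                       # softmax is applied in the loss

model = TextCNN(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (16, 7)))  # sixteen 7-word sentences
```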

## 2.6 Regularization

- Use dropout: create a mask vector $r$ of Bernoulli random variables (each is 0 or 1, taking the value $1$ with probability $p$, a hyperparameter), and multiply it element-wise into the feature vector during training

- This deletes features during training

**Explanation**: it prevents co-adaptation (overfitting to specific features)

- Dropout is not applied at test time; instead, the final vector is scaled by the probability $p$

- In addition: constrain the L2 norm of each class's weight vector (each row of the softmax weight $W^{(S)}$) to at most a fixed number $s$ (also a hyperparameter)
- If $\left\|W_{c}^{(S)}\right\|>s$, rescale it so that $\left\|W_{c}^{(S)}\right\|=s$
  - This is not very common (a sketch of both tricks follows)
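
A minimal PyTorch sketch of both regularizers. Note that `nn.Dropout` implements "inverted" dropout, scaling by $1/(1-p)$ at training time instead of scaling by $p$ at test time; the effect is equivalent. The `renorm_rows` helper is a hypothetical name written for illustration:

```
import torch
from torch import nn

drop = nn.Dropout(p=0.5)  # Bernoulli mask, applied only in training mode
z = torch.randn(16, 6)
z_train = drop(z)         # roughly half the features are zeroed out

# Max-norm constraint: rescale any row of the softmax weights whose
# L2 norm exceeds s (typically applied after each gradient step)
def renorm_rows(W, s=3.0):
    with torch.no_grad():
        W.copy_(torch.renorm(W, p=2, dim=0, maxnorm=s))

softmax_layer = nn.Linear(6, 2)
renorm_rows(softmax_layer.weight, s=3.0)
```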

# 3. CNN details

## 3.1 Discussion of CNN hyperparameters

- Tune hyperparameters on the validation (dev) set
- Activation function: ReLU
- Window filter sizes $h=3,4,5$
- 100 feature maps per filter size
- Dropout $p=0.5$
  - Kim (2014) reports a $2-4\%$ accuracy improvement from dropout

- L2 constraint on the softmax weights, $s=3$
- Minibatch size for SGD training: $50$
- Word vectors: pre-trained with word2vec, $k=300$
- During training, keep checking performance on the validation set and select the highest-accuracy weights for the final evaluation

## 3.2 Experimental results

- Experimental results under different hyperparameter settings

## 3.3 Comparing CNNs and RNNs

- Dropout provides a $2-4\%$ accuracy improvement
- But several of the comparison systems do not use dropout and might gain the same benefit from it

- Still seen as a remarkable result for a simple architecture
- Differences from the window and RNN architectures described in earlier lectures: pooling, many filters, and dropout
- Some of these ideas can be used in RNNs too

## 3.4 Model comparison

**Bag of Vectors**: a surprisingly good baseline for simple classification problems, especially when followed by a few ReLU layers (see paper: Deep Averaging Networks)

**Window Model**: good for single-word classification on problems that do not need wide context (i.e. local problems), e.g. POS, NER

**CNNs**: good for classification; shorter phrases need zero padding; hard to interpret; easy to parallelize on GPUs

**RNNs**: cognitively plausible (reading left to right); not the best for classification (if you only use the last state); much slower than CNNs; good for sequence tagging, classification, and language models; great when combined with attention

Additional note

- RNNs work well for tasks like sequence tagging and classification, and for language models predicting the next word, and they achieve strong results combined with attention; but for an overall interpretation of a whole sentence, CNNs do better

## 3.5 Gated units used vertically (skip connections)

- The gating/skipping we saw in LSTMs and GRUs is a general concept, now used in many places
- You can also use gates **vertically**, across depth

- In fact, the key concept, summing a candidate update with a shortcut connection, is what makes very deep networks work

**Note**: when summing, pad $x$ to the same dimension as the conv output, then add them (see the residual sketch below)
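
A minimal sketch of such a shortcut: a residual block for 1D text convolutions. With `kernel_size=3` and `padding=1` the time dimension is unchanged, so the input and the candidate update can be summed directly; the channel count is an illustrative assumption:

```
import torch
from torch import nn
import torch.nn.functional as F

class ResidualConvBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        h = F.relu(self.conv1(x))
        h = self.conv2(h)
        return F.relu(x + h)  # shortcut: sum the input with the candidate update

block = ResidualConvBlock(channels=4)
print(block(torch.randn(16, 4, 7)).shape)  # torch.Size([16, 4, 7])
```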

## 3.6 Batch normalization (BatchNorm)

- Commonly used in CNNs
- Transforms the convolution output of a mini-batch by scaling the activations to zero mean and unit variance
- This is the Z-transform familiar from statistics
- But it is updated for every mini-batch, so fluctuations have little effect

- Using BatchNorm makes models less sensitive to parameter initialization, since the outputs are automatically rescaled
- It also makes the learning rate easier to tune and model training more stable

- PyTorch: `nn.BatchNorm1d` (a usage sketch follows)
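
A minimal usage sketch (the shapes are illustrative); `nn.BatchNorm1d` keeps one mean/variance pair per output channel:

```
import torch
from torch import nn

conv = nn.Conv1d(4, 3, kernel_size=3)
bn = nn.BatchNorm1d(3)    # one (mean, variance) pair per output channel
x = torch.randn(16, 4, 7)
h = bn(conv(x))           # per channel: ~zero mean, unit variance over the batch
```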

## 3.7 1x1 convolutions

**Do 1x1 convolutions make sense?** **Yes.**

- A 1x1 convolution, i.e. a Network-in-Network (NiN) connection, is a convolution whose kernel size is 1
- A 1x1 convolution gives you a fully connected linear layer across channels
- It can be used to map from many channels to fewer channels
- A 1x1 convolution adds an extra neural network layer with very few additional parameters
- Unlike fully connected (FC) layers, which add a lot of parameters (see the sketch below)
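
A minimal sketch of channel reduction with a 1x1 convolution (the channel counts are illustrative):

```
import torch
from torch import nn

x = torch.randn(16, 100, 7)                 # 100 input channels
reduce = nn.Conv1d(100, 25, kernel_size=1)  # cross-channel linear layer
print(reduce(x).shape)                      # torch.Size([16, 25, 7])
# Parameter count: 100 * 25 + 25 = 2525, independent of the sequence length
```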

## 3.8 CNN application: machine translation

- One of the earliest successful uses of neural networks for machine translation
- Uses a CNN for encoding and an RNN for decoding
- Kalchbrenner and Blunsom (2013), `Recurrent Continuous Translation Models`

## 3.9 #Paper interpretation# Learning Character-level Representations for Part-of-Speech Tagging

- Convolve over characters to generate word embeddings
- A fixed window over the word embeddings is used for POS tagging

## 3.10 #Paper interpretation# Character-Aware Neural Language Models

- Character-based word embeddings
- Uses convolutions, highway networks, and an LSTM

# 4. Deep CNNs for text classification

## 4.1 Deep convolutional networks for text classification

**Starting point**: sequence models (LSTMs) dominate in NLP, along with CNNs, attention, and so on, but none of these models are very deep, unlike the deep models in computer vision

- What happens when we build an NLP system that looks like a vision system?
- Work from the character level

## 4.2 VD-CNN architecture

- The overall system looks somewhat like vision neural network models such as VGG and ResNet

- It does not look much like a typical deep learning NLP system

- The result is of fixed size, because the text is truncated or padded to a uniform length

- Each stage has a local pooling operation that doubles the number of features

## 4.3 The VD-CNN convolutional module

- Each convolutional block is two convolutional layers, each followed by BatchNorm and a ReLU (see the sketch below)
- The convolution kernel size is 3
- Padding preserves (or, with local pooling, halves) the temporal dimension
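
A minimal sketch of one such block, with illustrative channel and length values:

```
import torch
from torch import nn

def vdcnn_block(channels):
    # Two (conv, kernel size 3 -> BatchNorm -> ReLU) layers;
    # padding=1 preserves the temporal dimension
    return nn.Sequential(
        nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        nn.BatchNorm1d(channels),
        nn.ReLU(),
        nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        nn.BatchNorm1d(channels),
        nn.ReLU(),
    )

block = vdcnn_block(64)
print(block(torch.randn(16, 64, 1024)).shape)  # torch.Size([16, 64, 1024])
```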

## 4.4 Experimental results

- Uses large text classification datasets
  - Much larger than the small datasets often used in NLP, e.g. in Yoon Kim (2014)

**Additional notes**

- The figures above are error rates, so lower is better
- Deeper networks achieve better results and residual layers help, but the gains diminish as depth grows
- The experiments show that MaxPooling works better than the other two pooling methods, KMaxPooling and strided convolution
- ConvNets can help us build good text classification systems

## 4.5 RNNs are slow

- RNNs are a very standard building block for deep NLP
- But they parallelize poorly and are therefore slow
- Idea: take the best, parallelizable parts of RNNs and CNNs

# 5. The Q-RNN model

## 5.1 Quasi-Recurrent Neural Networks

- Tries to combine the advantages of the two model families
- Convolutions, parallel across time, compute the candidate, the forget gate, and the output gate

- The element-wise gated pseudo-recurrence, parallel across channels, is done in the pooling layer (see the sketch below)
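
A minimal sketch of a QRNN layer with "f-pooling" (a simplified variant that omits the output gate; the sizes are illustrative). The convolutions over time run fully in parallel, and only a cheap element-wise recurrence remains sequential:

```
import torch
from torch import nn

class QRNNLayer(nn.Module):
    def __init__(self, input_size, hidden_size, kernel_size=2):
        super().__init__()
        self.pad = kernel_size - 1  # causal: pad on the left only
        self.conv = nn.Conv1d(input_size, 2 * hidden_size, kernel_size)
        self.hidden_size = hidden_size

    def forward(self, x):                        # x: (batch, input_size, T)
        x = nn.functional.pad(x, (self.pad, 0))  # causal padding
        z, f = self.conv(x).split(self.hidden_size, dim=1)
        z, f = torch.tanh(z), torch.sigmoid(f)   # candidate and forget gate
        h, hs = torch.zeros_like(z[:, :, 0]), []
        for t in range(z.size(2)):               # element-wise, so very fast
            h = f[:, :, t] * h + (1 - f[:, :, t]) * z[:, :, t]
            hs.append(h)
        return torch.stack(hs, dim=2)            # (batch, hidden_size, T)

layer = QRNNLayer(input_size=4, hidden_size=8)
print(layer(torch.randn(16, 4, 7)).shape)        # torch.Size([16, 8, 7])
```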

## 5.2 Q-RNN experiments: language modeling

## 5.3 Q-RNNs: sentiment analysis

- Often better and faster than LSTMs
- More interpretable

## 5.4 Limitations of QRNNs

- Does not work as well as LSTMs for character-level LMs
  - It struggles with the longer dependencies encountered in that kind of modeling

- A deeper network is usually needed to reach performance as good as an LSTM's
  - They are still faster even when made deeper
  - They effectively use depth as a substitute for true recurrence

## 5.5 Shortcomings of RNNs & the motivation for Transformers

- We want parallelization to speed things up, but RNNs are inherently sequential

- Even with GRUs and LSTMs, RNNs need attention mechanisms to capture long-range dependencies; and as the sequence grows, the path that computation must travel between distant positions grows too

- But if the attention mechanism itself gives us access to information at any position, maybe we do not need the RNN at all?

# 6. Video tutorial

You can watch the 【bilingual subtitles】 version of the video on **Bilibili**

# 7. Reference material

- **Online flip-book** for studying this lecture
- 《Stanford CS224n Deep Learning and Natural Language Processing》 **course study guide**
- 《Stanford CS224n Deep Learning and Natural Language Processing》 **course assignment analysis**
- 【**Bilingual subtitle video**】 Stanford CS224n | Deep Learning and Natural Language Processing (2019 · all 20 lectures)
- **Stanford official website** | CS224n: Natural Language Processing with Deep Learning

**ShowMeAI** recommended tutorial series

- Tech implementations at major companies | Recommendation and advertising computing solutions
- Tech implementations at major companies | Computer vision solutions
- Tech implementations at major companies | Natural language processing industry solutions
- Illustrated Python programming: from beginner to expert
- Illustrated data analysis: from beginner to expert
- Illustrated AI mathematical foundations: from beginner to expert
- Illustrated big data technology: from beginner to expert
- Illustrated machine learning algorithms: from beginner to expert
- Machine learning in practice: a hands-on series
- Deep learning course | Andrew Ng's specialization: a full set of notes
- Natural language processing course | Stanford CS224n: a full set of notes and course interpretation

Copyright notice

This article was created by [ShowMeAI]. Please include a link to the original when reprinting. Thank you.
