Author | Stan Kriventsov  Compiled by | Flin  Source | Medium
In this post, I want to explain the significance of a new paper submitted to the ICLR 2021 conference (anonymously so far): "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale".
In another article, I provide an example that uses this new model (called the Vision Transformer) with PyTorch to make predictions on the standard MNIST dataset.
Deep learning (machine learning with neural networks that have more than one hidden layer) has been around since the 1960s, but what really brought it to the forefront was AlexNet in 2012. AlexNet is a convolutional network (simply put, a network that first looks for small patterns in each part of an image and then tries to combine them into an overall picture) designed by Alex Krizhevsky, and it won that year's ImageNet image classification competition.
ImageNet image classification competition: https://en.wikipedia.org/wiki/ImageNet
Over the next few years, deep computer vision went through a real revolution. New convolutional architectures appeared every year (GoogleNet, ResNet, DenseNet, EfficientNet, and others), setting new accuracy records on ImageNet and on other benchmark datasets such as CIFAR-10 and CIFAR-100.
The figure below shows how the top accuracy of machine learning models on the ImageNet dataset (the accuracy of correctly predicting what an image contains on the first attempt) has progressed since 2011.
In the past few years, however, the most interesting developments in deep learning have come not from images but from natural language processing (NLP), starting with the 2017 paper "Attention Is All You Need" by Ashish Vaswani et al.
- Paper: https://arxiv.org/abs/1706.03762
The idea of attention is to use trainable weights that model the importance of each connection between different parts of an input sentence. Its effect on NLP has been similar to that of convolutional networks on computer vision: it has greatly improved machine learning models on a range of language tasks, such as natural language understanding and machine translation.
Attention works so well for language data because understanding human language often requires tracking long-range dependencies. We might start by saying "We arrived in New York" and then say "The weather in the city was fine". To any human reader it is clear that "the city" in the second sentence refers to "New York", but a model that only looks for patterns in nearby data (such as a convolutional network) may not be able to detect this connection.
The long-range dependency problem can be addressed with recurrent networks such as LSTMs, which were in fact the top models in NLP before transformers arrived. Even those models, however, had trouble matching specific words to one another.
The global attention mechanism in transformers measures the importance of each connection between any two words in the text, which explains their performance advantage. For types of sequential data where attention matters less (for example, time-domain data such as daily sales or stock prices), recurrent networks remain very competitive and may still be the best choice.
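To make the idea of a weight for every pair of words concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. The token embeddings are made-up numbers, and reusing them directly as queries, keys, and values is a simplification: a real transformer applies separate learned projections first.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over a sequence of vectors.

    Returns the attended outputs and the N x N weight matrix, where
    weights[i][j] is how strongly token i attends to token j.
    """
    dim = len(keys[0])
    weights = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


# Hypothetical 2-d embeddings for four tokens (illustrative, not learned).
tokens = ["New", "York", "the", "city"]
embeddings = [[1.0, 0.2], [0.9, 0.4], [0.0, 0.1], [0.8, 0.5]]

# Simplification: use the embeddings as queries, keys, and values alike.
outputs, weights = self_attention(embeddings, embeddings, embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row sums to 1 and covers every token in the sequence, however far apart the tokens are. With these toy numbers, "city" puts more weight on "York" than on "the".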
Dependencies between distant objects may be especially significant in sequence models such as those used in NLP, but they certainly cannot be ignored in image tasks either. To form a complete picture, you usually need to be aware of all the different parts of the image.
Until now, attention models have not done well in computer vision largely because they are hard to scale. Their cost grows as N², so a 1000x1000 image, with its million pixels, would need a full set of about a trillion attention weights between pixels.
Perhaps more importantly, unlike words in a text, individual pixels in a picture are not very meaningful on their own, so connecting them through attention does not make much sense.
The new paper proposes applying attention not to pixels but to small patches of the image (perhaps 16x16, as in the title, although the optimal patch size really depends on the size and content of the images the model is applied to).
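The scaling argument can be checked with a few lines of arithmetic. The sketch below counts pairwise attention weights (per head, per layer) when every pixel is a token and when every 16x16 patch is a token. The 224x224 image size is an assumption chosen because it divides evenly by 16; it is not a figure from this article.

```python
def attention_pairs(num_tokens):
    """Number of pairwise attention weights for one head in one layer."""
    return num_tokens ** 2


def num_patches(height, width, patch_size):
    """Number of non-overlapping square patches (requires exact divisibility)."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    return (height // patch_size) * (width // patch_size)


# Every pixel of a 1000x1000 image as a token: 10**6 tokens.
pixel_tokens = 1000 * 1000
print(attention_pairs(pixel_tokens))  # 1000000000000

# Every 16x16 patch of an (assumed) 224x224 image as a token: 14 * 14 tokens.
patch_tokens = num_patches(224, 224, 16)
print(patch_tokens, attention_pairs(patch_tokens))  # 196 38416
```

Moving from pixels to patches shrinks the sequence enough that full attention between all tokens becomes affordable.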
The picture above (from the paper) shows how the Vision Transformer works.
Each patch of the input image is flattened using a linear projection matrix, and a positional embedding is added to it (learned numbers that carry information about where the patch was originally located in the image). This is necessary because the transformer processes all of its inputs without regard to their order, so the position information helps the model evaluate the attention weights correctly. An extra class token is concatenated to the input (position 0 in the figure) as a placeholder for the class to be predicted in the classification task.
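The input pipeline described above can be sketched in plain Python as follows. This is an illustration under several assumptions, not the paper's implementation: the image is a single-channel 28x28 grid (MNIST-sized), the patch size of 7 and embedding width of 8 are arbitrary, and random numbers stand in for the parameters a real model would learn. All function names are hypothetical.

```python
import random


def split_into_patches(image, patch_size):
    """Cut a 2-D grayscale image (list of rows) into flattened square
    patches, in row-major order. Assumes exact divisibility."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return patches


def random_matrix(rows, cols, rng):
    """Stand-in for learned parameters: small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def embed_patches(image, patch_size, embed_dim, rng):
    """Project patches, prepend a class token, add position embeddings."""
    patches = split_into_patches(image, patch_size)
    projection = random_matrix(patch_size * patch_size, embed_dim, rng)
    class_token = random_matrix(1, embed_dim, rng)[0]  # goes to position 0
    positions = random_matrix(len(patches) + 1, embed_dim, rng)

    # Linear projection: each flattened patch times the projection matrix.
    projected = [
        [sum(p * projection[i][j] for i, p in enumerate(patch))
         for j in range(embed_dim)]
        for patch in patches
    ]
    tokens = [class_token] + projected
    # Add one position embedding per token, class token included.
    return [
        [t + pos for t, pos in zip(token, position)]
        for token, position in zip(tokens, positions)
    ]


rng = random.Random(0)
image = [[rng.random() for _ in range(28)] for _ in range(28)]  # dummy image
sequence = embed_patches(image, patch_size=7, embed_dim=8, rng=rng)
print(len(sequence), len(sequence[0]))  # 17 8
```

The 28x28 image yields 16 patches, so the sequence passed to the encoder has 17 tokens once the class token is prepended.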
As in the 2017 version, the transformer encoder consists of multiple attention, normalization, and fully connected layers with residual (skip) connections, as shown in the right half of the figure.
In each attention block, multiple heads can capture different patterns of connectivity. If you are interested in learning more about transformers, I recommend reading the excellent article by Jay Alammar.
The fully connected MLP head at the output provides the desired class prediction. As is common today, the main model can of course be pre-trained on a large image dataset, and the final MLP head can then be fine-tuned for a specific task using standard transfer learning.
One notable feature of the new model is that, although the paper finds it more efficient than convolutional approaches at reaching the same prediction accuracy with less computation, its performance also seems to keep improving as it is trained on more and more data, more so than with other models.
The authors trained the Vision Transformer on the private Google JFT-300M dataset, which contains 300 million images, and achieved state-of-the-art accuracy on many benchmarks. One can hope that this pre-trained model will be released soon so that we can all try it out.
- Dataset: https://arxiv.org/abs/1707.02968
It is exciting to see this new application of neural attention in computer vision! Hopefully much more progress will be built on this development in the coming years.