
An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale (a brief review of an ICLR 2021 paper)

Author: Stan Kriventsov | Compiled by: Flin | Source: Medium

In this post, I want to explain the significance of the new paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," submitted to the ICLR 2021 conference (anonymous so far).

In another article, I walk through an example that uses this new model (called the Vision Transformer) with PyTorch to make predictions on the standard MNIST dataset.

Deep learning (machine learning using neural networks with more than one hidden layer) has been around since the 1960s, but what really brought it to the forefront was AlexNet in 2012: a convolutional network (put simply, a network that first looks for small patterns in each part of an image and then tries to combine them into an overall picture), designed by Alex Krizhevsky, which won that year's ImageNet image classification competition.

Over the next few years, deep computer vision went through a true revolution, with new convolutional architectures (GoogleNet, ResNet, DenseNet, EfficientNet, etc.) appearing every year and setting new accuracy records on ImageNet and other benchmark datasets such as CIFAR-10 and CIFAR-100.

The figure below shows the progress since 2011 in the best accuracy of machine learning models on the ImageNet dataset (top-1 accuracy: correctly predicting the content of an image on the first attempt).

In the past few years, however, the most interesting developments in deep learning have come not from images but from natural language processing (NLP), beginning with the 2017 paper "Attention Is All You Need" by Ashish Vaswani et al.

The idea of attention is to use trainable weights that model the importance of every connection between different parts of an input sentence. Attention did for NLP what convolutional networks did for computer vision: it dramatically improved the performance of machine learning models on language tasks such as natural language understanding and machine translation.
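To make this concrete, here is a minimal PyTorch sketch of scaled dot-product attention, the core operation from that paper. The tensor shapes are illustrative assumptions, not anything prescribed by the papers discussed here.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k) -- queries, keys, values
    d_k = q.size(-1)
    # Pairwise scores: how strongly each position attends to every other one
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)            # normalize into attention weights
    return weights @ v                             # weighted sum of the values

# Illustrative shapes: a batch of 2 "sentences", 10 tokens each, 64-dim embeddings
q = k = v = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 10, 64])
```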

What makes attention so effective for language data is that understanding human language often requires tracking long-range dependencies. We might first say "We arrived in New York" and later add "The weather in the city was nice." To any human reader it is obvious that "the city" in the second sentence refers to New York, but a model that only looks for patterns in nearby data (like a convolutional network) may be unable to detect this connection.

Long-range dependencies can be handled with recurrent networks such as LSTMs, which were in fact the top NLP models before transformers arrived, but even those models have a hard time matching up specific distant words.

The global attention in transformers measures the importance of every connection between any two words in a text, which explains their performance advantage. For sequential data where attention matters less (for example, time-series data such as daily sales or stock prices), recurrent networks remain very competitive and may still be the best choice.

While dependencies between distant elements are of particular importance in sequence tasks such as NLP, they cannot be ignored in image tasks either: forming a complete picture usually requires understanding how the various parts of an image relate to each other.

Until now, the reason attention models have not performed as well in computer vision has been the difficulty of scaling them: attention scales as N², so the full set of attention weights between the pixels of a 1000x1000 image (a million pixels) would have a trillion entries.

Perhaps more importantly, unlike words in a text, individual pixels do not carry much meaning by themselves, so connecting them through attention accomplishes little.

The new paper proposes applying attention not to pixels but to small patches of the image (probably 16x16, as in the title, although the optimal patch size actually depends on the image sizes and content the model works with).
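To make the patch idea concrete, here is a rough sketch of my own (not the authors' code), with illustrative sizes: a 224x224 image split into 16x16 patches yields 196 patches, each of which becomes one "word" for the transformer.

```python
import torch

# A dummy batch: 1 RGB image of 224x224 (sizes are illustrative assumptions)
img = torch.randn(1, 3, 224, 224)
patch = 16

# unfold carves the image into non-overlapping 16x16 patches
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)
print(patches.shape)  # torch.Size([1, 196, 768]) -- 196 "words" of dimension 768
```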

The figure above (from the paper) shows how the Vision Transformer works.

Each patch of the input image is flattened and passed through a linear projection matrix, and a position embedding is added to it (learned numbers that carry information about the patch's original position in the image). This is necessary because the transformer processes all its inputs regardless of order, so this information helps the model evaluate the attention weights correctly. An extra class token is prepended to the input (at position 0 in the sequence) to serve as a placeholder for the class to be predicted in the classification task.
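A minimal PyTorch sketch of that input pipeline; the class name PatchEmbedding and the dimensions (196 patches of dimension 768) are my illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, num_patches=196, patch_dim=768, d_model=768):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)  # linear projection of flattened patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))  # class token for position 0
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))  # learned positions

    def forward(self, patches):  # patches: (batch, num_patches, patch_dim)
        x = self.proj(patches)
        cls = self.cls_token.expand(x.size(0), -1, -1)  # one class token per image
        x = torch.cat([cls, x], dim=1)                  # prepend it at position 0
        return x + self.pos_embed                       # add the position information

emb = PatchEmbedding()
tokens = emb(torch.randn(1, 196, 768))
print(tokens.shape)  # torch.Size([1, 197, 768])
```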

Similar to the 2017 version, the transformer encoder consists of multiple attention, normalization, and fully connected layers with residual (skip) connections, as shown in the right half of the figure.
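For readers who want to see that structure in code, here is a sketch of one such encoder block built on PyTorch's nn.MultiheadAttention; the sizes (768-dim model, 12 heads, 3072-dim MLP) are illustrative choices of mine, not necessarily the paper's exact settings.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_mlp=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_mlp), nn.GELU(), nn.Linear(d_mlp, d_model)
        )

    def forward(self, x):
        # Residual (skip) connections around both the attention and the MLP sub-layers
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

block = EncoderBlock()
print(block(torch.randn(1, 197, 768)).shape)  # torch.Size([1, 197, 768])
```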

Within each attention layer, multiple heads can capture different patterns of connections. If you are interested in learning more about how transformers work, I recommend reading Jay Alammar's wonderful article on them.

The fully connected MLP head at the output provides the desired class prediction. Naturally, as is common nowadays, the main model can be pre-trained on a large image dataset, and the final MLP head can then be fine-tuned to a specific task using standard transfer learning methods.
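As a sketch of that transfer learning step (the backbone below is a trivial stand-in of my own, since the real pre-trained weights are not yet public): freeze the pre-trained body and train only a new task-specific head.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained Vision Transformer body producing a 768-dim
# feature for the class token (hypothetical; a real model would be loaded
# from released weights once they are available)
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))

head = nn.Linear(768, 10)  # new MLP head for a 10-class task, e.g. MNIST

for p in backbone.parameters():  # freeze the pre-trained body...
    p.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)  # ...train only the head

logits = head(backbone(torch.randn(1, 3, 224, 224)))
print(logits.shape)  # torch.Size([1, 10])
```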

One of the remarkable features of the new model is that, according to the paper, it not only matches the prediction accuracy of convolutional approaches with less computation, but its performance also seems to keep improving as it is trained on more and more data, more so than for other models.

The authors trained the Vision Transformer on Google's private JFT-300M dataset of 300 million images, achieving state-of-the-art accuracy on a number of benchmarks. One can hope that this pre-trained model will be released soon so that we can all try it.

It is very exciting to see this new application of neural attention in the field of computer vision, and hopefully even greater progress will be built on this development in the coming years!

Original link: https://medium.com/swlh/an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale-brief-review-of-the-8770a636c6a8

