In a previous post, we discussed how the Transformer architecture took the NLP world by storm, and how it’s everywhere now from Bert to ChatGPT and GPT-4. Will they work for vision? This was what a team at Google wondered and tried the simplest thing possible: split the image into patches and treat the patches as if they were words of a sentence This is the Vision Transformer pictured below (image from the original paper):


Vision Transformer pictured


The image is broken up into 16X16 patches and then the patches are fed in as input via a linear projection. Since all two-dimensional information about the image is lost, positional information is tagged on.

Surprisingly, this very simple adaptation leads to results that beat the best CNNs! Provided there is enough data: when trained with the ImageNet-21k dataset, the Vision Transformer beats all previous methods on multiple image recognition benchmarks. Thus, a picture is worth 16X16 words!

What’s going on? How can it beat the carefully crafted CNN architectures that are designed to take into account the translational invariance of natural images? This is another example of Richard Sutton’s bitter lesson of AI “the great power of general purpose methods, of methods that continue to scale with increased computation”.  When enough data and compute is available, the Vision Transformer overcomes the inductive bias of the CNN architectures. Andrej Karpathy described the Transformer as a “general purpose differentiable machine” and this is illustrated yet again with the Vision Transformer.

When enough data or compute is not available, then the inductive bias of convolutions in CNNs will have advantage.  Also, the Vision Transformer  is a large model and is going to be demanding resource-wise. So, to benefit from it on small resource constrained devices, we need to optimize it in a hardware-aware fashion. The architecture is very heterogeneous with different types of components – the central multi-head attention, the MLP layers, the embedding layers etc. When applied with vision, we also have additional parameters to tune such as  number of pixels per token and size of positional embedding. Thus there are many opportunities for optimization and this is where Embedl Comes in!

Stay tuned for more insights into transformers in our upcoming post!

You may also like

What do Vision Transformers See?
What do Vision Transformers See?
22 November, 2023

CNNs have long been the workhorses of vision ever since they achieved the dramatic breakthroughs of super-human performa...

Keys to Success: Algorithms, Optimization and a World Class Team!
Keys to Success: Algorithms, Optimization and a World Class Team!
18 August, 2023

The results are in from the Edge AI and Vision Alliance's 9th annual Computer Vision Developer Survey 2022. For the past...

Making Transformers Efficient
Making Transformers Efficient
14 July, 2023

While the Transformer architecture has taken the natural language processing (NLP) world by storm – witness the amazing ...