CNNs have long been the workhorses of computer vision, ever since AlexNet's
dramatic breakthrough to super-human performance in 2012. But recently,
vision transformers (ViTs) have been changing the picture.
CNNs have an inductive spatial bias baked into them by their
convolutional kernels, whereas vision transformers are built on a
much more general architecture. In fact, the first vision transformers
took an architecture from NLP tasks without change and simply
chopped up the input image into a sequence of patches in the most
naïve way possible. Nevertheless, given enough data, they beat CNNs
by learning the spatial structure instead of having it built in. This may be
another example of Rich Sutton’s famous “bitter lesson” of AI: “building in
how we think we think does not work in the long run … breakthrough
progress eventually arrives by an opposing approach based on scaling
computation by search and learning.”
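To make the “naïve” tokenization concrete, here is a minimal sketch of how an image can be chopped into a sequence of flattened patches, in the spirit of the original ViT. This is an illustrative reimplementation using numpy; the function name and shapes are our own choices, and real pipelines add a learned linear projection and position embeddings on top.

```python
import numpy as np

def patchify(image, patch_size):
    """Chop an H x W x C image into a row-by-row sequence of
    flattened patches -- the naive tokenization used by ViT."""
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "dims must divide by patch size"
    # group pixels into a (rows, cols) grid of p x p patches,
    # then flatten each patch into a single token vector
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return patches

# a 224 x 224 RGB image with 16 x 16 patches gives 196 tokens of dim 768
tokens = patchify(np.zeros((224, 224, 3)), 16)
print(tokens.shape)  # (196, 768)
```

Each of those 196 token vectors is then treated exactly like a word embedding in an NLP transformer, which is what makes the architecture so general.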
It turns out that vision transformers “see” very differently from CNNs. A
team from Google Brain studied the representations produced by the
two architectures very carefully. While it is folklore that CNNs start with
very low level local information and gradually build up more global
structures in the deeper layers, ViTs already have global information at
the earliest layer thanks to global self-attention. As pithily summarized in
a Quanta article, “If a CNN’s approach is like starting at a single pixel and
zooming out, a transformer slowly brings the whole fuzzy image into
focus.” Another interesting observation to emerge from that study is that
skip connections are very important for ViTs.
ViTs are causing great excitement for several reasons beyond
surpassing the performance of CNNs. Because of their general-purpose
architecture, they offer the potential for a single uniform solution
to all vision tasks in one go, rather than crafting different solutions for
different tasks. While previous approaches had to handle different types
of relationships – pixel to pixel versus pixel to object or object to object –
differently, transformers handle all of these relationships in a
uniform way. Another aspect that is becoming increasingly
important is that this uniformity means that multi-modal inputs are also
very well suited to transformers – so image and text inputs can be
handled in the same model.
So we are soon entering the era of “foundation models” in vision and
multi-modal learning, just as GPT-style models became the foundation
models of NLP. These behemoths will have hundreds of billions of
parameters, dwarfing the previous generation of ResNet models with
their tens of millions of parameters.
This means that model compression will be ever more important for
bringing the benefits of these large models to small edge devices! Enter Embedl!
The good news is that our experiments, along with several recent
papers, have shown that compression methods such as pruning and
especially quantization seem to be much more effective for ViTs than
they were for CNNs. In pilot projects with the largest tier-one suppliers
worldwide, we have recently demonstrated very impressive results for
compressing the most widely used ViT models.
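To give a flavor of what quantization does, here is a minimal sketch of symmetric int8 post-training quantization of a single weight matrix, using numpy. This is a toy illustration of the general idea, not our production method; real pipelines use refinements such as per-channel scales and calibration data.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric quantization of a float tensor to int8.
    Returns the quantized weights plus the scale for dequantizing."""
    scale = np.abs(w).max() / 127.0  # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

# quantize one transformer-sized weight matrix: 4x smaller than float32
w = np.random.randn(768, 768).astype(np.float32)
q, scale = quantize_int8(w)
error = np.abs(q.astype(np.float32) * scale - w).max()
```

The worst-case rounding error stays below one quantization step (the scale), which is why large, redundant models often tolerate int8 weights with little accuracy loss.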
Please join us in our Webinar on Dec. 7 to learn more about ViTs and
how to optimize them!