Vision Transformers: Applying Transformer Architectures to Image Recognition and Generation Tasks
Transformers changed natural language processing by replacing recurrence with attention. The same idea now drives many of the strongest computer vision systems through Vision Transformers (ViTs). Instead of treating images as grids of pixels handled mainly by convolutions, ViTs treat an image as a sequence of visual "tokens" and learn...
