An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
- URL: http://arxiv.org/abs/2406.09415v1
- Date: Thu, 13 Jun 2024 17:59:58 GMT
- Title: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
- Authors: Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain, Martin R. Oswald, Cees G. M. Snoek, Xinlei Chen
- Abstract summary: Vanilla Transformers can operate by treating each individual pixel as a token and achieve highly performant results.
We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision.
- Score: 65.64402188506644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias of locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias from ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a token). We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision: supervised learning for object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although directly operating on individual pixels is less computationally practical, we believe the community must be aware of this surprising piece of knowledge when devising the next generation of neural architectures for computer vision.
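The abstract contrasts two tokenizations of the same image: every pixel as its own token versus the ViT-style grouping of each 16x16 patch into a token. Below is a minimal, hypothetical PyTorch sketch of that contrast; it is not the authors' implementation, and the function names, embedding widths, and `nn.Linear` projections are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): pixels-as-tokens vs. 16x16 patches-as-tokens.
import torch
import torch.nn as nn


def pixels_as_tokens(images: torch.Tensor, embed: nn.Linear) -> torch.Tensor:
    """Treat every pixel as one token: (B, C, H, W) -> (B, H*W, D)."""
    tokens = images.flatten(2).transpose(1, 2)   # (B, H*W, C)
    return embed(tokens)                         # project C channels to model width D


def patches_as_tokens(images: torch.Tensor, embed: nn.Linear, p: int = 16) -> torch.Tensor:
    """ViT-style tokenization: each non-overlapping PxP patch becomes one token."""
    b, c, h, w = images.shape
    patches = images.unfold(2, p, p).unfold(3, p, p)            # (B, C, H/P, W/P, P, P)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
    return embed(patches)                                       # (B, H*W/P^2, D)


if __name__ == "__main__":
    imgs = torch.randn(2, 3, 32, 32)   # tiny images keep the pixel sequence short
    d_model = 192                      # assumed embedding width
    pixel_tokens = pixels_as_tokens(imgs, nn.Linear(3, d_model))
    patch_tokens = patches_as_tokens(imgs, nn.Linear(3 * 16 * 16, d_model), p=16)
    print(pixel_tokens.shape)   # torch.Size([2, 1024, 192]): 32*32 pixel tokens
    print(patch_tokens.shape)   # torch.Size([2, 4, 192]): (32/16)^2 patch tokens
```

Either token sequence can then be fed to a standard Transformer encoder; the pixel variant simply yields a sequence that is P^2 times longer, which is the source of the computational cost the abstract acknowledges.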
Related papers
- Semantic Segmentation Enhanced Transformer Model for Human Attention Prediction [8.47446520519624]
Saliency Prediction aims to predict the attention distribution of human eyes given an RGB image.
Most of the recent state-of-the-art methods are based on deep image feature representations from traditional CNNs.
We propose a Transformer-based method with semantic segmentation as another learning objective.
arXiv Detail & Related papers (2023-01-26T10:27:51Z)
- Knowledge Distillation via the Target-aware Transformer [83.03578375615614]
We propose a novel one-to-all spatial matching knowledge distillation approach.
Specifically, we allow each pixel of the teacher feature to be distilled to all spatial locations of the student features.
Our approach surpasses the state-of-the-art methods by a significant margin on various computer vision benchmarks.
arXiv Detail & Related papers (2022-05-22T10:26:54Z)
- Masked Visual Pre-training for Motor Control [118.18189211080225]
Self-supervised visual pre-training from real-world images is effective for learning motor control tasks from pixels.
We freeze the visual encoder and train neural network controllers on top with reinforcement learning.
This is the first self-supervised model to exploit real-world images at scale for motor control.
arXiv Detail & Related papers (2022-03-11T18:58:10Z)
- Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics [13.7258515433446]
Self-supervised monocular depth estimation is an important task in 3D scene understanding.
We show how to adapt vision transformers for self-supervised monocular depth estimation.
Our study demonstrates how a transformer-based architecture achieves comparable performance while being more robust and generalizable.
arXiv Detail & Related papers (2022-02-07T13:17:29Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- A Survey on Visual Transformer [126.56860258176324]
The Transformer is a type of deep neural network based mainly on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [112.94212299087653]
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
arXiv Detail & Related papers (2020-10-22T17:55:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.