An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
- URL: http://arxiv.org/abs/2406.09415v1
- Date: Thu, 13 Jun 2024 17:59:58 GMT
- Title: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
- Authors: Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain, Martin R. Oswald, Cees G. M. Snoek, Xinlei Chen
- Abstract summary: Vanilla Transformers can operate by treating each individual pixel as a token and achieve highly performant results.
We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision.
- Score: 65.64402188506644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias of locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias from ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a token). We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision: supervised learning for object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although directly operating on individual pixels is less computationally practical, we believe the community must be aware of this surprising piece of knowledge when devising the next generation of neural architectures for computer vision.
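The abstract contrasts two tokenizations of the same image: every pixel as its own token versus the ViT-style grouping of each 16x16 patch into a token. Below is a minimal, hypothetical PyTorch sketch of that contrast; it is not the authors' implementation, and the function names, embedding widths, and `nn.Linear` projections are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): pixels-as-tokens vs. 16x16 patches-as-tokens.
import torch
import torch.nn as nn


def pixels_as_tokens(images: torch.Tensor, embed: nn.Linear) -> torch.Tensor:
    """Treat every pixel as one token: (B, C, H, W) -> (B, H*W, D)."""
    tokens = images.flatten(2).transpose(1, 2)   # (B, H*W, C)
    return embed(tokens)                         # project C channels to model width D


def patches_as_tokens(images: torch.Tensor, embed: nn.Linear, p: int = 16) -> torch.Tensor:
    """ViT-style tokenization: each non-overlapping PxP patch becomes one token."""
    b, c, h, w = images.shape
    patches = images.unfold(2, p, p).unfold(3, p, p)            # (B, C, H/P, W/P, P, P)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
    return embed(patches)                                       # (B, H*W/P^2, D)


if __name__ == "__main__":
    imgs = torch.randn(2, 3, 32, 32)   # tiny images keep the pixel sequence short
    d_model = 192                      # assumed embedding width
    pixel_tokens = pixels_as_tokens(imgs, nn.Linear(3, d_model))
    patch_tokens = patches_as_tokens(imgs, nn.Linear(3 * 16 * 16, d_model), p=16)
    print(pixel_tokens.shape)   # torch.Size([2, 1024, 192]): 32*32 pixel tokens
    print(patch_tokens.shape)   # torch.Size([2, 4, 192]): (32/16)^2 patch tokens
```

Either token sequence can then be fed to a standard Transformer encoder; the pixel variant simply yields a sequence that is P^2 times longer, which is the source of the computational cost the abstract acknowledges.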
Related papers
- Semantic Segmentation Enhanced Transformer Model for Human Attention Prediction [8.47446520519624]
Saliency Prediction aims to predict the attention distribution of human eyes given an RGB image.
Most of the recent state-of-the-art methods are based on deep image feature representations from traditional CNNs.
We propose a Transformer-based method with semantic segmentation as another learning objective.
arXiv Detail & Related papers (2023-01-26T10:27:51Z)
- Knowledge Distillation via the Target-aware Transformer [83.03578375615614]
We propose a novel one-to-all spatial matching knowledge distillation approach.
Specifically, we allow each pixel of the teacher feature to be distilled to all spatial locations of the student features.
Our approach surpasses the state-of-the-art methods by a significant margin on various computer vision benchmarks.
arXiv Detail & Related papers (2022-05-22T10:26:54Z)
- Masked Visual Pre-training for Motor Control [118.18189211080225]
Self-supervised visual pre-training from real-world images is effective for learning motor control tasks from pixels.
We freeze the visual encoder and train neural network controllers on top with reinforcement learning.
This is the first self-supervised model to exploit real-world images at scale for motor control.
arXiv Detail & Related papers (2022-03-11T18:58:10Z)
- Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics [13.7258515433446]
Self-supervised monocular depth estimation is an important task in 3D scene understanding.
We show how to adapt vision transformers for self-supervised monocular depth estimation.
Our study demonstrates how a transformer-based architecture achieves comparable performance while being more robust and generalizable.
arXiv Detail & Related papers (2022-02-07T13:17:29Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- A Survey on Visual Transformer [126.56860258176324]
The Transformer is a type of deep neural network based mainly on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [112.94212299087653]
Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
arXiv Detail & Related papers (2020-10-22T17:55:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.