Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens
- URL: http://arxiv.org/abs/2412.04680v1
- Date: Fri, 06 Dec 2024 00:38:36 GMT
- Title: Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens
- Authors: Jaihyun Lew, Soohyuk Jang, Jaehoon Lee, Seungryong Yoo, Eunji Kim, Saehyung Lee, Jisoo Mok, Siwon Kim, Sungroh Yoon
- Abstract summary: We propose to substitute the grid-based tokenization in Vision Transformer with superpixel tokenization.
Our approach, which exhibits strong compatibility with existing frameworks, enhances the accuracy and robustness of ViT on various downstream tasks.
- Score: 38.31045722878938
- Abstract: Transformers, a groundbreaking architecture proposed for Natural Language Processing (NLP), have also achieved remarkable success in Computer Vision. A cornerstone of their success lies in the attention mechanism, which models relationships among tokens. While the tokenization process in NLP inherently ensures that a single token does not contain multiple semantics, the tokenization of Vision Transformer (ViT) utilizes tokens from uniformly partitioned square image patches, which may result in an arbitrary mixing of visual concepts in a token. In this work, we propose to substitute the grid-based tokenization in ViT with superpixel tokenization, which employs superpixels to generate a token that encapsulates a sole visual concept. Unfortunately, the diverse shapes, sizes, and locations of superpixels make integrating superpixels into ViT tokenization rather challenging. Our tokenization pipeline, comprising pre-aggregate extraction and superpixel-aware aggregation, overcomes the challenges that arise in superpixel tokenization. Extensive experiments demonstrate that our approach, which exhibits strong compatibility with existing frameworks, enhances the accuracy and robustness of ViT on various downstream tasks.
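To make the idea concrete, here is a minimal sketch of superpixel tokenization, assuming SLIC superpixels from scikit-image and simple per-superpixel mean pooling in place of the paper's pre-aggregate extraction and superpixel-aware aggregation; the function name and parameters are illustrative, not the authors' implementation.

```python
import numpy as np
import torch
from skimage.segmentation import slic

def superpixel_tokens(image: np.ndarray, features: torch.Tensor, n_segments: int = 196) -> torch.Tensor:
    """image: (H, W, 3) float array in [0, 1]; features: (H, W, C) per-pixel features.
    Returns one token per superpixel, whatever the superpixel's shape, size, or location."""
    # Partition the image into superpixels that follow color boundaries, so each
    # segment tends to cover a single visual concept (SLIC is a stand-in here).
    segments = slic(image, n_segments=n_segments, compactness=10.0, start_label=0)
    labels = torch.from_numpy(segments).reshape(-1).long()       # (H*W,)
    feats = features.reshape(-1, features.shape[-1])             # (H*W, C)
    n_tokens = int(labels.max()) + 1
    # Aggregate pixel features into one token per superpixel (mean pooling stands
    # in for the paper's superpixel-aware aggregation).
    sums = torch.zeros(n_tokens, feats.shape[-1]).index_add_(0, labels, feats)
    counts = torch.zeros(n_tokens).index_add_(0, labels, torch.ones(labels.shape[0]))
    return sums / counts.clamp(min=1.0).unsqueeze(-1)            # (n_tokens, C)
```

Because each token pools pixels from exactly one superpixel, no token mixes visual concepts across a boundary; positional information can be recovered from, e.g., superpixel centroids before the tokens enter the ViT encoder.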
Related papers
- A Spitting Image: Modular Superpixel Tokenization in Vision Transformers [0.0]
Vision Transformer (ViT) architectures traditionally employ a grid-based approach to tokenization that is independent of the semantic content of an image.
We propose a modular superpixel tokenization strategy which decouples tokenization and feature extraction.
arXiv Detail & Related papers (2024-08-14T17:28:58Z)
- Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.
This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok).
SeTok groups visual features into semantic units via a dynamic clustering algorithm.
The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z)
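The grouping step summarized in the SeTok entry above can be pictured with a toy sketch: plain k-means over patch features followed by per-cluster mean pooling. Everything here is an assumption; in particular, SeTok chooses the number of semantic units dynamically per image, which this fixed-k version does not reproduce.

```python
import torch

def group_into_semantic_units(patch_feats: torch.Tensor, n_units: int, iters: int = 10) -> torch.Tensor:
    """Group (N, C) patch features into n_units tokens via k-means plus mean pooling.
    Toy stand-in for SeTok's dynamic clustering (which picks n_units per image)."""
    # Initialize centroids from randomly chosen patches.
    centroids = patch_feats[torch.randperm(patch_feats.shape[0])[:n_units]].clone()
    for _ in range(iters):
        # Assign every patch to its nearest semantic unit.
        assign = torch.cdist(patch_feats, centroids).argmin(dim=1)   # (N,)
        for k in range(n_units):
            members = patch_feats[assign == k]
            if members.shape[0] > 0:
                centroids[k] = members.mean(dim=0)                   # pooled token for unit k
    return centroids                                                 # (n_units, C)
```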
- Long-Range Grouping Transformer for Multi-View 3D Reconstruction [9.2709012704338]
Long-range grouping attention (LGA), based on the divide-and-conquer principle, is proposed.
With LGA, an effective and efficient encoder can be established that connects inter-view features.
A novel progressive upsampling decoder is also designed for voxel generation with relatively high resolution.
arXiv Detail & Related papers (2023-08-17T01:34:59Z)
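One plausible reading of the divide-and-conquer grouping in the entry above is sketched below: tokens from several views are regrouped so that full attention runs only inside small inter-view groups. The grouping rule used here (same spatial index across views) and the module name are assumptions, not the paper's published definition.

```python
import torch
import torch.nn as nn

class GroupedAttention(nn.Module):
    """Divide-and-conquer attention sketch: split N = n_views * t tokens into t
    groups, one token per view in each group, and attend within each group only."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, n_views: int) -> torch.Tensor:
        b, n, c = x.shape
        t = n // n_views  # tokens per view; assumes n divides evenly
        # Regroup so each group holds the token at one spatial index from every view.
        g = x.view(b, n_views, t, c).transpose(1, 2).reshape(b * t, n_views, c)
        out, _ = self.attn(g, g, g)  # full attention, but only inside each group
        return out.view(b, t, n_views, c).transpose(1, 2).reshape(b, n, c)
```

With V views of t tokens each, every attention call covers only V tokens, so the cost scales like t*V^2 rather than the (t*V)^2 of global attention.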
- Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning [28.180891300826165]
Many advanced approaches have been developed to reduce the total number of tokens in large-scale vision transformers.
We present two non-parametric operators, a token clustering layer to decrease the number of tokens and a token reconstruction layer to increase the number of tokens.
Results are promising on five dense prediction tasks, including object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation.
arXiv Detail & Related papers (2022-10-03T15:49:48Z)
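The two operators in the entry above can be pictured as a pooling/unpooling pair. The sketch below is deliberately simplified: hard cluster assignments with mean pooling for the clustering layer and copy-back for the reconstruction layer; how the assignments are produced, and the actual reconstruction rule, are left out and should not be read as the paper's method.

```python
import torch

def token_cluster(tokens: torch.Tensor, assign: torch.Tensor, n_out: int) -> torch.Tensor:
    """Non-parametric reduction: average (N, C) tokens into n_out cluster tokens
    given hard assignments (N,) with values in [0, n_out)."""
    sums = torch.zeros(n_out, tokens.shape[-1]).index_add_(0, assign, tokens)
    counts = torch.zeros(n_out).index_add_(0, assign, torch.ones(tokens.shape[0]))
    return sums / counts.clamp(min=1.0).unsqueeze(-1)

def token_reconstruct(clustered: torch.Tensor, assign: torch.Tensor) -> torch.Tensor:
    """Non-parametric expansion back to N tokens: each original position copies its
    cluster's token, restoring a full-resolution token set for dense prediction."""
    return clustered[assign]  # (N, C)
```

Because neither operator has learned weights, the pair can in principle be slotted between blocks of a pretrained ViT, which matches the "without fine-tuning" claim in the title.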
- Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer [91.49837514935051]
We propose a novel Vision Transformer, called Token Clustering Transformer (TCFormer).
TCFormer merges tokens by progressive clustering, where the tokens can be merged from different locations with flexible shapes and sizes.
Experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets.
arXiv Detail & Related papers (2022-04-19T05:38:16Z)
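A toy rendering of progressive merging, assuming mutual-nearest-neighbor pairing in feature space; TCFormer's actual clustering is more sophisticated, so every detail below is illustrative.

```python
import torch

def progressive_merge(tokens: torch.Tensor, stages: int = 2) -> torch.Tensor:
    """Each stage averages mutual nearest-neighbor token pairs into single tokens,
    so merged tokens can come from different locations with flexible shapes and sizes."""
    for _ in range(stages):
        n = tokens.shape[0]
        dist = torch.cdist(tokens, tokens)
        dist.fill_diagonal_(float("inf"))        # a token cannot pair with itself
        nn_idx = dist.argmin(dim=1)              # nearest neighbor of each token
        idx = torch.arange(n)
        mutual = nn_idx[nn_idx] == idx           # i and nn_idx[i] picked each other
        first = mutual & (idx < nn_idx)          # count each mutual pair once
        merged = 0.5 * (tokens[first] + tokens[nn_idx[first]])
        tokens = torch.cat([tokens[~mutual], merged], dim=0)
    return tokens
```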
- Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers [51.581926074686535]
We present a new perspective on image synthesis by viewing the task as a visual token generation problem.
The proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks.
arXiv Detail & Related papers (2021-11-05T12:57:50Z)
- XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images.
arXiv Detail & Related papers (2021-06-17T17:33:35Z)
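The blurb above gives the key mechanics, so a compact single-head sketch is possible: attention is computed between the C feature channels instead of the N tokens. The per-head temperatures and multi-head split of the actual XCiT code are collapsed into a scalar here, and the class name is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XCA(nn.Module):
    """Cross-covariance attention sketch: a (C x C) attention map over channels,
    so cost is linear in the number of tokens N."""
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.temperature = nn.Parameter(torch.ones(1))  # learnable scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)                # each (B, N, C)
        q = F.normalize(q, dim=1)                             # L2-normalize along tokens
        k = F.normalize(k, dim=1)
        attn = (q.transpose(-2, -1) @ k) * self.temperature   # (B, C, C)
        attn = attn.softmax(dim=-1)
        out = (attn @ v.transpose(-2, -1)).transpose(-2, -1)  # back to (B, N, C)
        return self.proj(out)
```

Since the attention map is C x C rather than N x N, doubling the image resolution quadruples N but leaves the attention map size unchanged, which is what makes high-resolution processing efficient.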
- Visual Transformers: Token-based Image Representation and Processing for Computer Vision [67.55770209540306]
Visual Transformer (VT) operates in a semantic token space, judiciously attending to different image parts based on context.
Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts.
For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
arXiv Detail & Related papers (2020-06-05T20:49:49Z)
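One way to picture "operating in a semantic token space" is a filter-based tokenizer that pools a convolutional feature map into a few learned tokens via a spatial softmax. The sketch below is a guess at the general shape of such a tokenizer, with illustrative names, not the authors' code.

```python
import torch
import torch.nn as nn

class SemanticTokenizer(nn.Module):
    """Pool a (B, C, H, W) feature map into K semantic tokens: a learned map scores
    every spatial position's affinity to each token, and a softmax over space
    turns those scores into pooling weights."""
    def __init__(self, dim: int, n_tokens: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, n_tokens)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = feat.flatten(2).transpose(1, 2)      # (B, HW, C)
        attn = self.score(x).softmax(dim=1)      # (B, HW, K): softmax over space
        return attn.transpose(1, 2) @ x          # (B, K, C) semantic tokens
```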
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.