Not All Tokens Are Equal: Human-centric Visual Analysis via Token
Clustering Transformer
- URL: http://arxiv.org/abs/2204.08680v3
- Date: Thu, 21 Apr 2022 14:50:57 GMT
- Title: Not All Tokens Are Equal: Human-centric Visual Analysis via Token
Clustering Transformer
- Authors: Wang Zeng, Sheng Jin, Wentao Liu, Chen Qian, Ping Luo, Wanli Ouyang,
and Xiaogang Wang
- Abstract summary: We propose a novel Vision Transformer, called the Token Clustering Transformer (TCFormer).
TCFormer merges tokens by progressive clustering, where the tokens can be merged from different locations with flexible shapes and sizes.
Experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets.
- Score: 91.49837514935051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers have achieved great success in many computer vision
tasks. Most methods generate vision tokens by splitting an image into a regular
and fixed grid and treating each cell as a token. However, not all regions are
equally important in human-centric vision tasks, e.g., the human body needs a
fine representation with many tokens, while the image background can be modeled
by a few tokens. To address this problem, we propose a novel Vision
Transformer, called Token Clustering Transformer (TCFormer), which merges
tokens by progressive clustering, where the tokens can be merged from different
locations with flexible shapes and sizes. The tokens in TCFormer can not only
focus on important areas but also adjust their shapes to fit semantic
concepts and adopt a fine resolution for regions containing critical details,
which is beneficial for capturing detailed information. Extensive experiments
show that TCFormer consistently outperforms its counterparts on different
challenging human-centric tasks and datasets, including whole-body pose
estimation on COCO-WholeBody and 3D human mesh reconstruction on 3DPW. Code is
available at https://github.com/zengwang430521/TCFormer.git
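The core token-merging step can be illustrated with a short sketch. The snippet below is a minimal, hedged illustration in PyTorch, not the authors' implementation: it merges patch tokens with a plain k-means over token features and averages the members of each cluster, whereas TCFormer's progressive clustering (see the repository above) is more elaborate. The function name, tensor shapes, and hyperparameters are assumptions chosen for illustration.
```python
# Illustrative sketch only: merge tokens by clustering their features and
# averaging within each cluster. Not the TCFormer implementation.
import torch


def merge_tokens_by_clustering(tokens: torch.Tensor, num_clusters: int,
                               iters: int = 10) -> torch.Tensor:
    """Merge N tokens into `num_clusters` tokens by averaging within clusters.

    tokens: (N, C) token features for one image; returns (num_clusters, C).
    """
    n, _ = tokens.shape
    # Initialize cluster centers from a random subset of tokens.
    centers = tokens[torch.randperm(n)[:num_clusters]].clone()
    for _ in range(iters):
        # Assign each token to its nearest center.
        assign = torch.cdist(tokens, centers).argmin(dim=1)   # (N,)
        # Recompute each center as the mean of its assigned tokens.
        for k in range(num_clusters):
            members = tokens[assign == k]
            if members.numel() > 0:
                centers[k] = members.mean(dim=0)
    return centers


# Example: merge 196 patch tokens (a 14x14 grid) with 64-dim features into 49 tokens.
merged = merge_tokens_by_clustering(torch.randn(196, 64), num_clusters=49)
print(merged.shape)  # torch.Size([49, 64])
```
Because clusters are formed over feature similarity rather than a fixed grid, the merged tokens can cover non-adjacent regions with flexible shapes, which is the property the abstract emphasizes.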
Related papers
- ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z) - TCFormer: Visual Recognition via Token Clustering Transformer [79.24723479088097]
We propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning.
Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens.
arXiv Detail & Related papers (2024-07-16T02:26:18Z) - Long-Range Grouping Transformer for Multi-View 3D Reconstruction [9.2709012704338]
Long-range grouping attention (LGA) based on the divide-and-conquer principle is proposed.
An effective and efficient encoder can be established that connects inter-view features.
A novel progressive upsampling decoder is also designed for voxel generation with relatively high resolution.
arXiv Detail & Related papers (2023-08-17T01:34:59Z) - Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
In addition, we design an STViT-R(ecover) network to restore the detailed spatial information based on STViT, making it work for downstream tasks.
arXiv Detail & Related papers (2023-03-15T15:12:36Z) - UMIFormer: Mining the Correlations between Similar Tokens for Multi-View
3D Reconstruction [9.874357856580447]
We propose a novel transformer network for Unstructured Multiple Images (UMIFormer).
It exploits transformer blocks for decoupled intra-view encoding and designed blocks for token rectification.
All tokens acquired from various branches are compressed into a fixed-size compact representation.
arXiv Detail & Related papers (2023-02-27T17:27:45Z) - Vision Transformer with Super Token Sampling [93.70963123497327]
Vision transformer has achieved impressive performance for many vision tasks.
It may suffer from high redundancy in capturing local features in shallow layers.
Super tokens attempt to provide a semantically meaningful tessellation of visual content.
arXiv Detail & Related papers (2022-11-21T03:48:13Z) - Improving Visual Quality of Image Synthesis by A Token-based Generator
with Transformers [51.581926074686535]
We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem.
The proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks.
arXiv Detail & Related papers (2021-11-05T12:57:50Z) - DynamicViT: Efficient Vision Transformers with Dynamic Token
Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves the throughput by over 40% (a hedged sketch of this score-and-prune idea follows this list).
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.