Not All Tokens Are Equal: Human-centric Visual Analysis via Token
Clustering Transformer
- URL: http://arxiv.org/abs/2204.08680v3
- Date: Thu, 21 Apr 2022 14:50:57 GMT
- Title: Not All Tokens Are Equal: Human-centric Visual Analysis via Token
Clustering Transformer
- Authors: Wang Zeng, Sheng Jin, Wentao Liu, Chen Qian, Ping Luo, Wanli Ouyang,
and Xiaogang Wang
- Abstract summary: We propose a novel Vision Transformer, called the Token Clustering Transformer (TCFormer).
TCFormer merges tokens by progressive clustering, where the tokens can be merged from different locations with flexible shapes and sizes.
Experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets.
- Score: 91.49837514935051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers have achieved great success in many computer vision
tasks. Most methods generate vision tokens by splitting an image into a regular
and fixed grid and treating each cell as a token. However, not all regions are
equally important in human-centric vision tasks, e.g., the human body needs a
fine representation with many tokens, while the image background can be modeled
by a few tokens. To address this problem, we propose a novel Vision
Transformer, called Token Clustering Transformer (TCFormer), which merges
tokens by progressive clustering, where the tokens can be merged from different
locations with flexible shapes and sizes. The tokens in TCFormer can not only
focus on important areas but also adjust their shapes to fit semantic
concepts and adopt a fine resolution for regions containing critical details,
which is beneficial for capturing detailed information. Extensive experiments
show that TCFormer consistently outperforms its counterparts on different
challenging human-centric tasks and datasets, including whole-body pose
estimation on COCO-WholeBody and 3D human mesh reconstruction on 3DPW. Code is
available at https://github.com/zengwang430521/TCFormer.git
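The core token-merging step can be illustrated with a short sketch. The snippet below is a minimal, hedged illustration in PyTorch, not the authors' implementation: it merges patch tokens with a plain k-means over token features and averages the members of each cluster, whereas TCFormer's progressive clustering (see the repository above) is more elaborate. The function name, tensor shapes, and hyperparameters are assumptions chosen for illustration.
```python
# Illustrative sketch only: merge tokens by clustering their features and
# averaging within each cluster. Not the TCFormer implementation.
import torch


def merge_tokens_by_clustering(tokens: torch.Tensor, num_clusters: int,
                               iters: int = 10) -> torch.Tensor:
    """Merge N tokens into `num_clusters` tokens by averaging within clusters.

    tokens: (N, C) token features for one image; returns (num_clusters, C).
    """
    n, _ = tokens.shape
    # Initialize cluster centers from a random subset of tokens.
    centers = tokens[torch.randperm(n)[:num_clusters]].clone()
    for _ in range(iters):
        # Assign each token to its nearest center.
        assign = torch.cdist(tokens, centers).argmin(dim=1)   # (N,)
        # Recompute each center as the mean of its assigned tokens.
        for k in range(num_clusters):
            members = tokens[assign == k]
            if members.numel() > 0:
                centers[k] = members.mean(dim=0)
    return centers


# Example: merge 196 patch tokens (a 14x14 grid) with 64-dim features into 49 tokens.
merged = merge_tokens_by_clustering(torch.randn(196, 64), num_clusters=49)
print(merged.shape)  # torch.Size([49, 64])
```
Because clusters are formed over feature similarity rather than a fixed grid, the merged tokens can cover non-adjacent regions with flexible shapes, which is the property the abstract emphasizes.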
Related papers
- ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z) - TCFormer: Visual Recognition via Token Clustering Transformer [79.24723479088097]
We propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning.
Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens.
arXiv Detail & Related papers (2024-07-16T02:26:18Z) - Long-Range Grouping Transformer for Multi-View 3D Reconstruction [9.2709012704338]
Long-range grouping attention (LGA) based on the divide-and-conquer principle is proposed.
An effective and efficient encoder can be established that connects inter-view features.
A novel progressive upsampling decoder is also designed for voxel generation with relatively high resolution.
arXiv Detail & Related papers (2023-08-17T01:34:59Z) - Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
In addition, we design an STViT-R(ecover) network to restore the detailed spatial information based on STViT, making it work for downstream tasks.
arXiv Detail & Related papers (2023-03-15T15:12:36Z) - UMIFormer: Mining the Correlations between Similar Tokens for Multi-View
3D Reconstruction [9.874357856580447]
We propose a novel transformer network for Unstructured Multiple Images (UMIFormer).
It exploits transformer blocks for decoupled intra-view encoding and designed blocks for token rectification.
All tokens acquired from various branches are compressed into a fixed-size compact representation.
arXiv Detail & Related papers (2023-02-27T17:27:45Z) - Vision Transformer with Super Token Sampling [93.70963123497327]
Vision transformer has achieved impressive performance for many vision tasks.
It may suffer from high redundancy in capturing local features in shallow layers.
Super tokens attempt to provide a semantically meaningful tessellation of visual content.
arXiv Detail & Related papers (2022-11-21T03:48:13Z) - Improving Visual Quality of Image Synthesis by A Token-based Generator
with Transformers [51.581926074686535]
We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem.
The proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks.
arXiv Detail & Related papers (2021-11-05T12:57:50Z) - DynamicViT: Efficient Vision Transformers with Dynamic Token
Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves the throughput by over 40% (a hedged sketch of this score-and-prune idea follows this list).
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.