TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer
- URL: http://arxiv.org/abs/2211.10705v5
- Date: Thu, 10 Aug 2023 09:27:44 GMT
- Title: TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer
- Authors: Zhiyang Dou, Qingxuan Wu, Cheng Lin, Zeyu Cao, Qiangqiang Wu, Weilin
Wan, Taku Komura, Wenping Wang
- Abstract summary: We introduce a set of simple yet effective TOken REduction strategies for Transformer-based Human Mesh Recovery from monocular images.
We propose token reduction strategies based on two important aspects, i.e., the 3D geometry structure and the 2D image features.
Our method massively reduces the number of tokens involved in high-complexity interactions in the Transformer.
- Score: 34.46696132157042
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we introduce a set of simple yet effective TOken REduction
(TORE) strategies for Transformer-based Human Mesh Recovery from monocular
images. Current SOTA performance is achieved by Transformer-based structures.
However, they suffer from high model complexity and computation cost caused by
redundant tokens. We propose token reduction strategies based on two important
aspects, i.e., the 3D geometry structure and the 2D image features, where we
hierarchically recover the mesh geometry with priors from body structure and
conduct token clustering to pass fewer but more discriminative image feature
tokens to the Transformer. Our method massively reduces the number of tokens
involved in high-complexity interactions in the Transformer. This leads to a
significantly reduced computational cost while still achieving competitive or
even higher accuracy in shape recovery. Extensive experiments across a wide
range of benchmarks validate the superior effectiveness of the proposed method.
We further demonstrate the generalizability of our method on hand mesh
recovery. Visit our project page at
https://frank-zy-dou.github.io/projects/Tore/index.html.
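To make the token-reduction idea concrete, here is a minimal PyTorch sketch of the second ingredient described above: clustering many backbone image-feature tokens into a few cluster tokens before they enter the costly Transformer blocks. All module names, shapes, and the soft-assignment rule are illustrative assumptions, not the authors' TORE implementation.
```python
# Minimal, illustrative sketch (assumed names/shapes, not the TORE code):
# cluster N backbone feature tokens into K << N cluster tokens, so that only
# the K reduced tokens take part in quadratic self-attention.
import torch
import torch.nn as nn


class TokenClusterReducer(nn.Module):
    """Reduce (B, N, D) feature tokens to (B, K, D) by soft cluster assignment."""

    def __init__(self, dim: int, num_clusters: int):
        super().__init__()
        # Learnable cluster centers -- a hypothetical design choice for this sketch.
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Similarity of every token to every cluster center: (B, N, K).
        logits = tokens @ self.centers.t() / tokens.shape[-1] ** 0.5
        # Normalise over the N tokens so each cluster is a weighted average of tokens.
        assign = logits.softmax(dim=1)
        return assign.transpose(1, 2) @ tokens  # (B, K, D)


B, N, D, K = 2, 196, 256, 32
backbone_tokens = torch.randn(B, N, D)      # stand-in for 2D image features
few_tokens = TokenClusterReducer(D, K)(backbone_tokens)

# Only K tokens enter the Transformer, so attention cost drops from O(N^2) to O(K^2).
layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
print(encoder(few_tokens).shape)            # torch.Size([2, 32, 256])
```
The reduction ratio N/K is the knob that trades computation (self-attention scales quadratically with the token count) against how much image detail survives the clustering.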
Related papers
- Enhancing 3D Transformer Segmentation Model for Medical Image with Token-level Representation Learning [9.896550384001348]
This work proposes a token-level representation learning loss that maximizes the agreement between embeddings of each individual token across different augmented views.
We also invent a simple "rotate-and-restore" mechanism, which rotates and flips one augmented view of the input volume and later restores the order of tokens in the feature maps.
We test our pre-training scheme on two public medical segmentation datasets, and the results on the downstream segmentation task show that our method yields larger improvements than other state-of-the-art pre-training methods.
arXiv Detail & Related papers (2024-08-12T01:49:13Z) - Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation [73.31524865643709]
We present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D pose estimation from videos.
Our HoT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens, leaving only a few pose tokens in the intermediate Transformer blocks.
Our method can achieve both high efficiency and estimation accuracy compared to the original VPT models.
arXiv Detail & Related papers (2023-11-20T18:59:51Z) - PPT: Token Pruning and Pooling for Efficient Vision Transformers [7.792045532428676]
We propose a novel acceleration framework, namely token Pruning & Pooling Transformers (PPT).
PPT integrates both token pruning and token pooling techniques in ViTs without additional trainable parameters.
It reduces FLOPs by over 37% and improves throughput by over 45% for DeiT-S without any accuracy drop on the ImageNet dataset (a generic token-pruning sketch in this spirit appears after this list of related papers).
arXiv Detail & Related papers (2023-10-03T05:55:11Z) - CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - Dual Vision Transformer [114.1062057736447]
We propose a novel Transformer architecture that aims to mitigate the cost issue, named Dual Vision Transformer (Dual-ViT)
The new architecture incorporates a critical semantic pathway that can more efficiently compress token vectors into global semantics with reduced order of complexity.
We empirically demonstrate that Dual-ViT delivers superior accuracy to SOTA Transformer architectures with reduced training complexity.
arXiv Detail & Related papers (2022-07-11T16:03:44Z) - A Fast Post-Training Pruning Framework for Transformers [74.59556951906468]
Pruning is an effective way to reduce the huge inference cost of large Transformer models.
Prior work on model pruning requires retraining the model.
We propose a fast post-training pruning framework for Transformers that does not require any retraining.
arXiv Detail & Related papers (2022-03-29T07:41:11Z) - AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z) - Sliced Recursive Transformer [23.899076070924153]
Recursive operation on vision transformers can improve parameter utilization without involving additional parameters.
Our model Sliced Recursive Transformer (SReT) is compatible with a broad range of other designs for efficient vision transformers.
arXiv Detail & Related papers (2021-11-09T17:59:14Z) - nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z) - Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT) that combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z)
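Several entries above (HoT, PPT, AdaViT) share a common ingredient: score tokens, keep the most informative ones, and let only those reach the later Transformer blocks. The following is a generic top-k pruning step under assumed names and a placeholder scoring rule, not a reproduction of any of those methods.
```python
# Generic top-k token pruning -- the shared ingredient behind several entries
# above (HoT, PPT, AdaViT). The scoring rule and all names are illustrative
# assumptions; none of those papers is reproduced here.
import torch


def prune_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.6):
    """Keep the highest-scoring tokens.

    tokens: (B, N, D) token embeddings.
    scores: (B, N) per-token importance (e.g. attention mass from the class token).
    Returns (B, K, D) kept tokens and their indices, K = round(keep_ratio * N).
    """
    B, N, D = tokens.shape
    k = max(1, int(round(keep_ratio * N)))
    keep_idx = scores.topk(k, dim=1).indices
    keep_idx = keep_idx.sort(dim=1).values        # preserve original token order
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(B, k, D))
    return kept, keep_idx


tokens = torch.randn(2, 197, 384)    # e.g. DeiT-S: 196 patch tokens + class token
scores = torch.rand(2, 197)          # stand-in importance scores
kept, idx = prune_tokens(tokens, scores)
print(kept.shape)                    # torch.Size([2, 118, 384])
```
As the summaries above indicate, the cited methods go further than a single hard drop: PPT pools rather than discards inattentive tokens, and HoT recovers full-length tokens at the end so the output resolution is unchanged.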