Efficient Vision Transformer for Human Pose Estimation via Patch
Selection
- URL: http://arxiv.org/abs/2306.04225v2
- Date: Wed, 22 Nov 2023 12:35:08 GMT
- Title: Efficient Vision Transformer for Human Pose Estimation via Patch
Selection
- Authors: Kaleab A. Kinfu and Rene Vidal
- Abstract summary: Vision Transformers (ViTs) have emerged as a promising alternative to CNNs, boosting state-of-the-art performance.
We propose three methods for reducing ViT's computational complexity, which are based on selecting and processing a small number of the most informative patches.
Our proposed methods achieve a significant reduction in computational complexity, ranging from 30% to 44%, with only a minimal drop in accuracy between 0% and 3.5%.
- Score: 1.450405446885067
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Convolutional Neural Networks (CNNs) have been widely successful in 2D
human pose estimation, Vision Transformers (ViTs) have emerged as a promising
alternative to CNNs, boosting state-of-the-art performance. However, the
quadratic computational complexity of ViTs has limited their applicability for
processing high-resolution images. In this paper, we propose three methods for
reducing ViT's computational complexity, which are based on selecting and
processing a small number of the most informative patches while disregarding
others. The first two methods leverage a lightweight pose estimation network to
guide the patch selection process, while the third method utilizes a set of
learnable joint tokens to ensure that the selected patches contain the most
important information about body joints. Experiments across six benchmarks show
that our proposed methods achieve a significant reduction in computational
complexity, ranging from 30% to 44%, with only a minimal drop in accuracy
between 0% and 3.5%.
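To make the patch-selection idea concrete, the following is a minimal sketch, assuming a hypothetical lightweight linear scorer and a fixed keep ratio; it illustrates the general top-k selection pattern, not any of the paper's three specific methods.

    import torch
    import torch.nn as nn

    class PatchSelector(nn.Module):
        """Keep only the top-k highest-scoring patch tokens."""
        def __init__(self, dim: int, keep_ratio: float = 0.6):
            super().__init__()
            self.score = nn.Linear(dim, 1)   # hypothetical lightweight scorer
            self.keep_ratio = keep_ratio

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: (batch, num_patches, dim)
            b, n, d = tokens.shape
            k = max(1, int(n * self.keep_ratio))
            scores = self.score(tokens).squeeze(-1)   # (batch, num_patches)
            idx = scores.topk(k, dim=1).indices       # indices of top-k patches
            idx = idx.unsqueeze(-1).expand(-1, -1, d)
            return tokens.gather(1, idx)              # (batch, k, dim)

    tokens = torch.randn(2, 196, 768)     # 14x14 patches, ViT-B embedding size
    kept = PatchSelector(768)(tokens)     # later ViT blocks see ~60% of patches

Dropping patches before the transformer stack is what yields the savings: attention cost scales with the square of the number of retained tokens.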
Related papers
- SegStitch: Multidimensional Transformer for Robust and Efficient Medical Imaging Segmentation [15.811141677039224]
State-of-the-art methods, particularly those utilizing transformers, have been prominently adopted in 3D semantic segmentation.
However, plain vision transformers encounter challenges due to their neglect of local features and their high computational complexity.
We propose SegStitch, an innovative architecture that integrates transformers with denoising ODE blocks.
arXiv Detail & Related papers (2024-08-01T12:05:02Z)
- Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation [73.31524865643709]
We present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D pose estimation from videos.
Our HoT begins with pruning the pose tokens of redundant frames and ends with recovering the full-length tokens, so that only a few pose tokens remain in the intermediate transformer blocks.
Compared to the original video pose transformer (VPT) models, our method achieves both high efficiency and high estimation accuracy.
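The prune-then-recover pattern can be sketched as follows; the norm-based frame scoring and the linear-interpolation recovery are placeholder choices, not HoT's actual selection and recovery modules.

    import torch
    import torch.nn.functional as F

    def prune_then_recover(pose_tokens: torch.Tensor, keep: int) -> torch.Tensor:
        # pose_tokens: (batch, frames, dim) pose tokens from a video clip
        b, f, d = pose_tokens.shape
        scores = pose_tokens.norm(dim=-1)             # placeholder importance
        idx = scores.topk(keep, dim=1).indices.sort(dim=1).values
        pruned = pose_tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
        # ... the intermediate transformer blocks would run on `pruned` ...
        # Recover full temporal length (stand-in for HoT's recovery module).
        return F.interpolate(pruned.transpose(1, 2), size=f,
                             mode="linear", align_corners=False).transpose(1, 2)

    out = prune_then_recover(torch.randn(2, 81, 256), keep=27)   # (2, 81, 256)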
arXiv Detail & Related papers (2023-11-20T18:59:51Z)
- Data-Side Efficiencies for Lightweight Convolutional Neural Networks [4.5853328688992905]
We show how four data attributes - number of classes, object color, image resolution, and object scale - affect neural network model size and efficiency.
We provide an example that applies these metrics and methods to choose a lightweight model for a robot path-planning application.
arXiv Detail & Related papers (2023-08-24T19:50:25Z)
- ConcatPlexer: Additional Dim1 Batching for Faster ViTs [31.239412320401467]
We propose a novel approach for efficient visual recognition that employs additional dim1 batching (i.e., concatenating inputs along the token dimension).
We first introduce a naive adaptation of DataMux for vision models, Image Multiplexer, and devise novel inference to overcome its weaknesses.
ConcatPlexer was trained on the ImageNet1K and CIFAR100 datasets, achieving 23.5% fewer GFLOPs than ViT-B/16 with 69.5% and 83.4% accuracy, respectively.
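A minimal sketch of the dim1-batching idea, assuming arbitrary token counts and an off-the-shelf encoder rather than the ConcatPlexer architecture: two images' token sequences are concatenated along dimension 1 so that one transformer pass serves both inputs.

    import torch
    import torch.nn as nn

    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
        num_layers=2,
    )
    tok_a = torch.randn(8, 98, 768)   # compressed tokens of image batch A
    tok_b = torch.randn(8, 98, 768)   # compressed tokens of image batch B
    fused = torch.cat([tok_a, tok_b], dim=1)         # (8, 196, 768)
    out_a, out_b = encoder(fused).split(98, dim=1)   # de-multiplex outputs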
arXiv Detail & Related papers (2023-08-22T05:21:31Z)
- Point-Voxel Transformer: An Efficient Approach To 3D Deep Learning [5.236787242129767]
We present a novel 3D Transformer, called Point-Voxel Transformer (PVT), that leverages self-attention computation on points to gather global context features.
Our method fully exploits the potential of the Transformer architecture, paving the way to efficient and accurate recognition.
arXiv Detail & Related papers (2021-08-13T06:07:57Z)
- Sample and Computation Redistribution for Efficient Face Detection [137.19388513633484]
Training data sampling and computation distribution strategies are the keys to efficient and accurate face detection.
SCRFD-34GF outperforms the best competitor, TinaFace, by 3.86% (AP on the hard set) while being more than 3x faster on GPUs with VGA-resolution images.
arXiv Detail & Related papers (2021-05-10T23:51:14Z)
- Inception Convolution with Efficient Dilation Search [121.41030859447487]
Dilated convolution is a critical variant of the standard convolutional neural network for controlling effective receptive fields and handling the large scale variance of objects.
We propose a new variant of dilated convolution, namely inception (dilated) convolution, where the convolutions have independent dilations along different axes, channels, and layers.
To fit the complex inception convolution to the data, we develop a simple yet effective dilation search algorithm (EDO) based on statistical optimization.
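A minimal sketch of inception (dilated) convolution, assuming hand-picked dilation pairs per channel group; in the paper these dilations are found by the EDO search rather than fixed by hand.

    import torch
    import torch.nn as nn

    class InceptionDilatedConv(nn.Module):
        # Each channel group gets its own (height, width) dilation pair;
        # padding equal to the dilation keeps the spatial size for 3x3 kernels.
        def __init__(self, channels: int,
                     dilations=((1, 1), (1, 2), (2, 1), (2, 2))):
            super().__init__()
            group = channels // len(dilations)
            self.branches = nn.ModuleList(
                nn.Conv2d(channels, group, kernel_size=3,
                          dilation=d, padding=d)
                for d in dilations
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.cat([b(x) for b in self.branches], dim=1)

    y = InceptionDilatedConv(64)(torch.randn(1, 64, 32, 32))  # (1, 64, 32, 32)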
arXiv Detail & Related papers (2020-12-25T14:58:35Z)
- Displacement-Invariant Cost Computation for Efficient Stereo Matching [122.94051630000934]
Deep learning methods have dominated stereo matching leaderboards by yielding unprecedented disparity accuracy.
But their inference time is typically slow, on the order of seconds for a pair of 540p images.
We propose a displacement-invariant cost module to compute the matching costs without needing a 4D feature volume.
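The idea can be sketched as one shared 2D network scoring every disparity-shifted feature pair, producing a stack of 2D cost slices rather than a 4D feature volume; the roll-based shift and tiny cost network below are simplifications, not the paper's modules.

    import torch
    import torch.nn as nn

    cost_net = nn.Sequential(            # shared 2D matching network (stand-in)
        nn.Conv2d(2 * 32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 1, 3, padding=1),
    )

    def matching_costs(feat_l, feat_r, max_disp: int):
        # feat_l, feat_r: (batch, 32, H, W) left/right image features
        slices = []
        for d in range(max_disp):
            shifted = torch.roll(feat_r, shifts=d, dims=3)  # wrap-around shift, for brevity
            slices.append(cost_net(torch.cat([feat_l, shifted], dim=1)))
        return torch.cat(slices, dim=1)                     # (batch, max_disp, H, W)

    costs = matching_costs(torch.randn(1, 32, 60, 80),
                           torch.randn(1, 32, 60, 80), max_disp=16)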
arXiv Detail & Related papers (2020-12-01T23:58:16Z)
- Reinforced Axial Refinement Network for Monocular 3D Object Detection [160.34246529816085]
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image.
Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them; however, the probability of effective samples is relatively small in 3D space.
We propose to start with an initial prediction and refine it gradually towards the ground truth, changing only one 3D parameter in each step.
This requires designing a policy which gets a reward after several steps, and thus we adopt reinforcement learning to optimize it.
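A toy sketch of the refinement loop, assuming a 7-parameter box (x, y, z, w, h, l, yaw), a fixed step size, and an untrained placeholder policy; the paper's policy is learned from a delayed reward.

    import torch
    import torch.nn as nn

    STEP = 0.1
    policy = nn.Sequential(nn.Linear(7, 64), nn.ReLU(), nn.Linear(64, 14))

    def refine(box: torch.Tensor, steps: int = 10) -> torch.Tensor:
        # box: (7,) initial 3D box prediction; one parameter changes per step
        for _ in range(steps):
            a = policy(box).argmax().item()   # 14 actions: 7 params x {+, -}
            param, sign = divmod(a, 2)
            box = box.clone()
            box[param] += STEP if sign == 0 else -STEP
        return box

    refined = refine(torch.zeros(7))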
arXiv Detail & Related papers (2020-08-31T17:10:48Z)
- Human Body Model Fitting by Learned Gradient Descent [48.79414884222403]
We propose a novel algorithm for fitting 3D human shape to images.
We show that this algorithm is fast (converging in 120ms on average), robust across datasets, and achieves state-of-the-art results on public evaluation datasets.
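The learned-gradient-descent step can be sketched as a network that maps the current parameters and the energy gradient to a parameter update; the 72-dimensional pose vector and the toy quadratic energy below are illustrative assumptions.

    import torch
    import torch.nn as nn

    update_net = nn.Sequential(          # learned update rule (untrained here)
        nn.Linear(2 * 72, 256), nn.ReLU(), nn.Linear(256, 72))

    def fit(params: torch.Tensor, energy, iters: int = 5) -> torch.Tensor:
        # params: (72,), e.g. an SMPL-style pose vector (illustrative size)
        for _ in range(iters):
            p = params.detach().requires_grad_(True)
            grad = torch.autograd.grad(energy(p), p)[0]
            params = p.detach() + update_net(torch.cat([p.detach(), grad]))
        return params

    fitted = fit(torch.zeros(72), energy=lambda p: (p - 1.0).pow(2).sum())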
arXiv Detail & Related papers (2020-08-19T14:26:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.