Efficient Vision Transformer for Human Pose Estimation via Patch
Selection
- URL: http://arxiv.org/abs/2306.04225v2
- Date: Wed, 22 Nov 2023 12:35:08 GMT
- Title: Efficient Vision Transformer for Human Pose Estimation via Patch
Selection
- Authors: Kaleab A. Kinfu and Rene Vidal
- Abstract summary: Vision Transformers (ViTs) have emerged as a promising alternative to CNNs, boosting state-of-the-art performance.
We propose three methods for reducing ViT's computational complexity, which are based on selecting and processing a small number of the most informative patches.
Our proposed methods achieve a significant reduction in computational complexity, ranging from 30% to 44%, with only a minimal drop in accuracy between 0% and 3.5%.
- Score: 1.450405446885067
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Convolutional Neural Networks (CNNs) have been widely successful in 2D
human pose estimation, Vision Transformers (ViTs) have emerged as a promising
alternative to CNNs, boosting state-of-the-art performance. However, the
quadratic computational complexity of ViTs has limited their applicability for
processing high-resolution images. In this paper, we propose three methods for
reducing ViT's computational complexity, which are based on selecting and
processing a small number of the most informative patches while disregarding
others. The first two methods leverage a lightweight pose estimation network to
guide the patch selection process, while the third method utilizes a set of
learnable joint tokens to ensure that the selected patches contain the most
important information about body joints. Experiments across six benchmarks show
that our proposed methods achieve a significant reduction in computational
complexity, ranging from 30% to 44%, with only a minimal drop in accuracy
between 0% and 3.5%.
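To make the patch-selection idea concrete, the following is a minimal sketch, assuming a hypothetical lightweight linear scorer and a fixed keep ratio; it illustrates the general top-k selection pattern, not any of the paper's three specific methods.

    import torch
    import torch.nn as nn

    class PatchSelector(nn.Module):
        """Keep only the top-k highest-scoring patch tokens."""
        def __init__(self, dim: int, keep_ratio: float = 0.6):
            super().__init__()
            self.score = nn.Linear(dim, 1)   # hypothetical lightweight scorer
            self.keep_ratio = keep_ratio

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: (batch, num_patches, dim)
            b, n, d = tokens.shape
            k = max(1, int(n * self.keep_ratio))
            scores = self.score(tokens).squeeze(-1)   # (batch, num_patches)
            idx = scores.topk(k, dim=1).indices       # indices of top-k patches
            idx = idx.unsqueeze(-1).expand(-1, -1, d)
            return tokens.gather(1, idx)              # (batch, k, dim)

    tokens = torch.randn(2, 196, 768)     # 14x14 patches, ViT-B embedding size
    kept = PatchSelector(768)(tokens)     # later ViT blocks see ~60% of patches

Dropping patches before the transformer stack is what yields the savings: attention cost scales with the square of the number of retained tokens.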
Related papers
- SegStitch: Multidimensional Transformer for Robust and Efficient Medical Imaging Segmentation [15.811141677039224]
State-of-the-art methods, particularly those utilizing transformers, have been prominently adopted in 3D semantic segmentation.
However, plain vision transformers encounter challenges due to their neglect of local features and their high computational complexity.
We propose SegStitch, an innovative architecture that integrates transformers with denoising ODE blocks.
arXiv Detail & Related papers (2024-08-01T12:05:02Z)
- Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation [73.31524865643709]
We present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D pose estimation from videos.
Our HoT begins with pruning the pose tokens of redundant frames and ends with recovering the full-length tokens, so that only a few pose tokens remain in the intermediate transformer blocks.
Compared to the original video pose transformer (VPT) models, our method achieves both high efficiency and high estimation accuracy.
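The prune-then-recover pattern can be sketched as follows; the norm-based frame scoring and the linear-interpolation recovery are placeholder choices, not HoT's actual selection and recovery modules.

    import torch
    import torch.nn.functional as F

    def prune_then_recover(pose_tokens: torch.Tensor, keep: int) -> torch.Tensor:
        # pose_tokens: (batch, frames, dim) pose tokens from a video clip
        b, f, d = pose_tokens.shape
        scores = pose_tokens.norm(dim=-1)             # placeholder importance
        idx = scores.topk(keep, dim=1).indices.sort(dim=1).values
        pruned = pose_tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
        # ... the intermediate transformer blocks would run on `pruned` ...
        # Recover full temporal length (stand-in for HoT's recovery module).
        return F.interpolate(pruned.transpose(1, 2), size=f,
                             mode="linear", align_corners=False).transpose(1, 2)

    out = prune_then_recover(torch.randn(2, 81, 256), keep=27)   # (2, 81, 256)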
arXiv Detail & Related papers (2023-11-20T18:59:51Z)
- Data-Side Efficiencies for Lightweight Convolutional Neural Networks [4.5853328688992905]
We show how four data attributes - number of classes, object color, image resolution, and object scale - affect neural network model size and efficiency.
We provide an example that applies these metrics and methods to choose a lightweight model for a robot path-planning application.
arXiv Detail & Related papers (2023-08-24T19:50:25Z)
- ConcatPlexer: Additional Dim1 Batching for Faster ViTs [31.239412320401467]
We propose a novel approach for efficient visual recognition that employs additional dim1 batching (i.e., concatenating inputs along the token dimension).
We first introduce a naive adaptation of DataMux for vision models, Image Multiplexer, and devise novel inference to overcome its weaknesses.
ConcatPlexer was trained on the ImageNet1K and CIFAR100 datasets, achieving 23.5% fewer GFLOPs than ViT-B/16 with 69.5% and 83.4% accuracy, respectively.
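A minimal sketch of the dim1-batching idea, assuming arbitrary token counts and an off-the-shelf encoder rather than the ConcatPlexer architecture: two images' token sequences are concatenated along dimension 1 so that one transformer pass serves both inputs.

    import torch
    import torch.nn as nn

    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
        num_layers=2,
    )
    tok_a = torch.randn(8, 98, 768)   # compressed tokens of image batch A
    tok_b = torch.randn(8, 98, 768)   # compressed tokens of image batch B
    fused = torch.cat([tok_a, tok_b], dim=1)         # (8, 196, 768)
    out_a, out_b = encoder(fused).split(98, dim=1)   # de-multiplex outputs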
arXiv Detail & Related papers (2023-08-22T05:21:31Z)
- Point-Voxel Transformer: An Efficient Approach To 3D Deep Learning [5.236787242129767]
We present a novel 3D Transformer, called Point-Voxel Transformer (PVT), that leverages self-attention computation on points to gather global context features.
Our method fully exploits the potential of the Transformer architecture, paving the way to efficient and accurate recognition.
arXiv Detail & Related papers (2021-08-13T06:07:57Z)
- Sample and Computation Redistribution for Efficient Face Detection [137.19388513633484]
Training data sampling and computation distribution strategies are the keys to efficient and accurate face detection.
SCRFD-34GF outperforms the best competitor, TinaFace, by 3.86% (AP on the hard set) while being more than 3x faster on GPUs with VGA-resolution images.
arXiv Detail & Related papers (2021-05-10T23:51:14Z)
- Inception Convolution with Efficient Dilation Search [121.41030859447487]
Dilated convolution is a critical variant of the standard convolutional neural network for controlling effective receptive fields and handling the large scale variance of objects.
We propose a new variant of dilated convolution, namely inception (dilated) convolution, where the convolutions have independent dilations along different axes, channels, and layers.
To fit the complex inception convolution to the data, we develop a simple yet effective dilation search algorithm (EDO) based on statistical optimization.
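A minimal sketch of inception (dilated) convolution, assuming hand-picked dilation pairs per channel group; in the paper these dilations are found by the EDO search rather than fixed by hand.

    import torch
    import torch.nn as nn

    class InceptionDilatedConv(nn.Module):
        # Each channel group gets its own (height, width) dilation pair;
        # padding equal to the dilation keeps the spatial size for 3x3 kernels.
        def __init__(self, channels: int,
                     dilations=((1, 1), (1, 2), (2, 1), (2, 2))):
            super().__init__()
            group = channels // len(dilations)
            self.branches = nn.ModuleList(
                nn.Conv2d(channels, group, kernel_size=3,
                          dilation=d, padding=d)
                for d in dilations
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.cat([b(x) for b in self.branches], dim=1)

    y = InceptionDilatedConv(64)(torch.randn(1, 64, 32, 32))  # (1, 64, 32, 32)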
arXiv Detail & Related papers (2020-12-25T14:58:35Z)
- Displacement-Invariant Cost Computation for Efficient Stereo Matching [122.94051630000934]
Deep learning methods have dominated stereo matching leaderboards by yielding unprecedented disparity accuracy.
But their inference time is typically slow, on the order of seconds for a pair of 540p images.
We propose a displacement-invariant cost module to compute the matching costs without needing a 4D feature volume.
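The idea can be sketched as one shared 2D network scoring every disparity-shifted feature pair, producing a stack of 2D cost slices rather than a 4D feature volume; the roll-based shift and tiny cost network below are simplifications, not the paper's modules.

    import torch
    import torch.nn as nn

    cost_net = nn.Sequential(            # shared 2D matching network (stand-in)
        nn.Conv2d(2 * 32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 1, 3, padding=1),
    )

    def matching_costs(feat_l, feat_r, max_disp: int):
        # feat_l, feat_r: (batch, 32, H, W) left/right image features
        slices = []
        for d in range(max_disp):
            shifted = torch.roll(feat_r, shifts=d, dims=3)  # wrap-around shift, for brevity
            slices.append(cost_net(torch.cat([feat_l, shifted], dim=1)))
        return torch.cat(slices, dim=1)                     # (batch, max_disp, H, W)

    costs = matching_costs(torch.randn(1, 32, 60, 80),
                           torch.randn(1, 32, 60, 80), max_disp=16)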
arXiv Detail & Related papers (2020-12-01T23:58:16Z)
- Reinforced Axial Refinement Network for Monocular 3D Object Detection [160.34246529816085]
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image.
Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them; however, the probability of effective samples is relatively small in 3D space.
We propose to start with an initial prediction and refine it gradually towards the ground truth, changing only one 3D parameter in each step.
This requires designing a policy which gets a reward after several steps, and thus we adopt reinforcement learning to optimize it.
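A toy sketch of the refinement loop, assuming a 7-parameter box (x, y, z, w, h, l, yaw), a fixed step size, and an untrained placeholder policy; the paper's policy is learned from a delayed reward.

    import torch
    import torch.nn as nn

    STEP = 0.1
    policy = nn.Sequential(nn.Linear(7, 64), nn.ReLU(), nn.Linear(64, 14))

    def refine(box: torch.Tensor, steps: int = 10) -> torch.Tensor:
        # box: (7,) initial 3D box prediction; one parameter changes per step
        for _ in range(steps):
            a = policy(box).argmax().item()   # 14 actions: 7 params x {+, -}
            param, sign = divmod(a, 2)
            box = box.clone()
            box[param] += STEP if sign == 0 else -STEP
        return box

    refined = refine(torch.zeros(7))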
arXiv Detail & Related papers (2020-08-31T17:10:48Z)
- Human Body Model Fitting by Learned Gradient Descent [48.79414884222403]
We propose a novel algorithm for fitting 3D human shape to images.
We show that this algorithm is fast (converging in 120ms on average), robust across datasets, and achieves state-of-the-art results on public evaluation datasets.
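The learned-gradient-descent step can be sketched as a network that maps the current parameters and the energy gradient to a parameter update; the 72-dimensional pose vector and the toy quadratic energy below are illustrative assumptions.

    import torch
    import torch.nn as nn

    update_net = nn.Sequential(          # learned update rule (untrained here)
        nn.Linear(2 * 72, 256), nn.ReLU(), nn.Linear(256, 72))

    def fit(params: torch.Tensor, energy, iters: int = 5) -> torch.Tensor:
        # params: (72,), e.g. an SMPL-style pose vector (illustrative size)
        for _ in range(iters):
            p = params.detach().requires_grad_(True)
            grad = torch.autograd.grad(energy(p), p)[0]
            params = p.detach() + update_net(torch.cat([p.detach(), grad]))
        return params

    fitted = fit(torch.zeros(72), energy=lambda p: (p - 1.0).pow(2).sum())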
arXiv Detail & Related papers (2020-08-19T14:26:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.