GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation
- URL: http://arxiv.org/abs/2407.10756v2
- Date: Tue, 16 Jul 2024 14:32:21 GMT
- Title: GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation
- Authors: Haonan Wang, Jie Liu, Jie Tang, Gangshan Wu, Bo Xu, Yanbing Chou, Yong Wang
- Abstract summary: Group-based Token Pruning Transformer (GTPT) for efficient human pose estimation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, 2D human pose estimation has made significant progress on public benchmarks. However, many of these approaches see limited industrial adoption because of their large parameter counts and computational overhead. Efficient human pose estimation remains a hurdle, especially for whole-body pose estimation with numerous keypoints. While most current methods for efficient human pose estimation rely primarily on CNNs, we propose the Group-based Token Pruning Transformer (GTPT), which fully harnesses the advantages of the Transformer. GTPT alleviates the computational burden by introducing keypoints gradually in a coarse-to-fine manner, minimizing computational overhead while ensuring high performance. Besides, GTPT groups keypoint tokens and prunes visual tokens to improve model performance while reducing redundancy. We propose Multi-Head Group Attention (MHGA) between different groups to achieve global interaction with little computational overhead. We conducted experiments on COCO and COCO-WholeBody. Compared to other methods, the experimental results show that GTPT achieves higher performance with less computation, especially for whole-body pose estimation with its numerous keypoints.
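The grouping idea in the abstract can be illustrated with a minimal sketch: keypoint tokens are partitioned into groups and self-attention is computed within each group, so the attention cost shrinks from N^2 to the sum of squared group sizes. This is an illustrative NumPy sketch under assumed shapes, not the authors' implementation; the learned projections, multi-head split, visual-token pruning, and MHGA's cross-group interaction are all omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def group_attention(tokens, groups, d):
    """Self-attention computed independently within each token group.

    tokens: (N, d) array of keypoint tokens.
    groups: list of index arrays partitioning 0..N-1 (e.g. head/arms/legs).
    Cost is sum of g_i^2 per group instead of N^2 for full attention.
    """
    out = np.zeros_like(tokens)
    for idx in groups:
        q = k = v = tokens[idx]            # shared projections omitted for brevity
        scores = q @ k.T / np.sqrt(d)      # (g, g) instead of (N, N)
        out[idx] = softmax(scores) @ v
    return out
```

Because each group attends only to itself, perturbing a token in one group leaves the outputs of every other group unchanged, which is exactly the redundancy reduction the grouping exploits.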
Related papers
- ProgRoCC: A Progressive Approach to Rough Crowd Counting [66.09510514180593]
We introduce Rough Crowd Counting, a task that delivers better accuracy on the basis of training data that is easier to acquire.
We propose an approach to the rough crowd counting problem based on CLIP, termed ProgRoCC.
Specifically, we introduce a progressive estimation learning strategy that determines the object count through a coarse-to-fine approach.
arXiv Detail & Related papers (2025-04-18T01:57:42Z) - TrackFormers: In Search of Transformer-Based Particle Tracking for the High-Luminosity LHC Era [2.9052912091435923]
High-Energy Physics experiments are facing a multi-fold data increase with every new iteration.
One such step in need of an overhaul is the task of particle track reconstruction, a.k.a., tracking.
A Machine Learning-assisted solution is expected to provide significant improvements.
arXiv Detail & Related papers (2024-07-09T18:47:25Z) - A Manifold Representation of the Key in Vision Transformers [8.938418994111716]
This paper explores the concept of disentangling the key from the query and value, and adopting a manifold representation for the key.
Our experiments reveal that decoupling and endowing the key with a manifold structure can enhance the model's performance.
arXiv Detail & Related papers (2024-02-01T12:01:43Z) - MDPose: Real-Time Multi-Person Pose Estimation via Mixture Density Model [27.849059115252008]
We propose a novel framework of single-stage instance-aware pose estimation by modeling the joint distribution of human keypoints.
Our MDPose achieves state-of-the-art performance by successfully learning the high-dimensional joint distribution of human keypoints.
arXiv Detail & Related papers (2023-02-17T08:29:33Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision
Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
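The cluster-then-aggregate step described above can be sketched as follows: keys are clustered with a few k-means iterations, values are averaged per key cluster, and queries then attend over the centroids, shrinking the attention matrix from (N_q, N_k) to (N_q, num_clusters). A hedged NumPy sketch, not the paper's implementation; the plain k-means and mean-pooling aggregation here are simplifying assumptions.

```python
import numpy as np

def clustered_attention(q, k, v, num_clusters, iters=10, seed=0):
    """Queries attend to k-means centroids of the keys; values are
    averaged per key cluster, reducing the total token count."""
    rng = np.random.default_rng(seed)
    centers = k[rng.choice(len(k), num_clusters, replace=False)].copy()
    for _ in range(iters):                               # a few k-means steps
        assign = np.argmin(((k[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(num_clusters):
            if (assign == j).any():
                centers[j] = k[assign == j].mean(0)
    # aggregate the values belonging to each key cluster
    vc = np.stack([v[assign == j].mean(0) if (assign == j).any()
                   else np.zeros(v.shape[1]) for j in range(num_clusters)])
    scores = q @ centers.T / np.sqrt(q.shape[1])         # (Nq, num_clusters)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return w @ vc
```

The output has one row per query, but every score row is only `num_clusters` wide, which is where the computational saving comes from.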
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point
Cloud Learning [81.85951026033787]
We adopt transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - On the Eigenvalues of Global Covariance Pooling for Fine-grained Visual
Recognition [65.67315418971688]
We show that truncating small eigenvalues of the Global Covariance Pooling (GCP) can yield smoother gradients.
On fine-grained datasets, however, truncating the small eigenvalues makes the model fail to converge.
Inspired by this observation, we propose a network branch dedicated to magnifying the importance of small eigenvalues.
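The truncation operation at the heart of this observation can be sketched directly: pool local features into a covariance matrix, eigendecompose it, and zero out eigenvalues below a threshold before rebuilding the matrix. A minimal NumPy sketch under assumed shapes; the threshold `eps` and the hard zeroing rule are illustrative, not the paper's exact scheme.

```python
import numpy as np

def gcp_truncated(features, eps=1e-3):
    """Global covariance pooling with small eigenvalues truncated to zero.

    features: (N, C) matrix of local descriptors; returns the (C, C)
    pooled covariance rebuilt from the surviving eigenvalues only.
    """
    x = features - features.mean(axis=0)
    cov = x.T @ x / (len(x) - 1)
    w, vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    w = np.where(w >= eps, w, 0.0)         # drop eigenvalues below eps
    return (vecs * w) @ vecs.T             # vecs @ diag(w) @ vecs.T
```

Every eigenvalue of the result is either zero or at least `eps`, which is what makes the gradient of the subsequent matrix-power normalization smoother.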
arXiv Detail & Related papers (2022-05-26T11:41:36Z) - Greedy Offset-Guided Keypoint Grouping for Human Pose Estimation [31.468003041368814]
We employ an Hourglass Network to infer all the keypoints from different persons indiscriminately.
We greedily group the candidate keypoints into multiple human poses, utilizing the predicted guiding offsets.
Our approach is comparable to the state of the art on the challenging COCO dataset under fair conditions.
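The grouping step in this summary can be sketched as follows: each candidate keypoint is shifted by its predicted guiding offset and assigned to the nearest person center. A simplified NumPy sketch; the paper's actual greedy procedure (candidate ordering, distance thresholds, per-keypoint-type handling) is more involved.

```python
import numpy as np

def greedy_group(keypoints, offsets, centers):
    """Assign each candidate keypoint to the person center nearest to its
    guided position (keypoint location + predicted guiding offset).

    keypoints, offsets: (K, 2) arrays; centers: (P, 2) array of person roots.
    Returns a (K,) array of person indices.
    """
    guided = keypoints + offsets                            # where each keypoint "points"
    dists = np.linalg.norm(guided[:, None] - centers[None], axis=-1)
    return np.argmin(dists, axis=1)                         # person index per keypoint
```

With well-predicted offsets, keypoints from different people land near their own centers and are grouped into separate poses.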
arXiv Detail & Related papers (2021-07-07T09:32:01Z) - Empirical Evaluation of Pre-trained Transformers for Human-Level NLP:
The Role of Sample Size and Dimensionality [6.540382797747107]
RoBERTa consistently achieves top performance on human-level tasks, with PCA outperforming other dimensionality-reduction methods, particularly for users who write longer texts.
A majority of the tasks achieve results comparable to the best performance with just $\frac{1}{12}$ of the embedding dimensions.
arXiv Detail & Related papers (2021-05-07T20:06:24Z) - Differentiable Multi-Granularity Human Representation Learning for
Instance-Aware Human Semantic Parsing [131.97475877877608]
A new bottom-up regime is proposed to learn category-level human semantic segmentation and multi-person pose estimation in a joint and end-to-end manner.
It is a compact, efficient and powerful framework that exploits structural information over different human granularities.
Experiments on three instance-aware human datasets show that our model outperforms other bottom-up alternatives with much more efficient inference.
arXiv Detail & Related papers (2021-03-08T06:55:00Z) - Differentiable Hierarchical Graph Grouping for Multi-Person Pose
Estimation [95.72606536493548]
Multi-person pose estimation is challenging because it localizes body keypoints for multiple persons simultaneously.
We propose a novel differentiable Hierarchical Graph Grouping (HGG) method to learn the graph grouping in bottom-up multi-person pose estimation task.
arXiv Detail & Related papers (2020-07-23T08:46:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.