GADS: A Super Lightweight Model for Head Pose Estimation
- URL: http://arxiv.org/abs/2504.15751v1
- Date: Tue, 22 Apr 2025 09:53:25 GMT
- Title: GADS: A Super Lightweight Model for Head Pose Estimation
- Authors: Menan Velayuthan, Asiri Gawesha, Purushoth Velayuthan, Nuwan Kodagoda, Dharshana Kasthurirathna, Pradeepa Samarasinghe
- Abstract summary: Grouped Attention Deep Sets (GADS) is a novel architecture based on the Deep Set framework. By grouping landmarks into regions, we reduce computational complexity. Our model is $7.5\times$ smaller and executes $25\times$ faster than the current lightest state-of-the-art model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In human-computer interaction, head pose estimation profoundly influences application functionality. Although utilizing facial landmarks is valuable for this purpose, existing landmark-based methods prioritize precision over simplicity and model size, limiting their deployment on edge devices and in compute-poor environments. To bridge this gap, we propose \textbf{Grouped Attention Deep Sets (GADS)}, a novel architecture based on the Deep Set framework. By grouping landmarks into regions and employing small Deep Set layers, we reduce computational complexity. Our multihead attention mechanism extracts and combines inter-group information, resulting in a model that is $7.5\times$ smaller and executes $25\times$ faster than the current lightest state-of-the-art model. Notably, our method achieves an impressive reduction, being $4321\times$ smaller than the best-performing model. We introduce vanilla GADS and Hybrid-GADS (landmarks + RGB) and evaluate our models on three benchmark datasets -- AFLW2000, BIWI, and 300W-LP. We envision our architecture as a robust baseline for resource-constrained head pose estimation methods.
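The grouped Deep Set idea in the abstract can be illustrated with a minimal sketch. This is a hypothetical toy, not the authors' implementation: the encoder weights, the two-region grouping, and the function names below are all illustrative assumptions. It shows the core property the architecture relies on, that sum pooling over each landmark region yields a permutation-invariant region embedding.

```python
# Toy sketch of the grouped Deep Set idea (illustrative only; the phi
# weights and the grouping below are hypothetical, not the paper's values).

def phi(pt):
    # per-landmark encoder: maps a (x, y) landmark to a 2-d feature
    x, y = pt
    return (0.5 * x + 0.1 * y, 0.2 * x - 0.3 * y)

def pool_group(group):
    # sum pooling makes each region embedding permutation-invariant
    feats = [phi(p) for p in group]
    return tuple(sum(f[i] for f in feats) for i in range(2))

def grouped_embed(groups):
    # concatenate the pooled embeddings of all landmark regions;
    # in GADS these would instead be combined by multihead attention
    out = []
    for g in groups:
        out.extend(pool_group(g))
    return out

eye = [(0.1, 0.2), (0.3, 0.1)]
mouth = [(0.5, 0.9), (0.6, 0.8), (0.4, 0.7)]
emb = grouped_embed([eye, mouth])
shuffled = grouped_embed([eye[::-1], mouth[::-1]])
# shuffled and emb agree up to floating-point rounding: reordering
# landmarks within a region does not change the embedding
```

Because each small Deep Set layer only sees its own region, the per-layer cost scales with the region size rather than the full landmark count, which is where the claimed size and speed reduction comes from.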
Related papers
- Visual Autoregressive Modelling for Monocular Depth Estimation [69.01449528371916]
We propose a monocular depth estimation method based on visual autoregressive (VAR) priors. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism. We report state-of-the-art performance on indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets.
arXiv Detail & Related papers (2025-12-27T17:08:03Z) - A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning [29.682282730123234]
Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. We propose A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks.
arXiv Detail & Related papers (2025-12-16T14:27:47Z) - RoFt-Mol: Benchmarking Robust Fine-Tuning with Molecular Graph Foundation Models [15.62650736139546]
We classify eight fine-tuning methods into three mechanisms: weight-based, representation-based, and partial fine-tuning. We benchmark these methods on downstream regression and classification tasks across supervised and self-supervised pre-trained models in diverse labeling settings. This evaluation provides valuable insights and informs the design of a refined robust fine-tuning method, ROFT-MOL.
arXiv Detail & Related papers (2025-08-30T21:35:57Z) - Scalable Object Detection in the Car Interior With Vision Foundation Models [42.958409172092225]
We propose the novel Object Detection and Localization (ODAL) framework for interior scene understanding. Our approach leverages vision foundation models through a distributed architecture, splitting computational tasks between on-board and cloud resources. To benchmark model performance, we introduce ODALbench, a new metric for comprehensive assessment of detection and localization. Remarkably, our fine-tuned ODAL-LLaVA model achieves an ODAL$_{score}$ of 89%, representing a 71% improvement over its baseline performance and outperforming GPT-4o by nearly 20%.
arXiv Detail & Related papers (2025-08-27T07:58:57Z) - ARMO: Autoregressive Rigging for Multi-Category Objects [8.030479370619458]
We introduce OmniRig, the first large-scale rigging dataset, comprising 79,499 meshes with detailed skeleton and skinning information. Unlike traditional benchmarks that rely on predefined standard poses, our dataset embraces diverse shape categories, styles, and poses. We propose ARMO, a novel rigging framework that utilizes an autoregressive model to predict both joint positions and connectivity relationships in a unified manner.
arXiv Detail & Related papers (2025-03-26T15:56:48Z) - HASSLE-free: A unified Framework for Sparse plus Low-Rank Matrix Decomposition for LLMs [15.575498324678373]
A promising compression scheme is to decompose foundation models' dense weights into a sum of sparse plus low-rank matrices. In this paper, we design a unified framework coined HASSLE-free for (semi-structured) sparse plus low-rank matrix decomposition.
arXiv Detail & Related papers (2025-02-02T20:23:32Z) - SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation [81.36747103102459]
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Current state-of-the-art methods focus on training innovative architectural designs on confined datasets. We investigate the impact of scaling up EHPS towards a family of generalist foundation models.
arXiv Detail & Related papers (2025-01-16T18:59:46Z) - On-Road Object Importance Estimation: A New Dataset and A Model with Multi-Fold Top-Down Guidance [70.80612792049315]
This paper contributes a new large-scale dataset named Traffic Object Importance (TOI).
It proposes a model that integrates multi-fold top-down guidance with the bottom-up feature.
Our model outperforms state-of-the-art methods by large margins.
arXiv Detail & Related papers (2024-11-26T06:37:10Z) - GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image [94.56927147492738]
We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes from single images.
We show that leveraging diffusion priors can markedly improve generalization, detail preservation, and efficiency in resource usage.
We propose a simple yet effective strategy to segregate the complex data distribution of various scenes into distinct sub-distributions.
arXiv Detail & Related papers (2024-03-18T17:50:41Z) - PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation [97.78045712375047]
We present a new efficient model architecture for large language models (LLMs)
We show that PanGu-$\pi$-7B can achieve a comparable performance to that of benchmarks with about 10% inference speed-up.
In addition, we have deployed PanGu-$pi$-7B in the high-value domains of finance and law, developing an LLM named YunShan for practical application.
arXiv Detail & Related papers (2023-12-27T11:49:24Z) - FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects [55.77542145604758]
FoundationPose is a unified foundation model for 6D object pose estimation and tracking.
Our approach can be instantly applied at test-time to a novel object without fine-tuning.
arXiv Detail & Related papers (2023-12-13T18:28:09Z) - Recognize Any Regions [55.76437190434433]
RegionSpot integrates position-aware localization knowledge from a localization foundation model with semantic information from a ViL model. Experiments in open-world object recognition show that our RegionSpot achieves significant performance gain over prior alternatives.
arXiv Detail & Related papers (2023-11-02T16:31:49Z) - How to train your draGAN: A task oriented solution to imbalanced classification [15.893327571516016]
This paper proposes a unique, performance-oriented, data-generating strategy that utilizes a new architecture, coined draGAN.
The samples are generated with the objective of optimizing the classification model's performance, rather than similarity to the real data.
Empirically we show the superiority of draGAN, but also highlight some of its shortcomings.
arXiv Detail & Related papers (2022-11-18T07:37:34Z) - Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition [19.905455701387194]
We present an efficient Graph Convolutional Network (GCN) baseline for skeleton-based action recognition.
On two large-scale datasets, i.e., NTU RGB+D 60 and 120, the proposed EfficientGCN-B4 baseline outperforms other State-Of-The-Art (SOTA) methods.
arXiv Detail & Related papers (2021-06-29T07:09:11Z) - Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with outrageous large amount of parameters but constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves the model quality but maintains constant computational costs, and our further exploration on extremely large-scale models reflects that it is more effective in training larger models.
arXiv Detail & Related papers (2021-05-31T16:12:44Z) - $S^3$Net: Semantic-Aware Self-supervised Depth Estimation with Monocular Videos and Synthetic Data [11.489124536853172]
$S^3$Net is a self-supervised framework which combines synthetic and real-world images for training. We present a unique way to train this self-supervised framework, and achieve more than $15\%$ improvement over previous synthetic supervised approaches.
arXiv Detail & Related papers (2020-07-28T22:40:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.