Related papers: HDiffTG: A Lightweight Hybrid Diffusion-Transformer-GCN Architecture for 3D Human Pose Estimation

HDiffTG: A Lightweight Hybrid Diffusion-Transformer-GCN Architecture for 3D Human Pose Estimation

URL: http://arxiv.org/abs/2505.04276v1
Date: Wed, 07 May 2025 09:26:37 GMT
Title: HDiffTG: A Lightweight Hybrid Diffusion-Transformer-GCN Architecture for 3D Human Pose Estimation
Authors: Yajie Fu, Chaorui Huang, Junwei Li, Hui Kong, Yibin Tian, Huakang Li, Zhiyuan Zhang,
Abstract summary: HDiffTG is a novel 3D Human Pose (3DHCN) method that integrates Transformer, Graph Convolutional Network (GCN), and diffusion model into a unified framework.<n>We show that HDiffTG significantly improves pose estimation accuracy and robustness while maintaining a lightweight design.
Score: 21.823965837699166
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose HDiffTG, a novel 3D Human Pose Estimation (3DHPE) method that integrates Transformer, Graph Convolutional Network (GCN), and diffusion model into a unified framework. HDiffTG leverages the strengths of these techniques to significantly improve pose estimation accuracy and robustness while maintaining a lightweight design. The Transformer captures global spatiotemporal dependencies, the GCN models local skeletal structures, and the diffusion model provides step-by-step optimization for fine-tuning, achieving a complementary balance between global and local features. This integration enhances the model's ability to handle pose estimation under occlusions and in complex scenarios. Furthermore, we introduce lightweight optimizations to the integrated model and refine the objective function design to reduce computational overhead without compromising performance. Evaluation results on the Human3.6M and MPI-INF-3DHP datasets demonstrate that HDiffTG achieves state-of-the-art (SOTA) performance on the MPI-INF-3DHP dataset while excelling in both accuracy and computational efficiency. Additionally, the model exhibits exceptional robustness in noisy and occluded environments. Source codes and models are available at https://github.com/CirceJie/HDiffTG

Related papers

Efficient Federated Learning with Heterogeneous Data and Adaptive Dropout [62.73150122809138]
Federated Learning (FL) is a promising distributed machine learning approach that enables collaborative training of a global model using multiple edge devices.<n>We propose the FedDHAD FL framework, which comes with two novel methods: Dynamic Heterogeneous model aggregation (FedDH) and Adaptive Dropout (FedAD)<n>The combination of these two methods makes FedDHAD significantly outperform state-of-the-art solutions in terms of accuracy (up to 6.7% higher), efficiency (up to 2.02 times faster), and cost (up to 15.0% smaller)
arXiv Detail & Related papers (2025-07-14T16:19:00Z)
On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation [52.96632954620623]
We introduce a novel geometry-aware PEFT module specifically designed for 3D point cloud transformers.<n>Our approach sets a new benchmark for efficient, scalable, and geometry-aware fine-tuning of large-scale 3D point cloud models.
arXiv Detail & Related papers (2025-05-28T15:08:36Z)
Dynamic 3D KAN Convolution with Adaptive Grid Optimization for Hyperspectral Image Classification [12.168520751389622]
KANet is an improved 3D-DenseNet model, consisting of 3D KAN Conv and an adaptive grid update mechanism.<n> KANet enhances model representation capability through a 3D dynamic expert convolution system without increasing network depth or width.<n>The proposed method demonstrates superior performance on IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification approaches.
arXiv Detail & Related papers (2025-04-21T14:57:48Z)
UAVTwin: Neural Digital Twins for UAVs using Gaussian Splatting [57.63613048492219]
We present UAVTwin, a method for creating digital twins from real-world environments and facilitating data augmentation for training downstream models embedded in unmanned aerial vehicles (UAVs)<n>This is achieved by integrating 3D Gaussian Splatting (3DGS) for reconstructing backgrounds along with controllable synthetic human models that display diverse appearances and actions in multiple poses.
arXiv Detail & Related papers (2025-04-02T22:17:30Z)
Factorized Implicit Global Convolution for Automotive Computational Fluid Dynamics Prediction [52.32698071488864]
We propose Factorized Implicit Global Convolution (FIGConv), a novel architecture that efficiently solves CFD problems for very large 3D meshes.<n>FIGConv achieves quadratic complexity $O(N2)$, a significant improvement over existing 3D neural CFD models.<n>We validate our approach on the industry-standard Ahmed body dataset and the large-scale DrivAerNet dataset.
arXiv Detail & Related papers (2025-02-06T18:57:57Z)
Hyperspectral Images Efficient Spatial and Spectral non-Linear Model with Bidirectional Feature Learning [7.06787067270941]
We propose a novel framework that significantly reduces data volume while enhancing classification accuracy.<n>Our model employs a bidirectional reversed convolutional neural network (CNN) to efficiently extract spectral features, complemented by a specialized block for spatial feature analysis.
arXiv Detail & Related papers (2024-11-29T23:32:26Z)
ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks. We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation. Our purelytemporalal architecture framework, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z)
OccLoff: Learning Optimized Feature Fusion for 3D Occupancy Prediction [5.285847977231642]
3D semantic occupancy prediction is crucial for ensuring the safety in autonomous driving. Existing fusion-based occupancy methods typically involve performing a 2D-to-3D view transformation on image features. We propose OccLoff, a framework that Learns to optimize Feature Fusion for 3D occupancy prediction.
arXiv Detail & Related papers (2024-11-06T06:34:27Z)
Lightweight Deep Learning Framework for Accurate Particle Flow Energy Reconstruction [8.598010350935596]
This paper systematically evaluates a deep learning reconstruction framework.<n>We design a hybrid loss function combining weighted mean squared with error structural similarity index.<n>We enhance the model's capability to capture cross-modaltemporal correlations and energy-displacement nonlinearities.
arXiv Detail & Related papers (2024-10-08T11:49:18Z)
Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach to exploit Transformer-temporal information with a concise Graph and Skipped Transformer architecture. Specifically, in 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model. Experiments are conducted on Human3.6M, MPI-INF-3DHP and Human-Eva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z)
UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation. It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
DiffHPE: Robust, Coherent 3D Human Pose Lifting with Diffusion [54.0238087499699]
We show that diffusion models enhance the accuracy, robustness, and coherence of human pose estimations. We introduce DiffHPE, a novel strategy for harnessing diffusion models in 3D-HPE. Our findings indicate that while standalone diffusion models provide commendable performance, their accuracy is even better in combination with supervised models.
arXiv Detail & Related papers (2023-09-04T12:54:10Z)
Learned Vertex Descent: A New Direction for 3D Human Model Fitting [64.04726230507258]
We propose a novel optimization-based paradigm for 3D human model fitting on images and scans. Our approach is able to capture the underlying body of clothed people with very different body shapes, achieving a significant improvement compared to state-of-the-art. LVD is also applicable to 3D model fitting of humans and hands, for which we show a significant improvement to the SOTA with a much simpler and faster method.
arXiv Detail & Related papers (2022-05-12T17:55:51Z)
Generative Multi-Stream Architecture For American Sign Language Recognition [15.717424753251674]
Training on datasets with low feature-richness for complex applications limit optimal convergence below human performance. We propose a generative multistream architecture, eliminating the need for additional hardware with the intent to improve feature convergence without risking impracticability. Our methods have achieved 95.62% validation accuracy with a variance of 1.42% from training, outperforming past models by 0.45% in validation accuracy and 5.53% in variance.
arXiv Detail & Related papers (2020-03-09T21:04:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.