PyCAT4: A Hierarchical Vision Transformer-based Framework for 3D Human Pose Estimation
- URL: http://arxiv.org/abs/2508.02806v1
- Date: Mon, 04 Aug 2025 18:23:31 GMT
- Title: PyCAT4: A Hierarchical Vision Transformer-based Framework for 3D Human Pose Estimation
- Authors: Zongyou Yang, Jonathan Loo
- Abstract summary: This study aims to deeply optimize and improve the existing PyMAF network architecture. The new PyCAT4 model is validated through experiments on the COCO and 3DPW datasets.
- Score: 0.8149086480055433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, a significant improvement in the accuracy of 3D human pose estimation has been achieved by combining convolutional neural networks (CNNs) with pyramid grid alignment feedback loops. Additionally, innovative breakthroughs have been made in the field of computer vision through the adoption of Transformer-based temporal analysis architectures. Given these advancements, this study aims to deeply optimize and improve the existing PyMAF network architecture. The main innovations of this paper include: (1) introducing a Transformer feature extraction network layer based on self-attention mechanisms to enhance the capture of low-level features; (2) enhancing the understanding and capture of temporal signals in video sequences through feature temporal fusion techniques; (3) implementing spatial pyramid structures to achieve multi-scale feature fusion, effectively balancing feature representation differences across scales. The new PyCAT4 model obtained in this study is validated through experiments on the COCO and 3DPW datasets. The results demonstrate that the proposed improvement strategies significantly enhance the network's detection capability in human pose estimation, further advancing the development of human pose estimation technology.
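Innovation (1) above, a self-attention layer over low-level feature tokens, can be sketched minimally. This is an illustrative NumPy example of plain scaled dot-product self-attention, not PyCAT4's actual implementation; the token count and dimensions are arbitrary assumptions.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of feature tokens.

    x: (seq_len, d_model) tokens, e.g. flattened spatial feature vectors.
    w_q, w_k, w_v: (d_model, d_k) learned projection matrices.
    Returns attended features of shape (seq_len, d_k).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ v

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 32))                  # 16 tokens, 32-dim each
w_q, w_k, w_v = (rng.normal(size=(32, 8)) for _ in range(3))
out = self_attention(tokens, w_q, w_k, w_v)
print(out.shape)  # (16, 8)
```

In a full model this block would sit between convolutional feature extraction and the pyramid fusion stages, letting every spatial location attend to all others.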
Related papers
- BHViT: Binarized Hybrid Vision Transformer [53.38894971164072]
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNNs). We propose BHViT, a binarization-friendly hybrid ViT architecture, together with its fully binarized model, guided by three important observations. Our proposed algorithm achieves SOTA performance among binary ViT methods.
arXiv Detail & Related papers (2025-03-04T08:35:01Z) - ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks.
We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation.
Our framework, named ALOcc, achieves an optimal trade-off between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z) - PEP-GS: Perceptually-Enhanced Precise Structured 3D Gaussians for View-Adaptive Rendering [3.1006820631993515]
3D Gaussian Splatting (3D-GS) has achieved significant success in real-time, high-quality 3D scene rendering. We introduce PEP-GS, a perceptually-enhanced framework that dynamically predicts Gaussian attributes, including opacity, color, and covariance. We show that PEP-GS outperforms state-of-the-art methods, particularly in challenging scenarios involving view-dependent effects and fine-scale details.
arXiv Detail & Related papers (2024-11-08T17:42:02Z) - Learning Global and Local Features of Power Load Series Through Transformer and 2D-CNN: An Image-based Multi-step Forecasting Approach Incorporating Phase Space Reconstruction [1.9458156037869137]
This study proposes a novel multi-step forecasting approach that carefully integrates PSR with neural networks to establish an end-to-end learning system. A novel deep learning model, PSR-GALIEN, is designed, in which a Transformer and a 2D-CNN extract the global and local patterns in the image.
The results show that, compared with six state-of-the-art deep learning models, the forecasting performance of PSR-GALIEN consistently surpasses these baselines.
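The PSR step this entry relies on is classical time-delay embedding. Below is a hedged, self-contained NumPy sketch of that embedding, not the paper's code; the function name, the toy sine "load" signal, and the (dim, tau) values are illustrative assumptions.

```python
import numpy as np

def delay_embed(series, dim, tau):
    """Phase space reconstruction (PSR) by time-delay embedding.

    Maps a 1-D series into dim-dimensional state vectors
    [x(t), x(t+tau), ..., x(t+(dim-1)*tau)]; PSR-based forecasters
    typically render these states as a 2-D image for CNN feature extraction.
    """
    n = len(series) - (dim - 1) * tau
    if n <= 0:
        raise ValueError("series too short for this (dim, tau)")
    return np.stack([series[i * tau : i * tau + n] for i in range(dim)], axis=1)

x = np.sin(np.linspace(0, 8 * np.pi, 200))  # toy "load" signal
states = delay_embed(x, dim=3, tau=5)
print(states.shape)  # (190, 3): one 3-D state per usable time step
```

Each row pairs a sample with its tau- and 2*tau-delayed neighbors, so periodic structure in the series becomes geometric structure in the embedded space.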
arXiv Detail & Related papers (2024-07-16T09:59:13Z) - Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach that exploits spatio-temporal information with a concise Graph and Skipped Transformer architecture. Specifically, in the 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model. Experiments are conducted on the Human3.6M, MPI-INF-3DHP and HumanEva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z) - Masked Generative Extractor for Synergistic Representation and 3D Generation of Point Clouds [6.69660410213287]
We propose an innovative framework called Point-MGE to explore the benefits of deeply integrating 3D representation learning and generative learning.
In shape classification, Point-MGE achieved an accuracy of 94.2% (+1.0%) on the ModelNet40 dataset and 92.9% (+5.5%) on the ScanObjectNN dataset.
Experimental results also confirmed that Point-MGE can generate high-quality 3D shapes in both unconditional and conditional settings.
arXiv Detail & Related papers (2024-06-25T07:57:03Z) - Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective [64.04617968947697]
We introduce a novel data-model co-design perspective to promote superior weight sparsity. Specifically, customized visual prompts are mounted to upgrade neural network sparsification in our proposed VPNs framework.
arXiv Detail & Related papers (2023-12-03T13:50:24Z) - Deepfake Detection: Leveraging the Power of 2D and 3D CNN Ensembles [0.0]
This work presents an innovative approach to validate video content.
The methodology blends advanced 2D and 3D convolutional neural networks. Experimental validation underscores the effectiveness of this strategy, showcasing its potential in countering deepfake generation.
arXiv Detail & Related papers (2023-10-25T06:00:37Z) - EVOPOSE: A Recursive Transformer For 3D Human Pose Estimation With Kinematic Structure Priors [72.33767389878473]
We propose a transformer-based model EvoPose to introduce the human body prior knowledge for 3D human pose estimation effectively.
A Structural Priors Representation (SPR) module represents human priors as structural features carrying rich body patterns.
A Recursive Refinement (RR) module is applied to the 3D pose outputs, utilizing the estimated results while simultaneously injecting human priors.
arXiv Detail & Related papers (2023-06-16T04:09:16Z) - Learned Vertex Descent: A New Direction for 3D Human Model Fitting [64.04726230507258]
We propose a novel optimization-based paradigm for 3D human model fitting on images and scans.
Our approach is able to capture the underlying body of clothed people with very different body shapes, achieving a significant improvement over the state of the art.
LVD is also applicable to 3D model fitting of humans and hands, for which we show a significant improvement to the SOTA with a much simpler and faster method.
arXiv Detail & Related papers (2022-05-12T17:55:51Z) - BTranspose: Bottleneck Transformers for Human Pose Estimation with Self-Supervised Pre-Training [0.304585143845864]
In this paper, we consider the recently proposed Bottleneck Transformers, which effectively combine CNN and multi-head self-attention (MHSA) layers.
We consider different backbone architectures and pre-train them using the DINO self-supervised learning method.
Experiments show that our model achieves an AP of 76.4, which is competitive with other methods such as [1] and has fewer network parameters.
arXiv Detail & Related papers (2022-04-21T15:45:05Z) - Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z) - Feature-level augmentation to improve robustness of deep neural networks to affine transformations [22.323625542814284]
Recent studies revealed that convolutional neural networks do not generalize well to small image transformations.
We propose to introduce data augmentation at intermediate layers of the neural architecture, developing the capacity of the network to cope with such transformations.
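Feature-level augmentation can be illustrated with a minimal sketch: instead of translating the input image, a random spatial shift is applied to an intermediate feature map. This is an assumption-laden toy example (the function name, shift operator, and hyper-parameters are not from the paper).

```python
import numpy as np

def feature_shift(fmap, max_shift=2, rng=None):
    """Feature-level augmentation: random spatial translation applied to an
    intermediate feature map of shape (C, H, W) rather than to the input.

    Illustrative sketch only; the paper's augmentation operator and
    hyper-parameters may differ.
    """
    rng = rng or np.random.default_rng()
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    # Circular shift along the spatial axes; channels are left untouched.
    return np.roll(fmap, shift=(int(dy), int(dx)), axis=(1, 2))

fmap = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
aug = feature_shift(fmap, max_shift=1, rng=np.random.default_rng(1))
print(aug.shape)  # (2, 4, 4)
```

During training, such a layer would be applied stochastically between backbone stages, forcing downstream layers to tolerate small spatial perturbations of their inputs.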
arXiv Detail & Related papers (2022-02-10T17:14:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.