End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer
- URL: http://arxiv.org/abs/2511.13208v1
- Date: Mon, 17 Nov 2025 10:19:35 GMT
- Title: End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer
- Authors: Yonghui Yu, Jiahang Cai, Xun Wang, Wenwu Yang
- Abstract summary: We present a fully end-to-end framework for multi-person 2D pose estimation in videos. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. We introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames.
- Score: 7.19764062839405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single-person pose estimation. This design relies on heuristic operations such as detection, RoI cropping, and non-maximum suppression (NMS), limiting both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos, effectively eliminating heuristic operations. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware attention mechanism that enables each pose query to selectively aggregate features corresponding to the same individual across consecutive frames. Additionally, we explicitly model spatiotemporal dependencies among pose keypoints to improve accuracy. Notably, our approach is the first end-to-end method for multi-frame 2D human pose estimation. Extensive experiments show that PAVE-Net substantially outperforms prior image-based end-to-end methods, achieving a 6.0 mAP improvement on PoseTrack2017, and delivers accuracy competitive with state-of-the-art two-stage video-based approaches, while offering significant gains in efficiency. Project page: https://github.com/zgspose/PAVENet
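The pose-aware attention described above can be illustrated with a minimal sketch: each pose query attends over features from all frames, but an association mask restricts it to tokens belonging to the same individual. This is not the paper's implementation; the function name, tensor layout, and `assoc_mask` input are hypothetical, and how the mask is produced is omitted.

```python
import torch

def pose_aware_attention(queries, frame_feats, assoc_mask):
    """Aggregate per-person features across frames via masked attention.

    queries:     (N, D)    one query per person track
    frame_feats: (T, M, D) M feature tokens for each of T frames
    assoc_mask:  (N, T, M) True where a token belongs to the query's person
    """
    T, M, D = frame_feats.shape
    feats = frame_feats.reshape(T * M, D)              # flatten time into one token axis
    scores = queries @ feats.T / D ** 0.5              # (N, T*M) scaled dot-product logits
    mask = assoc_mask.reshape(-1, T * M)
    scores = scores.masked_fill(~mask, float("-inf"))  # keep only same-person tokens
    attn = torch.softmax(scores, dim=-1)
    return attn @ feats                                # (N, D) per-person temporal aggregate
```

The masking is what distinguishes this from plain cross-attention: a query cannot pull in features from an overlapping trajectory of a different person, which is the association failure mode the abstract highlights.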
Related papers
- DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation [72.89376712495464]
DAGE is a dual-stream transformer that disentangles global coherence from fine detail. A low-resolution stream operates on aggressively downsampled frames with alternating frame/global attention to build a view-consistent representation. A high-resolution stream processes the original images per-frame to preserve sharp boundaries and small structures. This design scales resolution and clip length independently, supports inputs up to 2K, and maintains practical inference cost.
arXiv Detail & Related papers (2026-03-04T05:29:29Z) - PoseDiff: A Unified Diffusion Model Bridging Robot Pose Estimation and Video-to-Action Control [67.17998939712326]
We present PoseDiff, a conditional diffusion model that unifies robot state estimation and control within a single framework. At its core, PoseDiff maps raw visual observations into structured robot states, such as 3D keypoints or joint angles, from a single RGB image. Building upon this foundation, PoseDiff extends naturally to video-to-action inverse dynamics.
arXiv Detail & Related papers (2025-09-29T10:55:48Z) - An End-to-End Framework for Video Multi-Person Pose Estimation [3.090225730976977]
We propose VEPE (Video End-to-End Pose Estimation), a simple and flexible framework for end-to-end pose estimation in video. We show that our approach outperforms two-stage models and improves inference speed by 300%.
arXiv Detail & Related papers (2025-09-01T03:34:57Z) - UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z) - Generative Hierarchical Temporal Transformer for Hand Pose and Action Modeling [67.94143911629143]
We propose a generative Transformer VAE architecture to model hand pose and action.
To faithfully model the semantic dependency and different temporal granularity of hand pose and action, we decompose the framework into two cascaded VAE blocks.
Results show that our joint modeling of recognition and prediction improves over isolated solutions.
arXiv Detail & Related papers (2023-11-29T05:28:39Z) - DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation [16.32910684198013]
We present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem.
We show two unique characteristics from DiffPose on pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model.
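Characteristic (ii) above, adjusting the number of iterative steps without retraining, can be sketched as a generic diffusion-style refinement loop: the step count is just an inference-time argument to the sampling loop. This is a schematic illustration, not DiffPose's actual sampler; the `denoiser` interface and function names are assumptions.

```python
import torch

def refine_heatmaps(denoiser, cond_feats, shape, num_steps=4):
    """Iteratively refine pose heatmaps from Gaussian noise.

    num_steps is chosen at inference time; the trained denoiser is
    reused unchanged, so refinement depth needs no retraining.
    """
    h = torch.randn(shape)                # start from pure noise
    for t in reversed(range(num_steps)):
        # each step predicts a cleaner heatmap, conditioned on image features
        h = denoiser(h, cond_feats, t)
    return h.sigmoid()                    # per-joint probability maps in [0, 1]
```

Raising `num_steps` trades compute for refinement quality on hard joints; lowering it speeds up inference with the same weights.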
arXiv Detail & Related papers (2023-07-31T14:00:23Z) - TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z) - Kinematic-aware Hierarchical Attention Network for Human Pose Estimation in Videos [17.831839654593452]
Previous video-based human pose estimation methods have shown promising results by leveraging features of consecutive frames. However, most approaches trade accuracy for jitter suppression and do not fully capture the temporal dynamics of human motion.
We design an architecture that exploits kinematic keypoint features.
arXiv Detail & Related papers (2022-11-29T01:46:11Z) - OTPose: Occlusion-Aware Transformer for Pose Estimation in Sparsely-Labeled Videos [21.893572076171527]
We propose a method that leverages an attention mask for occluded joints and encodes temporal dependency between frames using transformers.
We achieve state-of-the-art pose estimation results for PoseTrack 2017 and PoseTrack 2018 datasets.
arXiv Detail & Related papers (2022-07-20T08:06:06Z) - Self-Attentive 3D Human Pose and Shape Estimation from Videos [82.63503361008607]
We present a video-based learning algorithm for 3D human pose and shape estimation.
We exploit temporal information in videos and propose a self-attention module.
We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
arXiv Detail & Related papers (2021-03-26T00:02:19Z) - Deep Dual Consecutive Network for Human Pose Estimation [44.41818683253614]
We propose a novel multi-frame human pose estimation framework, leveraging abundant temporal cues between video frames to facilitate keypoint detection.
Our method ranks No.1 in the Multi-frame Person Pose Estimation Challenge on the large-scale benchmark datasets PoseTrack 2017 and PoseTrack 2018.
arXiv Detail & Related papers (2021-03-12T13:11:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.