Related papers: An End-to-End Framework for Video Multi-Person Pose Estimation

URL: http://arxiv.org/abs/2509.01095v1
Date: Mon, 01 Sep 2025 03:34:57 GMT
Title: An End-to-End Framework for Video Multi-Person Pose Estimation
Authors: Zhihong Wei,
Abstract summary: We propose VEPE (Video End-to-End Pose Estimation), a simple and flexible framework for end-to-end pose estimation in video. Our approach outperforms most two-stage models and improves inference efficiency by 300%.
Score: 3.090225730976977
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video-based human pose estimation models aim to address scenarios that cannot be effectively solved by static image models such as motion blur, out-of-focus and occlusion. Most existing approaches consist of two stages: detecting human instances in each image frame and then using a temporal model for single-person pose estimation. This approach separates the spatial and temporal dimensions and cannot capture the global spatio-temporal context between spatial instances for end-to-end optimization. In addition, it relies on separate detectors and complex post-processing such as RoI cropping and NMS, which reduces the inference efficiency of the video scene. To address the above problems, we propose VEPE (Video End-to-End Pose Estimation), a simple and flexible framework for end-to-end pose estimation in video. The framework utilizes three crucial spatio-temporal Transformer components: the Spatio-Temporal Pose Encoder (STPE), the Spatio-Temporal Deformable Memory Encoder (STDME), and the Spatio-Temporal Pose Decoder (STPD). These components are designed to effectively utilize temporal context for optimizing human body pose estimation. Furthermore, to reduce the mismatch problem during the cross-frame pose query matching process, we propose an instance consistency mechanism, which aims to enhance the consistency and discrepancy of the cross-frame instance query and realize the instance tracking function, which in turn accurately guides the pose query to perform cross-frame matching. Extensive experiments on the Posetrack dataset show that our approach outperforms most two-stage models and improves inference efficiency by 300%.

Related papers

End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer [7.19764062839405]
We present a fully end-to-end framework for multi-person 2D pose estimation in videos. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. We introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a temporal pose decoder.
arXiv Detail & Related papers (2025-11-17T10:19:35Z)
PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis [82.87579563469039]
Pairwise camera pose estimation from sparsely overlapping image pairs remains a critical and unsolved challenge in 3D vision. Recent approaches attempt to address this by synthesizing intermediate frames using video and selecting key frames via a self-consistency score. We propose Hybrid Video Generation (HVG) to synthesize clearer intermediate frames by coupling a video model with a pose-conditioned novel view model. We also propose a Feature Matching Selector (FMS) based on feature correspondence to select intermediate frames appropriate for pose estimation from the synthesized results.
arXiv Detail & Related papers (2025-10-22T12:32:37Z)
GaVS: 3D-Grounded Video Stabilization via Temporally-Consistent Local Reconstruction and Rendering [54.489285024494855]
Video stabilization is pivotal for video processing, as it removes unwanted shakiness while preserving the original user motion intent. Existing approaches, depending on the domain they operate in, suffer from several issues that degrade the user experience. We introduce GaVS, a novel 3D-grounded approach that reformulates video stabilization as a temporally-consistent 'local reconstruction and rendering' paradigm.
arXiv Detail & Related papers (2025-06-30T15:24:27Z)
UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation. It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
Towards Robust and Expressive Whole-body Human Pose and Shape Estimation [51.457517178632756]
Whole-body pose and shape estimation aims to jointly predict different behaviors of the entire human body from a monocular image. Existing methods often exhibit degraded performance under the complexity of in-the-wild scenarios. We propose a novel framework to enhance the robustness of whole-body pose and shape estimation.
arXiv Detail & Related papers (2023-12-14T08:17:42Z)
DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation [16.32910684198013]
We present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem. We show two unique characteristics from DiffPose on pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model.
arXiv Detail & Related papers (2023-07-31T14:00:23Z)
Kinematic-aware Hierarchical Attention Network for Human Pose Estimation in Videos [17.831839654593452]
Previous video-based human pose estimation methods have shown promising results by leveraging features of consecutive frames. Most approaches compromise accuracy to mitigate jitter or do not comprehend the temporal aspects of human motion. We design an architecture that exploits kinematic keypoint features.
arXiv Detail & Related papers (2022-11-29T01:46:11Z)
OTPose: Occlusion-Aware Transformer for Pose Estimation in Sparsely-Labeled Videos [21.893572076171527]
We propose a method that leverages an attention mask for occluded joints and encodes temporal dependency between frames using transformers. We achieve state-of-the-art pose estimation results for PoseTrack 2017 and PoseTrack 2018 datasets.
arXiv Detail & Related papers (2022-07-20T08:06:06Z)
MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video [75.23812405203778]
Recent solutions have been introduced to estimate 3D human pose from a 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation. We propose MixSTE, which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation. In addition, the network output is extended from the central frame to all frames of the input video, improving the coherence between the input and output sequences.
arXiv Detail & Related papers (2022-03-02T04:20:59Z)
Self-Attentive 3D Human Pose and Shape Estimation from Videos [82.63503361008607]
We present a video-based learning algorithm for 3D human pose and shape estimation. We exploit temporal information in videos and propose a self-attention module. We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
arXiv Detail & Related papers (2021-03-26T00:02:19Z)
Deep Dual Consecutive Network for Human Pose Estimation [44.41818683253614]
We propose a novel multi-frame human pose estimation framework, leveraging abundant temporal cues between video frames to facilitate keypoint detection. Our method ranks No.1 in the Multi-frame Person Pose Challenge on the large-scale benchmark datasets PoseTrack 2017 and PoseTrack 2018.
arXiv Detail & Related papers (2021-03-12T13:11:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.