Learning semantical dynamics and spatiotemporal collaboration for human pose estimation in video
- URL: http://arxiv.org/abs/2502.10616v1
- Date: Sat, 15 Feb 2025 00:35:34 GMT
- Title: Learning semantical dynamics and spatiotemporal collaboration for human pose estimation in video
- Authors: Runyang Feng, Haoming Chen, et al.
- Abstract summary: We present a novel framework that learns multi-level semantical dynamics and dense spatio-temporal collaboration for multi-frame human pose estimation.
Specifically, we first design a multi-masked context and pose reconstruction strategy.
This strategy stimulates the model to explore multi-granularity spatiotemporal semantic relationships among frames by progressively masking the features of (patch) cubes and frames.
- Score: 3.2195139886901813
- Abstract: Temporal modeling and spatio-temporal collaboration are pivotal techniques for video-based human pose estimation. Most state-of-the-art methods adopt optical flow or temporal difference, learning local visual content correspondence across frames at the pixel level, to capture motion dynamics. However, such a paradigm essentially relies on localized pixel-to-pixel similarity, which neglects the semantical correlations among frames and is vulnerable to image quality degradations (e.g. occlusions or blur). Moreover, existing approaches often combine motion and spatial (appearance) features via simple concatenation or summation, leading to practical challenges in fully leveraging these distinct modalities. In this paper, we present a novel framework that learns multi-level semantical dynamics and dense spatio-temporal collaboration for multi-frame human pose estimation. Specifically, we first design a Multi-Level Semantic Motion Encoder using a multi-masked context and pose reconstruction strategy. This strategy stimulates the model to explore multi-granularity spatiotemporal semantic relationships among frames by progressively masking the features of (patch) cubes and frames. We further introduce a Spatial-Motion Mutual Learning module which densely propagates and consolidates context information from spatial and motion features to enhance the capability of the model. Extensive experiments demonstrate that our approach sets new state-of-the-art results on three benchmark datasets, PoseTrack2017, PoseTrack2018, and PoseTrack21.
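As a concrete, if simplified, illustration of the multi-masked context and pose reconstruction strategy described in the abstract, here is a minimal PyTorch-style sketch of progressive masking at two granularities (patch cubes and whole frames). The helper names, shapes, and masking schedule are assumptions for illustration, not the authors' released code.

```python
# Hedged sketch of multi-granularity feature masking; all names,
# shapes, and the schedule below are illustrative assumptions.
import torch

def mask_patch_cubes(feats: torch.Tensor, ratio: float) -> torch.Tensor:
    """Zero out randomly chosen spatio-temporal (patch) cubes.

    feats: (B, T, C, H, W) multi-frame feature maps.
    """
    b, t, c, h, w = feats.shape
    # One Bernoulli mask per spatial location, shared across channels
    # and frames, so an entire temporal "cube" of patches is hidden.
    keep = (torch.rand(b, 1, 1, h, w, device=feats.device) > ratio).float()
    return feats * keep

def mask_frames(feats: torch.Tensor, ratio: float) -> torch.Tensor:
    """Zero out whole frames, the coarsest masking granularity."""
    b, t, _, _, _ = feats.shape
    keep = (torch.rand(b, t, 1, 1, 1, device=feats.device) > ratio).float()
    return feats * keep

def progressive_mask(feats: torch.Tensor, step: int, total_steps: int) -> torch.Tensor:
    """Progressively raise the masking ratio over training (assumed
    schedule), forcing the encoder to reconstruct context and pose from
    semantic relationships rather than local pixel similarity."""
    ratio = 0.25 + 0.5 * step / max(total_steps, 1)
    feats = mask_patch_cubes(feats, ratio)
    return mask_frames(feats, ratio * 0.5)
```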
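The Spatial-Motion Mutual Learning module is described as densely propagating and consolidating context between spatial and motion features. The sketch below shows one plausible realization using bidirectional cross-attention in place of simple concatenation or summation; the module layout and dimensions are assumptions, not the paper's exact design.

```python
# Assumption-level sketch of dense spatial-motion mutual learning
# via bidirectional cross-attention.
import torch
import torch.nn as nn

class SpatialMotionMutualLearning(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Each modality attends to the other to exchange context.
        self.motion_from_spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_from_motion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, spatial: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        """spatial, motion: (B, N, C) token sequences (N = H*W patches)."""
        # Densely propagate context in both directions.
        m2s, _ = self.spatial_from_motion(spatial, motion, motion)
        s2m, _ = self.motion_from_spatial(motion, spatial, spatial)
        # Consolidate the mutually enhanced features.
        return self.fuse(torch.cat([spatial + m2s, motion + s2m], dim=-1))
```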
Related papers
- Poseidon: A ViT-based Architecture for Multi-Frame Pose Estimation with Adaptive Frame Weighting and Multi-Scale Feature Fusion [43.59385149982744]
Single-frame pose estimation has seen significant progress, but it often fails to capture the temporal dynamics needed to understand complex, continuous movements.
We propose Poseidon, a novel multi-frame pose estimation architecture that extends the ViTPose model by integrating temporal information.
Our approach achieves state-of-the-art performance on the PoseTrack21 and PoseTrack18 datasets, with mAP scores of 88.3 and 87.8, respectively.
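The adaptive frame weighting mentioned in this entry can be pictured as learning a relevance score per frame and fusing multi-frame features accordingly. The sketch below is an illustrative guess at such a mechanism, not Poseidon's actual design.

```python
# Hypothetical adaptive frame weighting: score each frame, then
# fuse features by the softmax-normalized weights.
import torch
import torch.nn as nn

class AdaptiveFrameWeighting(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-frame relevance score

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        """frame_feats: (B, T, C) pooled per-frame descriptors."""
        weights = torch.softmax(self.score(frame_feats), dim=1)  # (B, T, 1)
        # Weighted sum emphasizes the frames most useful for the pose.
        return (weights * frame_feats).sum(dim=1)
```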
arXiv Detail & Related papers (2025-01-14T21:34:34Z) - Joint-Motion Mutual Learning for Pose Estimation in Videos [21.77871402339573]
Human pose estimation in videos has long been a compelling yet challenging task within the realm of computer vision.
Recent methods strive to integrate multi-frame visual features generated by a backbone network for pose estimation.
We propose a novel joint-motion mutual learning framework for pose estimation.
arXiv Detail & Related papers (2024-08-05T07:37:55Z) - Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling [70.34875558830241]
We present a way of learning a spatio-temporal (4D) embedding, based on semantic gears, to allow for stratified modeling of the dynamic regions of the rendered scene.
At the same time, almost for free, our approach enables free-viewpoint tracking of objects of interest - a functionality not yet achieved by existing NeRF-based methods.
arXiv Detail & Related papers (2024-06-06T03:37:39Z) - Learning In-between Imagery Dynamics via Physical Latent Spaces [0.7366405857677226]
We present a framework designed to learn the underlying dynamics between two images observed at consecutive time steps.
By incorporating a latent variable that follows a physical model expressed in partial differential equations (PDEs), our approach ensures the interpretability of the learned model.
We demonstrate the robustness and effectiveness of our learning framework through a series of numerical tests using geoscientific imagery data.
arXiv Detail & Related papers (2023-10-14T05:14:51Z) - Spatio-Temporal Branching for Motion Prediction using Motion Increments [55.68088298632865]
Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications.
Traditional methods rely on hand-crafted features and machine learning techniques.
We propose a novel spatio-temporal branching network using incremental information for HMP.
arXiv Detail & Related papers (2023-08-02T12:04:28Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures trained on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video [16.32910684198013]
We present a novel multi-frame human pose estimation framework, which employs temporal differences across frames to model dynamic contexts.
To be specific, we design a multi-stage Temporal Difference Encoder that performs incremental cascaded learning conditioned on multi-stage feature difference sequences to derive informative motion representations.
This approach also places us at rank No.1 in the Crowd Pose Estimation in Complex Events Challenge on the HiEve benchmark.
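To make the temporal-difference idea behind this entry concrete, the sketch below derives motion cues from frame-to-frame feature differences and encodes them. The stage count and encoder layout are assumptions, not the paper's exact architecture.

```python
# Assumption-level sketch: motion cues as differences between
# adjacent frames' feature maps, rather than optical flow.
import torch
import torch.nn as nn

def temporal_differences(feats: torch.Tensor) -> torch.Tensor:
    """feats: (B, T, C, H, W) -> (B, T-1, C, H, W) frame-to-frame deltas."""
    return feats[:, 1:] - feats[:, :-1]

class DifferenceEncoder(nn.Module):
    """Encodes stacked feature differences into a motion representation."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        diffs = temporal_differences(feats)           # (B, T-1, C, H, W)
        b, t, c, h, w = diffs.shape
        motion = self.conv(diffs.reshape(b * t, c, h, w))
        return motion.reshape(b, t, c, h, w).mean(dim=1)  # aggregate over steps
```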
arXiv Detail & Related papers (2023-03-15T09:29:03Z) - Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architecture and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z) - Neural Rendering of Humans in Novel View and Pose from Monocular Video [68.37767099240236]
We introduce a new method that generates photo-realistic humans under novel views and poses given a monocular video as input.
Our method significantly outperforms existing approaches under unseen poses and novel views given monocular videos as input.
arXiv Detail & Related papers (2022-04-04T03:09:20Z) - Motion Prediction via Joint Dependency Modeling in Phase Space [40.54430409142653]
We introduce a novel convolutional neural model to leverage explicit prior knowledge of motion anatomy.
We then propose a global optimization module that learns the implicit relationships between individual joint features.
Our method is evaluated on large-scale 3D human motion benchmark datasets.
arXiv Detail & Related papers (2022-01-07T08:30:01Z) - ARVo: Learning All-Range Volumetric Correspondence for Video Deblurring [92.40655035360729]
Video deblurring models exploit consecutive frames to remove blurs from camera shakes and object motions.
We propose a novel implicit method to learn spatial correspondence among blurry frames in the feature space.
Our proposed method is evaluated on the widely-adopted DVD dataset, along with a newly collected High-Frame-Rate (1000 fps) dataset for Video Deblurring.
arXiv Detail & Related papers (2021-03-07T04:33:13Z)
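Feature-space correspondence among frames, as in the ARVo entry above, is commonly built from an all-pairs correlation volume. The following sketch shows that generic construction; it is an assumption-level illustration, not ARVo's exact method.

```python
# Generic all-pairs correlation volume between two frames' features;
# an illustrative construction, not ARVo's specific implementation.
import torch

def correlation_volume(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """f1, f2: (B, C, H, W) -> (B, H*W, H*W) all-range correlations."""
    b, c, h, w = f1.shape
    a = f1.reshape(b, c, h * w)
    bb = f2.reshape(b, c, h * w)
    # Scaled dot-product similarity between every pair of positions.
    return torch.einsum('bci,bcj->bij', a, bb) / c ** 0.5
```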
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.