Mutual Information-Based Temporal Difference Learning for Human Pose
Estimation in Video
- URL: http://arxiv.org/abs/2303.08475v2
- Date: Mon, 8 May 2023 13:43:41 GMT
- Title: Mutual Information-Based Temporal Difference Learning for Human Pose
Estimation in Video
- Authors: Runyang Feng, Yixing Gao, Xueqing Ma, Tze Ho Elden Tse, Hyung Jin
Chang
- Abstract summary: We present a novel multi-frame human pose estimation framework that employs temporal differences across frames to model dynamic contexts.
Specifically, we design multi-stage entangled learning sequences conditioned on multi-stage differences to derive informative motion representation sequences.
These contributions rank us No. 1 in the Crowd Pose Estimation in Complex Events Challenge on the HiEve benchmark.
- Score: 16.32910684198013
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Temporal modeling is crucial for multi-frame human pose estimation. Most
existing methods directly employ optical flow or deformable convolution to
predict full-spectrum motion fields, which might incur numerous irrelevant
cues, such as a nearby person or background. Without further efforts to
excavate meaningful motion priors, their results are suboptimal, especially in
complicated spatiotemporal interactions. On the other hand, the temporal
difference has the ability to encode representative motion information which
can potentially be valuable for pose estimation but has not been fully
exploited. In this paper, we present a novel multi-frame human pose estimation
framework, which employs temporal differences across frames to model dynamic
contexts and engages mutual information objectively to facilitate useful motion
information disentanglement. To be specific, we design a multi-stage Temporal
Difference Encoder that performs incremental cascaded learning conditioned on
multi-stage feature difference sequences to derive informative motion
representation. We further propose a Representation Disentanglement module from
the mutual information perspective, which can grasp discriminative
task-relevant motion signals by explicitly defining useful and noisy
constituents of the raw motion features and minimizing their mutual
information. These contributions rank our method No. 1 in the Crowd Pose
Estimation in Complex Events Challenge on the HiEve benchmark dataset and
achieve state-of-the-art performance on three benchmarks: PoseTrack2017,
PoseTrack2018, and PoseTrack21.
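As a rough illustration of the two ideas named in the abstract, temporal feature differences and the mutual information between two signals, the following Python sketch may help. It is not the authors' implementation: the abstract does not say how the differences are taken or how mutual information is estimated, so the key-frame differencing, the histogram-based estimator, and the names `temporal_differences` and `mutual_information` are all assumptions made for illustration.

```python
import numpy as np


def temporal_differences(features: np.ndarray, key_index: int) -> np.ndarray:
    """Subtract the key frame's features from every frame's features.

    features: array of shape (T, C, H, W), one feature map per frame.
    Assumption: differences are taken against a single key frame; the
    paper's encoder may define them differently.
    """
    key = features[key_index]
    return features - key  # broadcasts the key frame over the T axis


def mutual_information(x: np.ndarray, y: np.ndarray, bins: int = 16) -> float:
    """Histogram-based estimate of I(X; Y) in nats.

    Assumption: a simple plug-in estimator, used here only to show what
    "low" and "high" mutual information look like numerically.
    """
    joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
    pxy = joint / joint.sum()            # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)  # marginal p(x), shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)  # marginal p(y), shape (1, bins)
    mask = pxy > 0                       # skip empty cells to avoid log(0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(5, 4, 8, 8))  # hypothetical per-frame features
    diffs = temporal_differences(frames, key_index=2)
    signal = rng.normal(size=10_000)
    noise = rng.normal(size=10_000)
    print("difference shape:", diffs.shape)
    print("MI(signal, signal):", mutual_information(signal, signal))
    print("MI(signal, noise): ", mutual_information(signal, noise))
```

In this toy setting, a signal shares far more information with itself than with independent noise. Under the same assumptions, a disentanglement objective would push the estimate between the "useful" and "noisy" parts of a motion feature toward the second, near-zero case.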
Related papers
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.
By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes.
We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z)
- Joint-Motion Mutual Learning for Pose Estimation in Videos [21.77871402339573]
Human pose estimation in videos has long been a compelling yet challenging task within the realm of computer vision.
Recent methods strive to integrate multi-frame visual features generated by a backbone network for pose estimation.
We propose a novel joint-motion mutual learning framework for pose estimation.
arXiv Detail & Related papers (2024-08-05T07:37:55Z)
- Spatio-Temporal Branching for Motion Prediction using Motion Increments [55.68088298632865]
Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications.
Traditional methods rely on hand-crafted features and machine learning techniques.
We propose a novel spatio-temporal branching network using incremental information for HMP.
arXiv Detail & Related papers (2023-08-02T12:04:28Z)
- DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose
Estimation [16.32910684198013]
We present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem.
We show two unique characteristics of DiffPose on the pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model.
arXiv Detail & Related papers (2023-07-31T14:00:23Z)
- Motion Prediction via Joint Dependency Modeling in Phase Space [40.54430409142653]
We introduce a novel convolutional neural model to leverage explicit prior knowledge of motion anatomy.
We then propose a global optimization module that learns the implicit relationships between individual joint features.
Our method is evaluated on large-scale 3D human motion benchmark datasets.
arXiv Detail & Related papers (2022-01-07T08:30:01Z)
- Exploring Motion and Appearance Information for Temporal Sentence
Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Investigating Pose Representations and Motion Contexts Modeling for 3D
Motion Prediction [63.62263239934777]
We conduct an in-depth study of various pose representations, focusing on their effects on the motion prediction task.
We propose a novel RNN architecture termed AHMR (Attentive Hierarchical Motion Recurrent network) for motion prediction.
Our approach outperforms the state-of-the-art methods in short-term prediction and achieves much enhanced long-term prediction proficiency.
arXiv Detail & Related papers (2021-12-30T10:45:22Z)
- Exploring Versatile Prior for Human Motion via Motion Frequency Guidance [32.50770614788775]
We propose a framework to learn a versatile motion prior, which models the inherent probability distribution of human motions.
For efficient prior representation learning, we propose a global orientation normalization to remove redundant environment information.
We then adopt a denoising training scheme to disentangle the environment information from input motion data in a learnable way.
arXiv Detail & Related papers (2021-11-25T13:24:44Z)
- Improving Robustness and Accuracy via Relative Information Encoding in
3D Human Pose Estimation [59.94032196768748]
We propose a relative information encoding method that yields positional and temporal enhanced representations.
Our method outperforms state-of-the-art methods on two public datasets.
arXiv Detail & Related papers (2021-07-29T14:12:19Z)
- Event-based Motion Segmentation with Spatio-Temporal Graph Cuts [51.17064599766138]
We have developed a method to identify independently moving objects acquired with an event-based camera.
The method performs on par with or better than the state of the art without having to predetermine the number of expected moving objects.
arXiv Detail & Related papers (2020-12-16T04:06:02Z)