Temporal Feature Alignment and Mutual Information Maximization for
Video-Based Human Pose Estimation
- URL: http://arxiv.org/abs/2203.15227v1
- Date: Tue, 29 Mar 2022 04:29:16 GMT
- Title: Temporal Feature Alignment and Mutual Information Maximization for
Video-Based Human Pose Estimation
- Authors: Zhenguang Liu, Runyang Feng, Haoming Chen, Shuang Wu, Yixing Gao,
Yunjun Gao, Xiang Wang
- Abstract summary: We present a novel hierarchical alignment framework for multi-frame human pose estimation.
We rank No.1 in the Multi-frame Person Pose Estimation Challenge on benchmark dataset PoseTrack 2017, and obtain state-of-the-art performance on benchmarks Sub-JHMDB and PoseTrack 2018.
- Score: 38.571715193347366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-frame human pose estimation has long been a compelling and fundamental
problem in computer vision. This task is challenging due to fast motion and
pose occlusion that frequently occur in videos. State-of-the-art methods strive
to incorporate additional visual evidence from neighboring frames (supporting
frames) to facilitate the pose estimation of the current frame (key frame). One
aspect that has been overlooked so far is that current methods directly
aggregate unaligned contexts across frames. The spatial misalignment between
pose features of the current frame and neighboring frames might lead to
unsatisfactory results. More importantly, existing approaches build upon the
straightforward pose estimation loss, which unfortunately cannot constrain the
network to fully leverage useful information from neighboring frames. To tackle
these problems, we present a novel hierarchical alignment framework, which
leverages coarse-to-fine deformations to progressively update a neighboring
frame to align with the current frame at the feature level. We further propose
to explicitly supervise the knowledge extraction from neighboring frames,
guaranteeing that useful complementary cues are extracted. To achieve this
goal, we theoretically analyze the mutual information between the frames and
arrive at a loss that maximizes the task-relevant mutual information. These
allow us to rank No.1 in the Multi-frame Person Pose Estimation Challenge on
benchmark dataset PoseTrack 2017, and obtain state-of-the-art performance on
benchmarks Sub-JHMDB and PoseTrack 2018. Our code is released at
https://github.com/Pose-Group/FAMI-Pose, hoping that it will be useful to the
community.
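To make the two core ideas above concrete, the following is a minimal PyTorch sketch, assuming a heatmap-based pose backbone whose features have shape (N, C, H, W); it is not the released FAMI-Pose implementation. A hypothetical CoarseToFineAlign module warps a supporting frame's features toward the key frame with a coarse offset field refined by a fine residual, and an InfoNCE-style contrastive objective stands in for the paper's task-relevant mutual-information loss. All module and function names here are illustrative assumptions.

```python
# Minimal sketch (not the released FAMI-Pose code): hypothetical coarse-to-fine
# feature alignment of a supporting frame to the key frame, plus an InfoNCE-style
# lower bound standing in for the task-relevant mutual-information objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(feat, flow):
    """Backward-warp features (N,C,H,W) with a per-pixel flow (N,2,H,W) in pixels."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat.device)      # (2,H,W), x first
    coords = base.unsqueeze(0) + flow                                # (N,2,H,W)
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0                    # normalize to [-1,1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1), align_corners=True)

class CoarseToFineAlign(nn.Module):
    """Predict a coarse offset field at 1/4 resolution, then a fine residual offset."""
    def __init__(self, channels):
        super().__init__()
        self.coarse = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)
        self.fine = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)

    def forward(self, key_feat, sup_feat):
        k4, s4 = F.avg_pool2d(key_feat, 4), F.avg_pool2d(sup_feat, 4)
        flow = 4.0 * F.interpolate(self.coarse(torch.cat([k4, s4], 1)),
                                   scale_factor=4, mode="bilinear", align_corners=True)
        sup_coarse = warp(sup_feat, flow)                              # coarse alignment
        flow = flow + self.fine(torch.cat([key_feat, sup_coarse], 1))  # fine refinement
        return warp(sup_feat, flow)

def info_nce_mi_loss(key_feat, aligned_sup, temperature=0.07):
    """InfoNCE lower bound on the mutual information between key-frame features and
    aligned supporting-frame features; matching spatial positions are positives."""
    n, c, h, w = key_feat.shape
    q = F.normalize(key_feat.flatten(2).transpose(1, 2).reshape(-1, c), dim=1)
    k = F.normalize(aligned_sup.flatten(2).transpose(1, 2).reshape(-1, c), dim=1)
    logits = q @ k.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)   # minimizing this maximizes the MI bound

# Usage sketch: align a supporting frame, then add the MI term to the heatmap loss.
key, sup = torch.randn(2, 48, 32, 24), torch.randn(2, 48, 32, 24)
aligned = CoarseToFineAlign(48)(key, sup)
total_loss = info_nce_mi_loss(key, aligned)   # + standard pose heatmap loss
```

In this sketch, minimizing the cross-entropy over matching spatial positions tightens a standard lower bound on the mutual information between key-frame and aligned supporting-frame features, which is the spirit of the explicit supervision described in the abstract.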
Related papers
- Video Dynamics Prior: An Internal Learning Approach for Robust Video
Enhancements [83.5820690348833]
We present a framework for low-level vision tasks that does not require any external training data corpus.
Our approach learns the weights of neural modules by optimizing over the corrupted test sequence, leveraging the spatio-temporal coherence and internal statistics of the video.
arXiv Detail & Related papers (2023-12-13T01:57:11Z) - RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GAN to real images.
Existing methods invert video frames individually, often leading to undesired inconsistent results over time.
We propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID).
Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv Detail & Related papers (2023-08-11T12:17:24Z) - Video Frame Interpolation with Densely Queried Bilateral Correlation [52.823751291070906]
Video Frame Interpolation (VFI) aims to synthesize non-existent intermediate frames between existent frames.
Flow-based VFI algorithms estimate intermediate motion fields to warp the existent frames.
We propose Densely Queried Bilateral Correlation (DQBC) that gets rid of the receptive field dependency problem.
arXiv Detail & Related papers (2023-04-26T14:45:09Z) - Kinematic-aware Hierarchical Attention Network for Human Pose Estimation
in Videos [17.831839654593452]
Previous video-based human pose estimation methods have shown promising results by leveraging features of consecutive frames.
Most approaches compromise accuracy to mitigate jitter or do not sufficiently comprehend the temporal aspects of human motion.
We design an architecture that exploits kinematic keypoint features.
arXiv Detail & Related papers (2022-11-29T01:46:11Z) - Alignment-guided Temporal Attention for Video Action Recognition [18.5171795689609]
We show that frame-by-frame alignments have the potential to increase the mutual information between frame representations.
We propose Alignment-guided Temporal Attention (ATA) to extend 1-dimensional temporal attention with parameter-free patch-level alignments between neighboring frames (a minimal sketch of this align-then-attend idea appears after this list).
arXiv Detail & Related papers (2022-09-30T23:10:47Z) - OTPose: Occlusion-Aware Transformer for Pose Estimation in
Sparsely-Labeled Videos [21.893572076171527]
We propose a method that leverages an attention mask for occluded joints and encodes temporal dependency between frames using transformers.
We achieve state-of-the-art pose estimation results for PoseTrack 2017 and PoseTrack 2018 datasets.
arXiv Detail & Related papers (2022-07-20T08:06:06Z) - Exploring Motion Ambiguity and Alignment for High-Quality Video Frame
Interpolation [46.02120172459727]
We propose to relax the requirement of reconstructing an intermediate frame as close to the ground-truth (GT) as possible.
We develop a texture consistency loss (TCL) upon the assumption that the interpolated content should maintain similar structures with their counterparts in the given frames.
arXiv Detail & Related papers (2022-03-19T10:37:06Z) - TimeLens: Event-based Video Frame Interpolation [54.28139783383213]
We introduce Time Lens, a novel method that leverages the advantages of both synthesis-based and flow-based approaches.
We show an up to 5.21 dB improvement in terms of PSNR over state-of-the-art frame-based and event-based methods.
arXiv Detail & Related papers (2021-06-14T10:33:47Z) - Learning to Estimate Hidden Motions with Global Motion Aggregation [71.12650817490318]
Occlusions pose a significant challenge to optical flow algorithms that rely on local evidence.
We introduce a global motion aggregation module to find long-range dependencies between pixels in the first image.
We demonstrate that the optical flow estimates in the occluded regions can be significantly improved without damaging the performance in non-occluded regions.
arXiv Detail & Related papers (2021-04-06T10:32:03Z) - Deep Dual Consecutive Network for Human Pose Estimation [44.41818683253614]
We propose a novel multi-frame human pose estimation framework, leveraging abundant temporal cues between video frames to facilitate keypoint detection.
Our method ranks No.1 in the Multi-frame Person Pose Estimation Challenge on the large-scale benchmark datasets PoseTrack 2017 and PoseTrack 2018.
arXiv Detail & Related papers (2021-03-12T13:11:27Z)
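The alignment-before-fusion theme in the Alignment-guided Temporal Attention entry above closely mirrors this paper's motivation, so here is a minimal, hypothetical PyTorch sketch of parameter-free patch-level alignment followed by 1-dimensional temporal attention. It is not the ATA authors' implementation; the greedy matching rule, shapes, and names are assumptions.

```python
# Hypothetical sketch of an align-then-attend scheme: parameter-free patch-level
# alignment between neighboring frames, followed by temporal attention. Not the
# ATA authors' code; shapes and the greedy matching rule are assumptions.
import torch
import torch.nn.functional as F

def align_patches(ref, nbr):
    """For each reference patch, pick the most cosine-similar patch in the
    neighboring frame (no learned parameters). ref, nbr: (P, C) patch features."""
    sim = F.normalize(ref, dim=1) @ F.normalize(nbr, dim=1).t()   # (P, P) similarities
    return nbr[sim.argmax(dim=1)]                                 # (P, C), re-ordered

def temporal_attention(frames):
    """Dot-product attention along the temporal axis, with the key frame as query.
    frames: (T, P, C) patch features already aligned to frame 0."""
    q = frames[0:1]                                   # (1, P, C)
    scores = torch.einsum("qpc,tpc->tp", q, frames)   # per-patch, per-frame similarity
    weights = scores.softmax(dim=0).unsqueeze(-1)     # (T, P, 1)
    return (weights * frames).sum(dim=0)              # (P, C) fused representation

# Usage: align each supporting frame to the key frame, then fuse along time.
T, P, C = 3, 196, 384                                 # e.g. 14x14 patches, 384-dim tokens
feats = torch.randn(T, P, C)
aligned = torch.stack([feats[0]] +
                      [align_patches(feats[0], feats[t]) for t in range(1, T)])
fused = temporal_attention(aligned)                   # feed into the downstream head
```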
This list is automatically generated from the titles and abstracts of the papers in this site.