OTPose: Occlusion-Aware Transformer for Pose Estimation in
Sparsely-Labeled Videos
- URL: http://arxiv.org/abs/2207.09725v1
- Date: Wed, 20 Jul 2022 08:06:06 GMT
- Title: OTPose: Occlusion-Aware Transformer for Pose Estimation in
Sparsely-Labeled Videos
- Authors: Kyung-Min Jin, Gun-Hee Lee and Seong-Whan Lee
- Abstract summary: We propose a method that leverages an attention mask for occluded joints and encodes temporal dependency between frames using transformers.
We achieve state-of-the-art pose estimation results for PoseTrack 2017 and PoseTrack 2018 datasets.
- Score: 21.893572076171527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although many approaches for multi-human pose estimation in videos have shown
profound results, they require densely annotated data, which entails excessive
manual labor. Furthermore, occlusion and motion blur inevitably lead to poor
estimation performance. To address these problems, we propose a
method that leverages an attention mask for occluded joints and encodes
temporal dependency between frames using transformers. First, our framework
composes different combinations of sparsely annotated frames that denote the
track of the overall joint movement. We propose an occlusion attention mask
from these combinations that enables encoding occlusion-aware heatmaps as a
semi-supervised task. Second, the proposed temporal encoder employs a transformer
architecture to effectively aggregate the temporal relationship and
keypoint-wise attention from each time step and accurately refines the target
frame's final pose estimation. We achieve state-of-the-art pose estimation
results on the PoseTrack 2017 and PoseTrack 2018 datasets and demonstrate the
robustness of our approach to occlusion and motion blur in sparsely annotated
video data.
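The two mechanisms in the abstract, an occlusion attention mask over joints and a transformer temporal encoder that aggregates keypoint-wise attention across frames, can be illustrated with a short sketch. The authors' architecture is not reproduced here; the PyTorch code below is a minimal illustration under assumed shapes and hyper-parameters, and the `occluded` tensor, `OcclusionAwareTemporalEncoder`, and all dimensions are hypothetical.

```python
# Minimal sketch (not the authors' code): per-joint features from several
# sparsely sampled frames become tokens; an occlusion mask keeps occluded
# joints from serving as attention keys; a transformer encoder aggregates
# temporal context and a head refines the target frame's joint positions.
import torch
import torch.nn as nn

class OcclusionAwareTemporalEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)  # project per-joint heatmap features
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 2)        # (x, y) refinement per joint

    def forward(self, joint_feats, occluded):
        # joint_feats: (B, T, J, D) features for J joints over T frames
        # occluded:    (B, T, J) bool, True where a joint is judged occluded
        B, T, J, D = joint_feats.shape
        tokens = self.proj(joint_feats).reshape(B, T * J, D)
        # key_padding_mask: True entries are ignored as attention keys, so
        # occluded-joint tokens cannot contribute evidence to other tokens.
        out = self.encoder(tokens, src_key_padding_mask=occluded.reshape(B, T * J))
        return self.head(out.reshape(B, T, J, D)[:, -1])  # refine target (last) frame

# Toy usage: 2 clips, 5 frames, 17 joints; joint 5 marked occluded throughout.
feats = torch.randn(2, 5, 17, 256)
occ = torch.zeros(2, 5, 17, dtype=torch.bool)
occ[:, :, 5] = True
print(OcclusionAwareTemporalEncoder()(feats, occ).shape)  # torch.Size([2, 17, 2])
```

Masking occluded joints at attention time, rather than zeroing their features, still lets the encoder produce refined outputs for those joints from the unmasked tokens of neighboring frames.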
Related papers
- DATAP-SfM: Dynamic-Aware Tracking Any Point for Robust Structure from Motion in the Wild [85.03973683867797]
This paper proposes a concise, elegant, and robust pipeline to estimate smooth camera trajectories and obtain dense point clouds for casual videos in the wild.
We show that the proposed method achieves state-of-the-art performance in terms of camera pose estimation even in complex dynamic challenge scenes.
arXiv Detail & Related papers (2024-11-20T13:01:16Z) - Event-Based Frame Interpolation with Ad-hoc Deblurring [68.97825675372354]
We propose a general method for event-based frame interpolation that performs ad-hoc deblurring on input videos.
Our network consistently outperforms state-of-the-art methods on frame interpolation, single-image deblurring, and the joint task of interpolation and deblurring.
Our code and dataset will be made publicly available.
arXiv Detail & Related papers (2023-01-12T18:19:00Z) - Kinematic-aware Hierarchical Attention Network for Human Pose Estimation
in Videos [17.831839654593452]
Previous video-based human pose estimation methods have shown promising results by leveraging features of consecutive frames.
However, most approaches compromise accuracy to mitigate jitter and do not fully comprehend the temporal aspects of human motion.
We design an architecture that exploits kinematic keypoint features.
arXiv Detail & Related papers (2022-11-29T01:46:11Z) - Video Shadow Detection via Spatio-Temporal Interpolation Consistency
Training [31.115226660100294]
We propose a framework that feeds unlabeled video frames together with labeled images into the training of an image shadow detection network.
We then derive the spatial and temporal consistency constraints accordingly for enhancing generalization in the pixel-wise classification.
In addition, we design a Scale-Aware Network for multi-scale shadow knowledge learning in images.
arXiv Detail & Related papers (2022-06-17T14:29:51Z) - Temporal Feature Alignment and Mutual Information Maximization for
Video-Based Human Pose Estimation [38.571715193347366]
We present a novel hierarchical alignment framework for multi-frame human pose estimation.
We rank No.1 in the Multi-frame Person Pose Estimation Challenge on the PoseTrack 2017 benchmark, and obtain state-of-the-art performance on the Sub-JHMDB and PoseTrack 2018 benchmarks.
arXiv Detail & Related papers (2022-03-29T04:29:16Z) - MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose
Estimation in Video [75.23812405203778]
Recent solutions have been introduced to estimate 3D human pose from a 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation.
We propose MixSTE, which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to model inter-joint spatial correlation (a minimal sketch of this factorized design appears after this list).
In addition, the network output is extended from the central frame to all frames of the input video, improving the coherence between the input and output sequences.
arXiv Detail & Related papers (2022-03-02T04:20:59Z) - Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video frame interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with the self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video frame interpolation (see the windowed-attention sketch after this list).
In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z) - Learning to Estimate Hidden Motions with Global Motion Aggregation [71.12650817490318]
Occlusions pose a significant challenge to optical flow algorithms that rely on local evidence.
We introduce a global motion aggregation module to find long-range dependencies between pixels in the first image.
We demonstrate that the optical flow estimates in the occluded regions can be significantly improved without damaging the performance in non-occluded regions.
arXiv Detail & Related papers (2021-04-06T10:32:03Z) - Deep Dual Consecutive Network for Human Pose Estimation [44.41818683253614]
We propose a novel multi-frame human pose estimation framework, leveraging abundant temporal cues between video frames to facilitate keypoint detection.
Our method ranks No.1 in the Multi-frame Person Pose Estimation Challenge on the large-scale benchmark datasets PoseTrack 2017 and PoseTrack 2018.
arXiv Detail & Related papers (2021-03-12T13:11:27Z) - Motion-blurred Video Interpolation and Extrapolation [72.3254384191509]
We present a novel framework for deblurring, interpolating and extrapolating sharp frames from a motion-blurred video in an end-to-end manner.
To ensure temporal coherence across predicted frames and address potential temporal ambiguity, we propose a simple, yet effective flow-based rule.
arXiv Detail & Related papers (2021-03-04T12:18:25Z) - A Deep Temporal Fusion Framework for Scene Flow Using a Learnable Motion
Model and Occlusions [17.66624674542256]
We propose a novel data-driven approach for temporal fusion of scene flow estimates in a multi-frame setup.
A neural network then combines bi-directional scene flow estimates from a common reference frame, yielding a refined estimate.
This way, our approach provides a fast multi-frame extension for a variety of scene flow estimators, which outperforms the underlying dual-frame approaches.
arXiv Detail & Related papers (2020-11-03T10:14:11Z)
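Two of the entries above describe mechanisms concrete enough to sketch. For MixSTE, the central idea is factorized spatio-temporal attention: a spatial transformer layer mixes joints within each frame and a temporal layer mixes frames per joint. The code below is an illustrative assumption, not the authors' implementation; `MixedSTBlock` and every dimension are invented for the example.

```python
# Minimal sketch of MixSTE-style factorized attention: a spatial layer
# attends across joints within a frame, a temporal layer attends across
# frames for each joint. All shapes/hyper-parameters are assumptions.
import torch
import torch.nn as nn

class MixedSTBlock(nn.Module):
    def __init__(self, d_model=128, nhead=4):
        super().__init__()
        self.spatial = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.temporal = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, x):
        # x: (B, T, J, D) embedded 2D keypoints over T frames and J joints
        B, T, J, D = x.shape
        x = self.spatial(x.reshape(B * T, J, D)).reshape(B, T, J, D)  # inter-joint
        x = x.transpose(1, 2).reshape(B * J, T, D)                    # per-joint sequences
        x = self.temporal(x).reshape(B, J, T, D).transpose(1, 2)      # per-joint motion
        return x  # seq2seq: every frame keeps a refined representation

x = torch.randn(2, 9, 17, 128)   # toy batch: 9-frame clips, 17 joints
print(MixedSTBlock()(x).shape)   # torch.Size([2, 9, 17, 128])
```

For the Video Frame Interpolation Transformer entry, the cost-saving device is local (windowed) self-attention: attention is computed inside non-overlapping win x win windows, so the cost scales with the window size rather than the whole frame. Again a hedged sketch; `window_attention` and all sizes are hypothetical.

```python
# Minimal sketch of windowed self-attention: tokens attend only within
# non-overlapping win x win windows, so the attention matrix per window has
# win^2 x win^2 entries instead of (H*W) x (H*W) for the full frame.
import torch
import torch.nn as nn

def window_attention(x, attn, win=8):
    # x: (B, H, W, D) feature map with H and W divisible by win
    B, H, W, D = x.shape
    x = x.reshape(B, H // win, win, W // win, win, D)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, D)  # one token set per window
    out, _ = attn(x, x, x)                                     # self-attention per window
    out = out.reshape(B, H // win, W // win, win, win, D)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, D)

attn = nn.MultiheadAttention(64, num_heads=4, batch_first=True)
x = torch.randn(2, 32, 32, 64)
print(window_attention(x, attn).shape)  # torch.Size([2, 32, 32, 64])
```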