Hierarchical Contrastive Motion Learning for Video Action Recognition
- URL: http://arxiv.org/abs/2007.10321v3
- Date: Mon, 17 Jan 2022 09:30:18 GMT
- Title: Hierarchical Contrastive Motion Learning for Video Action Recognition
- Authors: Xitong Yang, Xiaodong Yang, Sifei Liu, Deqing Sun, Larry Davis, Jan Kautz
- Abstract summary: We present hierarchical contrastive motion learning, a new self-supervised learning framework to extract effective motion representations from raw video frames.
Our approach progressively learns a hierarchy of motion features that correspond to different abstraction levels in a network.
Our motion learning module is lightweight and can be flexibly embedded into various backbone networks.
- Score: 100.9807616796383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One central question for video action recognition is how to model motion. In
this paper, we present hierarchical contrastive motion learning, a new
self-supervised learning framework to extract effective motion representations
from raw video frames. Our approach progressively learns a hierarchy of motion
features that correspond to different abstraction levels in a network. This
hierarchical design bridges the semantic gap between low-level motion cues and
high-level recognition tasks, and promotes the fusion of appearance and motion
information at multiple levels. At each level, an explicit motion
self-supervision is provided via contrastive learning to enforce the motion
features at the current level to predict the future ones at the previous level.
Thus, the motion features at higher levels are trained to gradually capture
semantic dynamics and become more discriminative for action recognition. Our
motion learning module is lightweight and can be flexibly embedded into
various backbone networks. Extensive experiments on four benchmarks show that
the
proposed approach consistently achieves superior results.
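To make the contrastive objective above concrete, here is a minimal, illustrative PyTorch sketch of an InfoNCE-style loss in which motion features at level l predict future motion features at level l-1, with the other samples in the batch serving as negatives. The names and shapes (`proj_head`, `temperature`, the feature dimensions) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def hierarchical_contrastive_loss(curr_level_feats, prev_level_future_feats,
                                  proj_head, temperature=0.07):
    """InfoNCE-style sketch: features at level l (time t) predict future
    features at level l-1 (time t+delta). Hypothetical shapes:
    curr_level_feats (B, D_l), prev_level_future_feats (B, D_{l-1});
    proj_head maps D_l -> D_{l-1}."""
    # Project current-level features into the previous level's feature space.
    pred = F.normalize(proj_head(curr_level_feats), dim=-1)      # (B, D)
    target = F.normalize(prev_level_future_feats, dim=-1)        # (B, D)

    # Score each prediction against every target in the batch; the
    # matching index is the positive pair, the rest act as negatives.
    logits = pred @ target.t() / temperature                     # (B, B)
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)

# Toy usage: level-2 features (256-d) predict level-1 future features (128-d).
proj = torch.nn.Linear(256, 128)
loss = hierarchical_contrastive_loss(torch.randn(8, 256),
                                     torch.randn(8, 128), proj)
```

Summing such a loss over all levels of the hierarchy would provide the progressive, level-by-level motion supervision the abstract describes.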
Related papers
- Continual Learning of Conjugated Visual Representations through Higher-order Motion Flows [21.17248975377718]
Learning with neural networks presents several challenges due to the non-i.i.d. nature of the data.
It also offers novel opportunities to develop representations that are consistent with the information flow.
In this paper we investigate the case of unsupervised continual learning of pixel-wise features subject to multiple motion-induced constraints.
arXiv Detail & Related papers (2024-09-16T19:08:32Z)
- Joint-Motion Mutual Learning for Pose Estimation in Videos [21.77871402339573]
Human pose estimation in videos has long been a compelling yet challenging task within the realm of computer vision.
Recent methods strive to integrate multi-frame visual features generated by a backbone network for pose estimation.
We propose a novel joint-motion mutual learning framework for pose estimation.
arXiv Detail & Related papers (2024-08-05T07:37:55Z)
- Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z)
- Semantics-aware Motion Retargeting with Vision-Language Models [19.53696208117539]
We present a novel Semantics-aware Motion reTargeting (SMT) method with the advantage of vision-language models to extract and maintain meaningful motion semantics.
We utilize a differentiable module to render 3D motions; the high-level motion semantics are incorporated into the retargeting process by feeding the rendered motions to the vision-language model and aligning the extracted semantic embeddings.
To ensure the preservation of fine-grained motion details and high-level semantics, we adopt a two-stage pipeline consisting of skeleton-aware pre-training and fine-tuning with semantics and geometry constraints.
arXiv Detail & Related papers (2023-12-04T15:23:49Z)
- DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that our DiverseMotion achieves the state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
- Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos [71.20376514273367]
We propose a unified point cloud video self-supervised learning framework for object-centric and scene-centric data.
Our method outperforms supervised counterparts on a wide range of downstream tasks.
arXiv Detail & Related papers (2023-08-18T02:17:47Z)
- CALM: Conditional Adversarial Latent Models for Directable Virtual Characters [71.66218592749448]
We present Conditional Adversarial Latent Models (CALM), an approach for generating diverse and directable behaviors for user-controlled interactive virtual characters.
Using imitation learning, CALM learns a representation of movement that captures the complexity of human motion, and enables direct control over character movements.
arXiv Detail & Related papers (2023-05-02T09:01:44Z)
- Contrast-reconstruction Representation Learning for Self-supervised Skeleton-based Action Recognition [18.667198945509114]
We propose a novel Contrast-Reconstruction Representation Learning network (CRRL).
It simultaneously captures postures and motion dynamics for unsupervised skeleton-based action recognition.
Experimental results on several benchmarks, i.e., NTU RGB+D 60, NTU RGB+D 120, CMU mocap, and NW-UCLA, demonstrate the promise of the proposed CRRL method.
arXiv Detail & Related papers (2021-11-22T08:45:34Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how to better distinguish between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization [30.670109727802494]
This paper proposes a multi-level feature optimization framework to improve the generalization and temporal modeling ability of learned video representations.
Experiments demonstrate that multi-level feature optimization with the graph constraint and temporal modeling can greatly improve the representation ability in video understanding.
arXiv Detail & Related papers (2021-08-04T17:16:18Z)