Self-Supervised Video Representation Learning in a Heuristic Decoupled Perspective
- URL: http://arxiv.org/abs/2407.14069v1
- Date: Fri, 19 Jul 2024 06:53:54 GMT
- Title: Self-Supervised Video Representation Learning in a Heuristic Decoupled Perspective
- Authors: Zeen Song, Jingyao Wang, Jianqi Zhang, Changwen Zheng, Wenwen Qiang
- Abstract summary: We propose "Bi-level Optimization of Learning Dynamic with Decoupling and Intervention" (BOLD-DI) to capture both static and dynamic semantics in a decoupled manner.
Our method can be seamlessly integrated into existing v-CL methods, and experimental results highlight significant improvements.
- Score: 10.938290904843939
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video contrastive learning (v-CL) has gained prominence as a leading framework for unsupervised video representation learning, showcasing impressive performance across various tasks such as action classification and detection. In the field of video representation learning, a feature extractor should ideally capture both static and dynamic semantics. However, our series of experiments reveals that existing v-CL methods predominantly capture static semantics, with limited capturing of dynamic semantics. Through causal analysis, we identify the root cause: the v-CL objective lacks explicit modeling of dynamic features, the measurement of dynamic similarity is confounded by static semantics, and the measurement of static similarity is confounded by dynamic semantics. In response, we propose "Bi-level Optimization of Learning Dynamic with Decoupling and Intervention" (BOLD-DI) to capture both static and dynamic semantics in a decoupled manner. Our method can be seamlessly integrated into existing v-CL methods, and experimental results highlight significant improvements.
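The decoupling idea can be made concrete with a short sketch. The PyTorch snippet below is purely illustrative: it contrasts a frame-averaged "static" view and a temporal-difference "dynamic" view with separate InfoNCE losses, so that neither similarity measure is confounded by the other kind of semantics. All function and variable names are hypothetical, and the sketch omits BOLD-DI's bi-level optimization and causal intervention.

```python
# Minimal, illustrative sketch of decoupled static/dynamic contrastive
# objectives (hypothetical names; not BOLD-DI's actual implementation).
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.1):
    """Standard InfoNCE between two batches of embeddings."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def decoupled_vcl_loss(frame_feats_a, frame_feats_b):
    """frame_feats_*: (B, T, D) per-frame features of two clip augmentations."""
    # Static view: temporal average pooling keeps appearance, discards motion.
    static_a = frame_feats_a.mean(dim=1)
    static_b = frame_feats_b.mean(dim=1)
    # Dynamic view: frame-to-frame differences keep motion, suppress appearance.
    dyn_a = (frame_feats_a[:, 1:] - frame_feats_a[:, :-1]).mean(dim=1)
    dyn_b = (frame_feats_b[:, 1:] - frame_feats_b[:, :-1]).mean(dim=1)
    # Decoupled objectives: each similarity is measured on its own view only.
    return info_nce(static_a, static_b) + info_nce(dyn_a, dyn_b)
```

In the actual method the two objectives are handled at different levels of a bi-level optimization problem rather than simply summed as above.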
Related papers
- UniLearn: Enhancing Dynamic Facial Expression Recognition through Unified Pre-Training and Fine-Tuning on Images and Videos [83.48170683672427]
UniLearn is a unified learning paradigm that integrates static facial expression recognition data to enhance the dynamic facial expression recognition (DFER) task.
UniLearn consistently achieves state-of-the-art performance on the FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65%, 58.44%, and 76.68%, respectively.
arXiv Detail & Related papers (2024-09-10T01:57:57Z)
- Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM [17.661231232206028]
Simultaneous localization and mapping (SLAM) with implicit neural representations has received extensive attention.
We propose a novel SLAM framework for dynamic environments.
arXiv Detail & Related papers (2024-07-18T09:35:48Z)
- Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation [126.12940972028012]
We present HVC, a framework for self-supervised video object segmentation.
HVC extracts pseudo-dynamic signals from static images, enabling an efficient and scalable VOS model.
We propose a hybrid visual correspondence loss to learn joint static and dynamic consistency representations.
arXiv Detail & Related papers (2024-04-21T02:21:30Z)
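As a rough illustration of the HVC entry above, the sketch below fakes inter-frame motion from a single static image via a random shift, then combines a static consistency term with a pseudo-dynamic one. Every name here is hypothetical, and this is not HVC's actual hybrid visual correspondence loss.

```python
# Hedged sketch of "pseudo-dynamic signals from static images"
# (illustrative only; not the paper's implementation).
import torch
import torch.nn.functional as F

def pseudo_frame(img, max_shift=8):
    """Roll the image by a random offset to fake inter-frame motion."""
    dx, dy = torch.randint(-max_shift, max_shift + 1, (2,))
    return torch.roll(img, shifts=(int(dy), int(dx)), dims=(-2, -1))

def hybrid_consistency_loss(encoder, img, augment):
    z_anchor = encoder(img)
    z_static = encoder(augment(img))        # appearance under augmentation
    z_dynamic = encoder(pseudo_frame(img))  # appearance under pseudo-motion
    static_term = 1 - F.cosine_similarity(z_anchor, z_static).mean()
    dynamic_term = 1 - F.cosine_similarity(z_anchor, z_dynamic).mean()
    return static_term + dynamic_term
```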
- Prompt-Driven Dynamic Object-Centric Learning for Single Domain Generalization [61.64304227831361]
Single-domain generalization aims to learn a model from a single source domain that generalizes to unseen target domains.
We propose a dynamic object-centric perception network based on prompt learning, aiming to adapt to the variations in image complexity.
arXiv Detail & Related papers (2024-02-28T16:16:51Z)
- DytanVO: Joint Refinement of Visual Odometry and Motion Segmentation in Dynamic Environments [6.5121327691369615]
We present DytanVO, the first supervised learning-based VO method that deals with dynamic environments.
Our method achieves an average improvement of 27.7% in ATE over state-of-the-art VO solutions in real-world dynamic environments.
arXiv Detail & Related papers (2022-09-17T23:56:03Z)
- STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding [68.96574451918458]
We propose a framework named STVGFormer, which models visual-linguistic dependencies with a static branch and a dynamic branch.
Both the static and dynamic branches are designed as cross-modal transformers.
Our proposed method achieved 39.6% vIoU and won first place in the HC-STVG track of the Person in Context Challenge.
arXiv Detail & Related papers (2022-07-06T15:48:58Z)
- Multi-modal Visual Place Recognition in Dynamics-Invariant Perception Space [23.43468556831308]
This letter explores the use of multi-modal fusion of semantic and visual modalities to improve place recognition in dynamic environments.
We achieve this by first designing a novel deep learning architecture to generate the static semantic segmentation.
We then leverage the spatial-pyramid-matching model to encode the static semantic segmentation into feature vectors.
In parallel, the static image is encoded using the popular Bag-of-words model.
arXiv Detail & Related papers (2021-05-17T13:14:52Z)
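To make the encoding step of the place-recognition entry above concrete, here is an illustrative (not the paper's) spatial-pyramid descriptor: class-label histograms over grid cells at several pyramid levels, concatenated into a single vector that can be compared across places.

```python
# Illustrative spatial-pyramid-matching encoding of a segmentation map
# (a sketch of the classic SPM idea, not the paper's code).
import numpy as np

def spatial_pyramid_descriptor(seg, num_classes, levels=(1, 2, 4)):
    """seg: (H, W) integer array of semantic class labels."""
    h, w = seg.shape
    chunks = []
    for g in levels:                      # g x g grid at each pyramid level
        for i in range(g):
            for j in range(g):
                cell = seg[i * h // g:(i + 1) * h // g,
                           j * w // g:(j + 1) * w // g]
                hist = np.bincount(cell.ravel(), minlength=num_classes)
                chunks.append(hist / max(cell.size, 1))  # normalized histogram
        # Classic SPM additionally down-weights coarser levels.
    return np.concatenate(chunks)
```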
- HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks [18.892883695539002]
HyperDynamics is a dynamics meta-learning framework that generates parameters of neural dynamics models.
It outperforms existing models that adapt to environment variations by learning dynamics over high-dimensional visual observations.
We show our method matches the performance of an ensemble of separately trained experts, while also generalizing well to unseen environment variations at test time.
arXiv Detail & Related papers (2021-03-17T04:48:43Z)
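The hypernetwork idea in the entry above can be sketched in a few lines: a small network maps a system embedding to the weights of a one-layer dynamics model, so a new environment only needs a new embedding rather than a fully retrained model. Dimensions and module names below are illustrative, not HyperDynamics' actual architecture.

```python
# Minimal hypernetwork sketch: embedding -> dynamics-model parameters
# (illustrative dimensions and names; unbatched for clarity).
import torch
import torch.nn as nn

class HyperDynamicsSketch(nn.Module):
    def __init__(self, embed_dim=32, state_dim=8, action_dim=2, hidden=64):
        super().__init__()
        self.in_dim, self.out_dim = state_dim + action_dim, state_dim
        self.n_w = self.in_dim * self.out_dim  # number of generated weights
        # Hypernetwork: embedding -> flat parameter vector (weights + biases).
        self.hyper = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, self.n_w + self.out_dim),
        )

    def forward(self, embedding, state, action):
        params = self.hyper(embedding)
        W = params[: self.n_w].view(self.out_dim, self.in_dim)
        b = params[self.n_w:]
        x = torch.cat([state, action], dim=-1)
        return x @ W.t() + b               # predicted next state
```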
- Hierarchical Contrastive Motion Learning for Video Action Recognition [100.9807616796383]
We present hierarchical contrastive motion learning, a new self-supervised learning framework to extract effective motion representations from raw video frames.
Our approach progressively learns a hierarchy of motion features that correspond to different abstraction levels in a network.
Our motion learning module is lightweight and can be flexibly embedded into various backbone networks.
arXiv Detail & Related papers (2020-07-20T17:59:22Z)
- Euclideanizing Flows: Diffeomorphic Reduction for Learning Stable Dynamical Systems [74.80320120264459]
We present an approach to learn stable robot motions from a limited number of human demonstrations.
The complex motions are encoded as rollouts of a stable dynamical system.
The efficacy of this approach is demonstrated through validation on an established benchmark as well as on demonstrations collected on a real-world robotic system.
arXiv Detail & Related papers (2020-05-27T03:51:57Z)
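For intuition about the entry above, the toy sketch below rolls out the simplest globally stable system, x' = -k(x - goal): every rollout converges to the goal regardless of the start state. Euclideanizing Flows learns a diffeomorphism that warps such simple stable dynamics to match complex demonstrated motions; the learned flow itself is omitted here.

```python
# Toy "motion as rollout of a stable dynamical system" sketch
# (illustrative only; the learned diffeomorphism is not shown).
import numpy as np

def rollout(x0, goal, k=2.0, dt=0.01, steps=500):
    """Integrate x' = -k (x - goal) with explicit Euler; returns trajectory."""
    goal = np.asarray(goal, dtype=float)
    traj = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        x = traj[-1]
        traj.append(x + dt * (-k * (x - goal)))
    return np.stack(traj)   # trajectory converging to the goal from x0

# Example: a 2-D motion from (1, 0) toward the origin.
path = rollout(x0=[1.0, 0.0], goal=[0.0, 0.0])
```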
This list is automatically generated from the titles and abstracts of the papers on this site.