Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization
- URL: http://arxiv.org/abs/2108.02183v1
- Date: Wed, 4 Aug 2021 17:16:18 GMT
- Title: Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization
- Authors: Rui Qian, Yuxi Li, Huabin Liu, John See, Shuangrui Ding, Xian Liu, Dian Li, Weiyao Lin
- Abstract summary: This paper proposes a multi-level feature optimization framework to improve the generalization and temporal modeling ability of learned video representations.
Experiments demonstrate that multi-level feature optimization with the graph constraint and temporal modeling can greatly improve the representation ability in video understanding.
- Score: 30.670109727802494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The crux of self-supervised video representation learning is to build general
features from unlabeled videos. However, most recent works have mainly focused
on high-level semantics and neglected lower-level representations and their
temporal relationship which are crucial for general video understanding. To
address these challenges, this paper proposes a multi-level feature
optimization framework to improve the generalization and temporal modeling
ability of learned video representations. Concretely, high-level features
obtained from naive and prototypical contrastive learning are utilized to build
distribution graphs, guiding the process of low-level and mid-level feature
learning. We also devise a simple temporal modeling module from multi-level
features to enhance motion pattern learning. Experiments demonstrate that
multi-level feature optimization with the graph constraint and temporal
modeling can greatly improve the representation ability in video understanding.
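As a rough illustration of the idea described in the abstract, the sketch below combines a naive contrastive (InfoNCE) objective on high-level features with a distribution-graph constraint that transfers their batch-similarity structure to low- and mid-level features. This is a minimal, hypothetical PyTorch sketch: the function names, the KL-divergence form of the graph constraint, and the equal loss weighting are illustrative assumptions, not the authors' released implementation (which also uses prototypical contrastive learning and a temporal modeling module not shown here).

```python
# Hypothetical sketch of graph-constrained multi-level contrastive learning.
# All names and loss forms are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def info_nce(query, key, temperature=0.07):
    """Naive contrastive (InfoNCE) loss over a batch of paired features."""
    query = F.normalize(query, dim=1)
    key = F.normalize(key, dim=1)
    logits = query @ key.t() / temperature              # (B, B) similarity matrix
    labels = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, labels)

def distribution_graph(features, temperature=0.1):
    """Build a row-normalized soft similarity graph over a batch of features."""
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature          # (B, B) pairwise similarities
    return F.softmax(sim, dim=1)

def graph_constraint_loss(lower_level, high_level):
    """Encourage the lower-level feature graph to match the high-level graph
    (KL divergence between the two batch-similarity distributions)."""
    g_low = distribution_graph(lower_level)
    g_high = distribution_graph(high_level).detach()     # high-level graph guides, no grad
    return F.kl_div(g_low.log(), g_high, reduction="batchmean")

# Usage sketch: a contrastive objective on high-level features plus graph
# constraints that propagate their structure to mid- and low-level features.
B, D = 8, 128
high_q, high_k = torch.randn(B, D), torch.randn(B, D)    # two augmented views of a clip
mid_feat, low_feat = torch.randn(B, D), torch.randn(B, D)

loss = (info_nce(high_q, high_k)
        + graph_constraint_loss(mid_feat, high_q)
        + graph_constraint_loss(low_feat, high_q))
```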
Related papers
- Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z)
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z)
- Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives [30.666823939595627]
This paper reconsiders the multi-modal alignment problem in VideoQA from feature and sample perspectives.
We adopt a heterogeneous graph architecture and design a hierarchical framework to align both trajectory-level and frame-level visual features with language features.
Our method outperforms all the state-of-the-art models on the challenging NExT-QA benchmark.
arXiv Detail & Related papers (2022-04-25T10:42:07Z)
- TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL)
Our TCGL integrates prior knowledge about frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv Detail & Related papers (2021-12-07T09:27:56Z)
- Towards Modality Transferable Visual Information Representation with Optimal Model Compression [67.89885998586995]
We propose a new scheme for visual signal representation that leverages the philosophy of transferable modality.
The proposed framework is implemented on the state-of-the-art video coding standard.
arXiv Detail & Related papers (2020-08-13T01:52:40Z)
- Hierarchical Contrastive Motion Learning for Video Action Recognition [100.9807616796383]
We present hierarchical contrastive motion learning, a new self-supervised learning framework to extract effective motion representations from raw video frames.
Our approach progressively learns a hierarchy of motion features that correspond to different abstraction levels in a network.
Our motion learning module is lightweight and flexible enough to be embedded into various backbone networks.
arXiv Detail & Related papers (2020-07-20T17:59:22Z)