Probabilistic Representations for Video Contrastive Learning
- URL: http://arxiv.org/abs/2204.03946v1
- Date: Fri, 8 Apr 2022 09:09:30 GMT
- Title: Probabilistic Representations for Video Contrastive Learning
- Authors: Jungin Park, Jiyoung Lee, Ig-Jae Kim, Kwanghoon Sohn
- Abstract summary: This paper presents a self-supervised representation learning method that bridges contrastive learning with probabilistic representation.
By sampling embeddings from the whole video distribution, we can avoid the careful sampling strategies or transformations otherwise needed to generate augmented views of the clips.
- Score: 64.47354178088784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents Probabilistic Video Contrastive Learning, a
self-supervised representation learning method that bridges contrastive
learning with probabilistic representation. We hypothesize that the clips
composing a video each follow different short-term distributions, but that
their combination in a common embedding space can represent the complex
distribution of the whole video. Thus, the proposed method represents video
clips as normal distributions and combines them into a Mixture of Gaussians
to model the whole video distribution. By sampling embeddings from this
video distribution, we can avoid the careful sampling strategies or
transformations otherwise needed to generate augmented views of the clips,
unlike previous deterministic methods that have focused mainly on such
sample generation strategies for contrastive learning. We further propose a
stochastic contrastive loss to learn proper video distributions and handle
the inherent uncertainty in raw video. Experimental results verify that our
probabilistic embedding achieves state-of-the-art video representation
learning for action recognition and video retrieval on the most popular
benchmarks, including UCF101 and HMDB51.
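
To make the pipeline above concrete, the following is a minimal sketch, not the authors' released code, of the three ingredients the abstract describes: a clip encoder that outputs the mean and log-variance of a Gaussian embedding per clip, a Mixture of Gaussians over the clips that approximates the whole video distribution (here with assumed uniform mixture weights, p(z|v) ≈ (1/K) Σ_k N(z; μ_k, diag(σ_k²))), and an InfoNCE-style stochastic contrastive loss computed on embeddings sampled from that mixture. The names (ProbClipEncoder, sample_from_video_mixture, stochastic_nce), the feature dimensions, and the exact loss form are illustrative assumptions rather than the paper's specification.

```python
# Minimal sketch of probabilistic clip embeddings, a Gaussian-mixture video
# distribution, and a stochastic InfoNCE-style loss. All names and sizes are
# hypothetical; this is not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProbClipEncoder(nn.Module):
    """Maps clip features to the mean and log-variance of a Gaussian embedding."""

    def __init__(self, feat_dim: int = 512, embed_dim: int = 128):
        super().__init__()
        self.mu_head = nn.Linear(feat_dim, embed_dim)
        self.logvar_head = nn.Linear(feat_dim, embed_dim)

    def forward(self, clip_feats: torch.Tensor):
        # clip_feats: (batch, num_clips, feat_dim), e.g. from a 3D-CNN backbone
        return self.mu_head(clip_feats), self.logvar_head(clip_feats)


def sample_from_video_mixture(mu, logvar, num_samples: int):
    """Draw embeddings from a uniform mixture of the per-clip Gaussians.

    mu, logvar: (batch, num_clips, embed_dim)
    returns:    (batch, num_samples, embed_dim)
    """
    b, k, d = mu.shape
    # Pick which Gaussian component (clip) each sample comes from.
    comp = torch.randint(0, k, (b, num_samples), device=mu.device)
    idx = comp.unsqueeze(-1).expand(-1, -1, d)
    mu_s = torch.gather(mu, 1, idx)
    std_s = torch.gather((0.5 * logvar).exp(), 1, idx)
    # Reparameterization trick keeps sampling differentiable w.r.t. mu and std.
    eps = torch.randn_like(mu_s)
    return mu_s + eps * std_s


def stochastic_nce(samples_a, samples_b, temperature: float = 0.1):
    """InfoNCE over sampled embeddings: draws from the same video are positives.

    samples_a, samples_b: (batch, num_samples, embed_dim), two draws per video.
    """
    b, n, d = samples_a.shape
    za = F.normalize(samples_a.reshape(b * n, d), dim=-1)
    zb = F.normalize(samples_b.reshape(b * n, d), dim=-1)
    logits = za @ zb.t() / temperature                # (b*n, b*n)
    # Positive pairs: sample i in view A matches sample i in view B.
    targets = torch.arange(b * n, device=za.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    encoder = ProbClipEncoder()
    clip_feats = torch.randn(4, 8, 512)               # 4 videos x 8 clips each
    mu, logvar = encoder(clip_feats)
    za = sample_from_video_mixture(mu, logvar, num_samples=16)
    zb = sample_from_video_mixture(mu, logvar, num_samples=16)
    loss = stochastic_nce(za, zb)
    loss.backward()
```

The reparameterized sampling keeps the draw differentiable, so gradients from the contrastive loss reach the distribution parameters; this is what lets the encoder learn distributions rather than deterministic point embeddings.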
Related papers
- Optical-Flow Guided Prompt Optimization for Coherent Video Generation [51.430833518070145]
We propose a framework called MotionPrompt that guides the video generation process via optical flow.
We optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs.
This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content.
arXiv Detail & Related papers (2024-11-23T12:26:52Z) - Conditional Modeling Based Automatic Video Summarization [70.96973928590958]
The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story.
Existing video summarization methods rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully capture the content of the video.
A new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries.
arXiv Detail & Related papers (2023-11-20T20:24:45Z) - Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and query.
Evaluation on an existing multi-modal video summarization dataset shows that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z) - Spatio-Temporal Crop Aggregation for Video Representation Learning [33.296154476701055]
Our model builds long-range video features by learning from sets of video clip-level features extracted with a pre-trained backbone.
We demonstrate that our video representation yields state-of-the-art performance with linear, non-linear, and $k$-NN probing on common action classification datasets.
arXiv Detail & Related papers (2022-11-30T14:43:35Z) - Suppressing Static Visual Cues via Normalizing Flows for Self-Supervised Video Representation Learning [7.27708818665289]
We propose a novel method to suppress static visual cues (SSVC) based on probabilistic analysis for self-supervised video representation learning.
By modelling static factors in a video as a random variable, the conditional distribution of each latent variable becomes a shifted and scaled normal distribution.
Finally, positive pairs are constructed from motion-preserved videos for contrastive learning to alleviate representation bias towards static cues.
arXiv Detail & Related papers (2021-12-07T16:21:22Z) - Self-Supervised Video Representation Learning by Video Incoherence Detection [28.540645395066434]
This paper introduces a novel self-supervised method that leverages incoherence detection for video representation learning.
It stems from the observation that the human visual system can easily identify video incoherence based on a comprehensive understanding of videos.
arXiv Detail & Related papers (2021-09-26T04:58:13Z) - CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z) - Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about the specified spatial or temporal augmentations, and in doing so also achieves state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z) - Self-supervised Video Representation Learning by Pace Prediction [48.029602040786685]
This paper addresses the problem of self-supervised video representation learning from a new perspective -- by video pace prediction.
It stems from the observation that the human visual system is sensitive to video pace.
We randomly sample training clips at different paces and ask a neural network to identify the pace of each clip (a minimal sketch of this pretext task follows the list below).
arXiv Detail & Related papers (2020-08-13T12:40:24Z)
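
As referenced in the pace-prediction entry above, here is a minimal sketch of that pretext task, assuming a simple set of playback-speed factors and temporal subsampling to create paced clips. The pace set, clip length, and PaceClassifier head are illustrative assumptions, and the video backbone is omitted; this is not the authors' implementation.

```python
# Minimal sketch of pace prediction as a pretext task: sample clips at random
# playback speeds and train a classifier to recover the speed.
import torch
import torch.nn as nn
import torch.nn.functional as F

PACES = [1, 2, 4, 8]  # assumed playback-speed factors; one class per pace


def sample_paced_clip(video: torch.Tensor, clip_len: int = 16):
    """Sample a clip at a random pace by temporal subsampling.

    video: (channels, num_frames, height, width)
    returns: (clip, pace_label)
    """
    c, t, h, w = video.shape
    label = torch.randint(0, len(PACES), (1,)).item()
    pace = PACES[label]
    max_start = t - clip_len * pace
    start = torch.randint(0, max(max_start, 1), (1,)).item()
    idx = torch.arange(start, start + clip_len * pace, pace)[:clip_len]
    idx = idx.clamp(max=t - 1)          # guard against short videos
    return video[:, idx], label


class PaceClassifier(nn.Module):
    """Backbone feature -> pace class logits (backbone omitted for brevity)."""

    def __init__(self, feat_dim: int = 512, num_paces: int = len(PACES)):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_paces)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        return self.head(clip_feats)


# Usage: encode paced clips with a video backbone, then train with cross-entropy.
model = PaceClassifier()
feats = torch.randn(8, 512)                      # features of 8 paced clips
labels = torch.randint(0, len(PACES), (8,))
loss = F.cross_entropy(model(feats), labels)
```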
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.