Broaden Your Views for Self-Supervised Video Learning
- URL: http://arxiv.org/abs/2103.16559v1
- Date: Tue, 30 Mar 2021 17:58:46 GMT
- Title: Broaden Your Views for Self-Supervised Video Learning
- Authors: Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang,
Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Patraucean,
Florent Altché, Michal Valko, Jean-Bastien Grill, Aäron van den Oord,
Andrew Zisserman
- Abstract summary: We introduce BraVe, a self-supervised learning framework for video.
In BraVe, one of the views has access to a narrow temporal window of the video, while the other view has broad access to the video content.
We demonstrate that BraVe achieves state-of-the-art results in self-supervised representation learning on standard video and audio classification benchmarks.
- Score: 97.52216510672251
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most successful self-supervised learning methods are trained to align the
representations of two independent views from the data. State-of-the-art
methods in video are inspired by image techniques, where these two views are
similarly extracted by cropping and augmenting the resulting crop. However,
these methods miss a crucial element in the video domain: time. We introduce
BraVe, a self-supervised learning framework for video. In BraVe, one of the
views has access to a narrow temporal window of the video while the other view
has broad access to the video content. Our models learn to generalise from
the narrow view to the general content of the video. Furthermore, BraVe
processes the views with different backbones, enabling the use of alternative
augmentations or modalities in the broad view, such as optical flow, randomly
convolved RGB frames, audio, or their combinations. We demonstrate that BraVe
achieves state-of-the-art results in self-supervised representation learning on
standard video and audio classification benchmarks including UCF101, HMDB51,
Kinetics, ESC-50 and AudioSet.
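To make the two-view setup concrete, below is a minimal PyTorch-style sketch of the broad/narrow idea described in the abstract: a narrow clip and a broad view are encoded by separate backbones, and each view learns to regress the other view's projected representation. The toy backbones, feature dimensions, clip shapes, and the simple BYOL/SimSiam-style cosine-regression loss are illustrative assumptions, not the paper's exact architecture or objective.

```python
# Minimal sketch of a broad/narrow two-view setup (PyTorch).
# Clip lengths, dims, dummy backbones and the regression loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def regression_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cosine regression of a predicted embedding onto a stop-gradient target."""
    return 2 - 2 * F.cosine_similarity(pred, target.detach(), dim=-1).mean()


class BraVeSketch(nn.Module):
    def __init__(self, narrow_backbone, broad_backbone, feat_dim=64, dim=128):
        super().__init__()
        self.narrow_backbone = narrow_backbone  # sees a short temporal window (RGB)
        self.broad_backbone = broad_backbone    # sees a long window, flow, or audio
        self.proj_narrow = nn.Linear(feat_dim, dim)
        self.proj_broad = nn.Linear(feat_dim, dim)
        # Predictors: each view learns to regress the other view's representation.
        self.pred_b_from_n = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.pred_n_from_b = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, narrow_clip, broad_view):
        z_n = self.proj_narrow(self.narrow_backbone(narrow_clip))
        z_b = self.proj_broad(self.broad_backbone(broad_view))
        # Generalise from the narrow view to the broad content, and vice versa.
        return (regression_loss(self.pred_b_from_n(z_n), z_b)
                + regression_loss(self.pred_n_from_b(z_b), z_n))


if __name__ == "__main__":
    # Toy backbones: flatten the clip and map it to a feature vector.
    narrow = nn.Sequential(nn.Flatten(), nn.LazyLinear(64))
    broad = nn.Sequential(nn.Flatten(), nn.LazyLinear(64))
    model = BraVeSketch(narrow, broad)
    narrow_clip = torch.randn(2, 3, 4, 32, 32)   # B, C, T, H, W: short RGB clip
    broad_view = torch.randn(2, 3, 16, 32, 32)   # longer clip (or flow / audio input)
    print(model(narrow_clip, broad_view).item())
```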
Related papers
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [69.0740091741732]
We propose a high-fidelity image-to-video generation method by devising a frame retention branch based on a pre-trained video diffusion model, named DreamVideo.
Our model has a powerful image retention ability and, to the best of our knowledge, delivers the best results on UCF101 compared to other image-to-video models.
arXiv Detail & Related papers (2023-12-05T03:16:31Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets across a wide range of tasks, including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- Cross-Architecture Self-supervised Video Representation Learning [42.267775859095664]
We present a new cross-architecture contrastive learning framework for self-supervised video representation learning.
We introduce a temporal self-supervised learning module able to explicitly predict the edit distance between two video sequences.
We evaluate our method on the tasks of video retrieval and action recognition on the UCF101 and HMDB51 datasets.
arXiv Detail & Related papers (2022-05-26T12:41:19Z)
- Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
We propose the Motion-Contrastive Perception Network (MCPNet).
MCPNet consists of two branches, namely Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
- Multiview Pseudo-Labeling for Semi-supervised Learning from Video [102.36355560553402]
We present a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video.
Our method capitalizes on multiple views, but it nonetheless trains a model that is shared across appearance and motion input.
On multiple video recognition datasets, our method substantially outperforms its supervised counterpart, and compares favorably to previous work on standard benchmarks in self-supervised video representation learning.
arXiv Detail & Related papers (2021-04-01T17:59:48Z)
- Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
We extend the set of negative samples by introducing intra-negative samples, generated by breaking the temporal relations within a clip; a minimal sketch of this idea follows the list.
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
arXiv Detail & Related papers (2020-08-06T09:08:14Z)
- Learning Video Representations from Textual Web Supervision [97.78883761035557]
We propose to use text as a source of supervision for learning video representations.
We collect 70M video clips shared publicly on the Internet and train a model to pair each video with its associated text.
We find that this approach is an effective method of pre-training video representations.
arXiv Detail & Related papers (2020-07-29T16:19:50Z)
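As referenced in the Inter-intra Contrastive Framework entry above, here is a hypothetical sketch of intra-negative generation: a negative is built from the same clip by breaking its temporal order through frame shuffling. The function name and shapes are illustrative, and the exact generation procedure in that paper may differ.

```python
# Hypothetical sketch of "intra-negative" generation: a negative is built from
# the SAME clip by breaking its temporal order via frame shuffling. The exact
# procedure used in the Inter-intra Contrastive Framework paper may differ.
import torch


def make_intra_negative(clip: torch.Tensor) -> torch.Tensor:
    """Shuffle a clip (B, C, T, H, W) along its time axis."""
    perm = torch.randperm(clip.shape[2])
    return clip[:, :, perm]


if __name__ == "__main__":
    clip = torch.randn(2, 3, 8, 32, 32)
    intra_neg = make_intra_negative(clip)
    # Same visual content, broken temporal order: used as an extra negative in
    # the contrastive loss alongside negatives drawn from other videos.
    assert intra_neg.shape == clip.shape
```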
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.