LAVA: Language Audio Vision Alignment for Contrastive Video Pre-Training
- URL: http://arxiv.org/abs/2207.08024v1
- Date: Sat, 16 Jul 2022 21:46:16 GMT
- Title: LAVA: Language Audio Vision Alignment for Contrastive Video Pre-Training
- Authors: Sumanth Gurram, Andy Fang, David Chan, John Canny
- Abstract summary: We propose a novel learning approach based on contrastive learning, LAVA, which is capable of learning joint language, audio, and video representations in a self-supervised manner.
We demonstrate that LAVA performs competitively with the current state-of-the-art self-supervised and weakly-supervised pretraining techniques.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating representations of video data is of key importance in advancing
the field of machine perception. Most current techniques rely on hand-annotated
data, which can be difficult to work with, expensive to generate, and hard to
scale. In this work, we propose a novel learning approach based on contrastive
learning, LAVA, which is capable of learning joint language, audio, and video
representations in a self-supervised manner. We pre-train LAVA on the Kinetics
700 dataset using transformer encoders to learn representations for each
modality. We then demonstrate that LAVA performs competitively with the current
state-of-the-art self-supervised and weakly-supervised pretraining techniques
on UCF-101 and HMDB-51 video action recognition while using a fraction of the
unlabeled data.
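As a rough illustration of this pre-training setup (a minimal sketch under assumed shapes and module sizes, not the authors' released code), the snippet below applies a symmetric InfoNCE-style contrastive loss pairwise across video, audio, and text embeddings produced by small per-modality transformer encoders:

```python
# Minimal sketch of pairwise multimodal contrastive alignment (InfoNCE-style).
# Shapes, encoders, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Stand-in transformer encoder that maps a token sequence to one embedding."""
    def __init__(self, input_dim, embed_dim=256, num_layers=2, num_heads=4):
        super().__init__()
        self.proj_in = nn.Linear(input_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.proj_out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):                    # x: (B, T, input_dim)
        h = self.encoder(self.proj_in(x))    # (B, T, embed_dim)
        z = self.proj_out(h.mean(dim=1))     # temporal mean pooling
        return F.normalize(z, dim=-1)        # unit-norm embeddings

def info_nce(za, zb, temperature=0.07):
    """Symmetric contrastive loss: matched (clip_i modality A, clip_i modality B) pairs are positives."""
    logits = za @ zb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(za.size(0), device=za.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy batch: 8 clips, each with video-patch, audio-spectrogram, and text-token features.
B = 8
video = torch.randn(B, 16, 512)   # 16 frame/patch tokens of dim 512
audio = torch.randn(B, 32, 128)   # 32 spectrogram frames of dim 128
text  = torch.randn(B, 20, 300)   # 20 word embeddings of dim 300

enc_v, enc_a, enc_t = ModalityEncoder(512), ModalityEncoder(128), ModalityEncoder(300)
zv, za, zt = enc_v(video), enc_a(audio), enc_t(text)

# Align all three modality pairs; the summed loss drives joint representation learning.
loss = info_nce(zv, za) + info_nce(zv, zt) + info_nce(za, zt)
loss.backward()
```

The positive pairs here are simply the co-occurring modalities of the same clip, which is what makes the objective self-supervised.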
Related papers
- Pre-trained Visual Dynamics Representations for Efficient Policy Learning [33.62440075940917]
We propose Pre-trained Visual Dynamics Representations (PVDR) to bridge the domain gap between videos and downstream tasks for efficient policy learning.
The pre-trained visual dynamics representations capture the visual dynamics prior knowledge in the videos.
This abstract prior knowledge can be readily adapted to downstream tasks and aligned with executable actions through online adaptation.
arXiv Detail & Related papers (2024-11-05T15:18:02Z)
- Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA).
LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.
We propose a method to learn from internet-scale videos that do not have robot action labels.
arXiv Detail & Related papers (2024-10-15T16:28:09Z)
- Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
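As a very loose, illustrative sketch of the core interface, i.e. predicting the future positions of arbitrary query points in a frame (all module names, shapes, and the regression loss below are assumptions, not the ATM implementation):

```python
# Hypothetical sketch of an any-point trajectory prediction head:
# given frame features and arbitrary query points, predict H future (x, y) positions.
import torch
import torch.nn as nn

class PointTrajectoryHead(nn.Module):
    def __init__(self, feat_dim=256, horizon=8, hidden=256):
        super().__init__()
        self.horizon = horizon
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * 2),            # (x, y) offset per future step
        )

    def forward(self, frame_feat, query_points):
        # frame_feat: (B, feat_dim) pooled video/frame features
        # query_points: (B, N, 2) normalized (x, y) coordinates in [0, 1]
        B, N, _ = query_points.shape
        feat = frame_feat.unsqueeze(1).expand(B, N, -1)          # share context per point
        offsets = self.mlp(torch.cat([feat, query_points], dim=-1))
        offsets = offsets.view(B, N, self.horizon, 2)
        # Trajectories are cumulative offsets from the query location.
        return query_points.unsqueeze(2) + offsets.cumsum(dim=2)

# Toy training step against point tracks (e.g. produced by an off-the-shelf tracker).
head = PointTrajectoryHead()
frame_feat = torch.randn(4, 256)
queries = torch.rand(4, 32, 2)
gt_tracks = torch.rand(4, 32, 8, 2)           # ground-truth future positions
loss = nn.functional.mse_loss(head(frame_feat, queries), gt_tracks)
loss.backward()
```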
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
- Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models [67.31684040281465]
We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification.
In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram.
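A rough sketch of that recipe, with tiny stand-in encoders in place of real pre-trained VLM weights (everything below is illustrative, not the MOV code): each modality is rendered as an image-like tensor, pushed through the same frozen vision encoder, and scored against text embeddings of arbitrary class names.

```python
# Hypothetical sketch of open-vocabulary video classification with a shared,
# frozen vision encoder applied to multiple "image-like" modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-ins for a pre-trained VLM's frozen encoders (a real setup would load
# CLIP-style weights); resolutions are shrunk to keep the sketch lightweight.
vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512))
text_encoder = nn.Embedding(1000, 512)          # pretend token -> embedding table

def encode_image_like(x):
    """Encode image-like inputs (frames, flow images, spectrograms) -> (B, 512)."""
    with torch.no_grad():                       # the vision encoder stays frozen
        return F.normalize(vision_encoder(x), dim=-1)

def encode_class_names(token_ids):
    """Encode tokenized class names -> (C, 512); a real system uses a text transformer."""
    with torch.no_grad():
        return F.normalize(text_encoder(token_ids).mean(dim=1), dim=-1)

B = 2
frames       = torch.randn(B, 3, 64, 64)        # sampled RGB frame(s)
flow_images  = torch.randn(B, 3, 64, 64)        # optical flow rendered as an image
spectrograms = torch.randn(B, 3, 64, 64)        # audio spectrogram tiled to 3 channels

# Average the modality embeddings and score them against class-name text embeddings,
# so any set of class names (open vocabulary) can be used at test time.
z = (encode_image_like(frames) +
     encode_image_like(flow_images) +
     encode_image_like(spectrograms)) / 3.0
class_text = encode_class_names(torch.randint(0, 1000, (10, 5)))   # 10 classes, 5 tokens each
logits = z @ class_text.t()                     # (B, num_classes) similarity scores
predicted_class = logits.argmax(dim=-1)
```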
arXiv Detail & Related papers (2022-07-15T17:59:11Z)
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
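A minimal sketch of that idea (an illustration under assumed dimensions, not the paper's code): class-name embeddings from a frozen language model act as fixed semantic targets, effectively serving as the classifier weights for projected video features.

```python
# Hypothetical sketch: class-name embeddings from a pre-trained language model
# serve as semantic classification targets for video features.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, text_dim, video_dim = 101, 512, 768   # e.g. a UCF-101-sized label set

# Pretend these came from a frozen pre-trained language model (one vector per class name).
class_name_embeddings = F.normalize(torch.randn(num_classes, text_dim), dim=-1)

# Only a light projection from video features into the text space is trained.
video_backbone_out = torch.randn(16, video_dim)        # features from a video encoder
project = nn.Linear(video_dim, text_dim)

z = F.normalize(project(video_backbone_out), dim=-1)
logits = z @ class_name_embeddings.t() / 0.07          # cosine similarity / temperature
labels = torch.randint(0, num_classes, (16,))
loss = F.cross_entropy(logits, labels)                 # standard supervised objective
loss.backward()
```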
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
- Preserve Pre-trained Knowledge: Transfer Learning With Self-Distillation For Action Recognition [8.571437792425417]
We propose a novel transfer learning approach that combines self-distillation in fine-tuning to preserve knowledge from the pre-trained model learned from the large-scale dataset.
Specifically, we fix the encoder from the last epoch as the teacher model to guide the training of the current epoch's encoder during transfer learning.
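Roughly, the training loop looks like the following sketch (assumed names, losses, and weights; not the authors' implementation): after each epoch the encoder is snapshotted as a frozen teacher, and the next epoch adds a distillation term that keeps the current encoder's features close to that snapshot.

```python
# Hypothetical sketch of self-distillation during fine-tuning:
# the previous epoch's encoder acts as a frozen teacher for the current epoch.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 128))
classifier = nn.Linear(128, 51)                      # e.g. HMDB-51 classes
optimizer = torch.optim.SGD(list(encoder.parameters()) +
                            list(classifier.parameters()), lr=0.01)

def fake_loader(num_batches=5):
    for _ in range(num_batches):
        yield torch.randn(8, 256), torch.randint(0, 51, (8,))

teacher = copy.deepcopy(encoder)                     # starts from the (pre-trained) encoder
for p in teacher.parameters():
    p.requires_grad_(False)

for epoch in range(3):
    for x, y in fake_loader():
        feats = encoder(x)
        with torch.no_grad():
            teacher_feats = teacher(x)
        cls_loss = F.cross_entropy(classifier(feats), y)
        distill_loss = F.mse_loss(feats, teacher_feats)   # stay close to prior knowledge
        loss = cls_loss + 1.0 * distill_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    teacher = copy.deepcopy(encoder)                 # fix last epoch's encoder as teacher
    for p in teacher.parameters():
        p.requires_grad_(False)
```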
arXiv Detail & Related papers (2022-05-01T16:31:25Z)
- Auxiliary Learning for Self-Supervised Video Representation via Similarity-based Knowledge Distillation [2.6519061087638014]
We propose a novel approach to complement self-supervised pretraining via an auxiliary pretraining phase, based on knowledge similarity distillation, auxSKD.
Our method deploys a teacher network that iteratively distils its knowledge to the student model by capturing the similarity information between segments of unlabelled video data.
We also introduce a novel pretext task, Video Segment Pace Prediction or VSPP, which requires our model to predict the playback speed of a randomly selected segment of the input video to provide more reliable self-supervised representations.
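The pretext task can be sketched as follows (the speed set, shapes, and toy encoder are assumptions; not the authors' code): a segment is resampled at a randomly chosen playback speed, and the model is trained to classify which speed was applied.

```python
# Hypothetical sketch of a segment pace-prediction pretext task:
# resample a random segment at speed s and train the model to predict s.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

SPEEDS = [1, 2, 4]                                  # candidate playback speeds

def sample_segment_with_speed(frames, seg_len=8):
    """frames: (T, C, H, W). Returns a resampled segment and its speed label."""
    speed_idx = random.randrange(len(SPEEDS))
    stride = SPEEDS[speed_idx]
    max_start = frames.size(0) - seg_len * stride
    start = random.randrange(max_start + 1)
    segment = frames[start : start + seg_len * stride : stride]   # skipping frames = faster playback
    return segment, speed_idx

# Toy encoder: pool a segment to one vector, then classify its playback speed.
encoder = nn.Sequential(nn.Flatten(start_dim=1), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
speed_head = nn.Linear(128, len(SPEEDS))
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(speed_head.parameters()))

video = torch.randn(64, 3, 32, 32)                  # one unlabelled clip of 64 frames
segment, label = sample_segment_with_speed(video)
feats = encoder(segment).mean(dim=0, keepdim=True)  # average frame features -> (1, 128)
loss = F.cross_entropy(speed_head(feats), torch.tensor([label]))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```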
arXiv Detail & Related papers (2021-12-07T21:50:40Z)
- Masking Modalities for Cross-modal Video Retrieval [93.10669981708878]
A common strategy for pre-training video encoders is to use the accompanying speech as weak supervision.
We propose to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.
We show the superior performance of our "modality masking" pre-training approach for video retrieval on the How2R, YouCook2 and Condensed Movies datasets.
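One loose way to picture the "modality masking" objective (an interpretation with assumed names and losses, not the paper's implementation): randomly hold out one of the appearance, sound, or speech inputs, fuse the remaining modalities, and train the fused representation to retrieve the held-out one contrastively.

```python
# Hypothetical sketch of modality-masking pre-training:
# fuse the unmasked modalities and contrast against the masked (held-out) modality.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, B = 128, 16
fuse = nn.Linear(2 * dim, dim)        # fuses the two unmasked modality embeddings
proj = nn.ModuleDict({m: nn.Linear(dim, dim) for m in ["appearance", "sound", "speech"]})
optimizer = torch.optim.Adam(list(fuse.parameters()) + list(proj.parameters()))

# Pretend per-modality embeddings produced by modality-specific encoders.
batch = {m: torch.randn(B, dim) for m in ["appearance", "sound", "speech"]}

masked = random.choice(list(batch))                       # modality to hold out
kept = [m for m in batch if m != masked]

fused = F.normalize(fuse(torch.cat([proj[m](batch[m]) for m in kept], dim=-1)), dim=-1)
target = F.normalize(proj[masked](batch[masked]), dim=-1)

# InfoNCE: the fused view of clip i should retrieve the held-out modality of clip i.
logits = fused @ target.t() / 0.07
labels = torch.arange(B)
loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```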
arXiv Detail & Related papers (2021-11-01T23:55:04Z)
- Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The Cross-modal Pair Discrimination (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate the framework's effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.