AVLnet: Learning Audio-Visual Language Representations from
Instructional Videos
- URL: http://arxiv.org/abs/2006.09199v2
- Date: Tue, 29 Jun 2021 18:44:50 GMT
- Title: AVLnet: Learning Audio-Visual Language Representations from
Instructional Videos
- Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Brian Chen, Dhiraj
Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio
Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass
- Abstract summary: We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
- Score: 69.56522471911396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current methods for learning visually grounded language from videos often
rely on text annotation, such as human generated captions or machine generated
automatic speech recognition (ASR) transcripts. In this work, we introduce the
Audio-Video Language Network (AVLnet), a self-supervised network that learns a
shared audio-visual embedding space directly from raw video inputs. To
circumvent the need for text annotation, we learn audio-visual representations
from randomly segmented video clips and their raw audio waveforms. We train
AVLnet on HowTo100M, a large corpus of publicly available instructional videos,
and evaluate on image retrieval and video retrieval tasks, achieving
state-of-the-art performance. We perform analysis of AVLnet's learned
representations, showing our model utilizes speech and natural sounds to learn
audio-visual concepts. Further, we propose a tri-modal model that jointly
processes raw audio, video, and text captions from videos to learn a
multi-modal semantic embedding space useful for text-video retrieval. Our code,
data, and trained models will be released at avlnet.csail.mit.edu.
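For intuition, the following is a minimal sketch of how a shared audio-visual embedding space can be trained contrastively from paired video-clip and audio features. It is not the authors' code: the encoder dimensions, projection heads, and symmetric InfoNCE-style loss are illustrative assumptions, while AVLnet's actual encoders and training objective are described in the paper.

```python
# Minimal sketch of contrastive audio-visual embedding training (illustrative,
# not AVLnet's actual architecture or loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualEmbedder(nn.Module):
    """Projects pooled audio and video features into a shared embedding space."""
    def __init__(self, audio_dim=1024, video_dim=2048, embed_dim=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, embed_dim)  # audio branch head
        self.video_proj = nn.Linear(video_dim, embed_dim)  # video branch head

    def forward(self, audio_feats, video_feats):
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        return a, v

def contrastive_loss(a, v, temperature=0.07):
    """Symmetric InfoNCE over the batch: each clip's own audio is the positive,
    all other pairings in the batch act as negatives."""
    logits = a @ v.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example training step with pre-pooled features (e.g., from a video backbone
# and an audio encoder over the raw waveform).
model = AudioVisualEmbedder()
audio_feats = torch.randn(32, 1024)  # batch of pooled audio features
video_feats = torch.randn(32, 2048)  # batch of pooled video features
a, v = model(audio_feats, video_feats)
loss = contrastive_loss(a, v)
loss.backward()
```

The key idea this sketch conveys is that no text annotation is needed: the supervisory signal comes solely from the natural pairing of a video clip with its own audio track.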
Related papers
- Unified Video-Language Pre-training with Synchronized Audio [21.607860535968356]
We propose an enhanced framework for Video-Language pre-training with Synchronized Audio.
Our framework learns tri-modal representations in a unified self-supervised transformer.
Our model, pre-trained on only 0.9M data, achieves improved results against state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-12T07:59:46Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that can take longer subtitle passages into account, allowing us to capture contextual information beyond a single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) approach.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
arXiv Detail & Related papers (2022-09-01T07:47:01Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)