Cascaded Multilingual Audio-Visual Learning from Videos
- URL: http://arxiv.org/abs/2111.04823v1
- Date: Mon, 8 Nov 2021 20:53:50 GMT
- Title: Cascaded Multilingual Audio-Visual Learning from Videos
- Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas,
Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury,
Michael Picheny, James Glass
- Abstract summary: We propose a cascaded approach that leverages a model trained on English videos and applies it to audio-visual data in other languages.
With our cascaded approach, we show a nearly 10x improvement in retrieval performance compared to training solely on Japanese videos.
We also apply the model trained on English videos to Japanese and Hindi spoken captions of images, achieving state-of-the-art performance.
- Score: 49.44796976615445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore self-supervised audio-visual models that learn from
instructional videos. Prior work has shown that these models can relate spoken
words and sounds to visual content after training on a large-scale dataset of
videos, but they were only trained and evaluated on videos in English. To learn
multilingual audio-visual representations, we propose a cascaded approach that
leverages a model trained on English videos and applies it to audio-visual data
in other languages, such as Japanese videos. With our cascaded approach, we
show a nearly 10x improvement in retrieval performance compared to training
solely on the Japanese videos. We also apply the model trained on English
videos to Japanese and Hindi spoken captions of images, achieving
state-of-the-art performance.
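The abstract's claims are stated in terms of cross-modal retrieval performance. As a point of reference only (this is not the paper's evaluation code, and the function and variable names are hypothetical), retrieval in a shared audio-visual embedding space is typically scored with recall@K over paired clip embeddings, along the lines of the sketch below.

```python
import numpy as np

def recall_at_k(audio_emb, video_emb, k=10):
    """Audio-to-video retrieval: for each audio query, rank all video clips by
    cosine similarity and check whether its paired clip appears in the top k.

    audio_emb, video_emb: (N, D) arrays where row i of each forms a matched pair.
    """
    # L2-normalize so dot products equal cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sim = a @ v.T                                    # (N, N) similarity matrix
    ranking = np.argsort(-sim, axis=1)               # candidates sorted by score
    hits = (ranking[:, :k] == np.arange(len(a))[:, None]).any(axis=1)
    return hits.mean()

# Toy usage with random embeddings; in practice these come from the audio and
# visual encoders of the pretrained model.
rng = np.random.default_rng(0)
audio = rng.normal(size=(100, 128))
video = audio + 0.1 * rng.normal(size=(100, 128))    # loosely correlated pairs
print(f"R@10: {recall_at_k(audio, video):.2f}")
```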
Related papers
- Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models [13.855545744177586]
This paper examines the performance of existing audio language models in an underserved language, using Thai as a case study.
Despite being built on multilingual backbones, audio language models do not exhibit cross-lingual emergent abilities.
This paper integrates audio comprehension and speech instruction-following capabilities into a single unified model.
arXiv Detail & Related papers (2024-09-17T09:04:03Z)
- Unified Video-Language Pre-training with Synchronized Audio [21.607860535968356]
We propose an enhanced framework for Video-Language pre-training with Synchronized Audio.
Our framework learns tri-modal representations in a unified self-supervised transformer.
Our model, pre-trained on only 0.9M samples, achieves improved results compared to state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-12T07:59:46Z)
- Fine-grained Audible Video Description [61.81122862375985]
We construct the first fine-grained audible video description benchmark (FAVDBench)
For each video clip, we first provide a one-sentence summary of the video, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end.
We demonstrate that employing fine-grained video descriptions enables the generation of more intricate videos than using captions alone.
arXiv Detail & Related papers (2023-03-27T22:03:48Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that (1) multilingual models trained on more data outperform monolingual ones, but, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval [39.41224716332499]
We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval.
Inspired by the fact that English text-video retrieval outperforms retrieval in other languages, we train a student model using input text in different languages to match the cross-modal predictions obtained from input text in English.
We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset to 8 other languages.
arXiv Detail & Related papers (2022-10-07T15:30:24Z)
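The C2KD entry above trains a student on non-English text so that its text-video predictions match those obtained with English text. The summary does not spell out the loss, so the following is only a generic knowledge-distillation sketch under that reading (hypothetical names; KL divergence between the two similarity distributions), not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def cross_lingual_distillation_loss(student_sims, teacher_sims, tau=2.0):
    """Generic KD loss: KL divergence between the student's text-video similarity
    distribution (non-English query) and the teacher's (English query) over the
    same candidate videos. Not necessarily the exact objective used in C2KD.
    """
    teacher_probs = F.softmax(teacher_sims / tau, dim=-1)
    student_log_probs = F.log_softmax(student_sims / tau, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * tau ** 2

# Toy usage: rows are text queries, columns are candidate videos in the batch.
teacher = torch.randn(8, 8)                         # similarities from English text
student = torch.randn(8, 8, requires_grad=True)     # similarities from another language
loss = cross_lingual_distillation_loss(student, teacher)
loss.backward()
print(f"distillation loss: {loss.item():.3f}")
```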
- Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) strategy.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
arXiv Detail & Related papers (2022-09-01T07:47:01Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)
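The AVLnet entry above describes learning a shared audio-visual embedding space directly from raw video. Its summary does not state the training objective, so the sketch below uses a generic symmetric InfoNCE-style contrastive loss over batch-paired audio and video embeddings as one common stand-in (illustrative only; names are hypothetical and not from the AVLnet code release).

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: row i of each tensor is treated as a
    positive audio-video pair, and all other rows in the batch as negatives.
    """
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(video_emb, dim=1)
    logits = a @ v.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Average the audio-to-video and video-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: in practice the embeddings come from the audio and visual branches.
audio = torch.randn(32, 256, requires_grad=True)
video = torch.randn(32, 256, requires_grad=True)
loss = audio_visual_contrastive_loss(audio, video)
loss.backward()
print(f"contrastive loss: {loss.item():.3f}")
```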