Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries
- URL: http://arxiv.org/abs/2308.09089v1
- Date: Thu, 17 Aug 2023 16:38:30 GMT
- Title: Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries
- Authors: Julia Wilkins, Justin Salamon, Magdalena Fuentes, Juan Pablo Bello,
Oriol Nieto
- Abstract summary: Finding the right sound effects (SFX) to match moments in a video is a difficult and time-consuming task.
We propose a framework for recommending HQ SFX given a video frame.
We show that our system, trained using our automatic data curation pipeline, significantly outperforms baselines trained on in-the-wild data.
- Score: 18.224608377111533
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Finding the right sound effects (SFX) to match moments in a video is a
difficult and time-consuming task, and relies heavily on the quality and
completeness of text metadata. Retrieving high-quality (HQ) SFX using a video
frame directly as the query is an attractive alternative, removing the reliance
on text metadata and providing a low barrier to entry for non-experts. Due to
the lack of HQ audio-visual training data, previous work on audio-visual
retrieval relies on YouTube (in-the-wild) videos of varied quality for
training, where the audio is often noisy and the video of amateur quality. As
such, it is unclear whether these systems would generalize to the task of
matching HQ audio to production-quality video. To address this, we propose a
multimodal framework for recommending HQ SFX given a video frame by (1)
leveraging large language models and foundational vision-language models to
bridge HQ audio and video to create audio-visual pairs, resulting in a highly
scalable automatic audio-visual data curation pipeline; and (2) using
pre-trained audio and visual encoders to train a contrastive learning-based
retrieval system. We show that our system, trained using our automatic data
curation pipeline, significantly outperforms baselines trained on in-the-wild
data on the task of HQ SFX retrieval for video. Furthermore, while the
baselines fail to generalize to this task, our system generalizes well from
clean to in-the-wild data, outperforming the baselines on a dataset of YouTube
videos despite only being trained on the HQ audio-visual pairs. A user study
confirms that people prefer SFX retrieved by our system over the baseline 67%
of the time both for HQ and in-the-wild data. Finally, we present ablations to
determine the impact of model and data pipeline design choices on downstream
retrieval performance. Please visit our project website to listen to and view
our SFX retrieval results.
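
The abstract describes the system only at a high level, so the following is a minimal, hypothetical sketch of the two components it names: (1) bridging HQ audio and video via language, read here as matching frame captions against LLM-cleaned SFX text descriptions in a shared text embedding space, and (2) a contrastive retrieval model, read here as a CLIP-style symmetric InfoNCE objective over projections of pretrained audio and visual encoder outputs. The function names, tensor shapes, and pairing strategy below are assumptions for illustration, not the authors' implementation.

```python
# Minimal, hypothetical sketch (not the authors' code) of the two components
# described in the abstract: language-based audio-visual pairing and a
# CLIP-style contrastive retrieval objective.

import torch
import torch.nn.functional as F


def pair_by_language(frame_text_emb: torch.Tensor,
                     sfx_text_emb: torch.Tensor,
                     top_k: int = 1) -> torch.Tensor:
    """Pair each video frame with the SFX whose text description is closest.

    frame_text_emb: (N, D) embeddings of frame captions, e.g. from a
        vision-language model; assumed L2-normalized.
    sfx_text_emb: (M, D) embeddings of SFX descriptions, e.g. metadata
        cleaned up by an LLM; assumed L2-normalized.
    Returns the indices of the top-k matching SFX for each frame.
    """
    sim = frame_text_emb @ sfx_text_emb.T        # cosine similarities, (N, M)
    return sim.topk(top_k, dim=-1).indices       # (N, top_k)


def contrastive_loss(audio_emb: torch.Tensor,
                     visual_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired audio/visual embeddings.

    audio_emb, visual_emb: (B, D) projections of pretrained encoder outputs;
    row i of each tensor is assumed to come from the same curated pair.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.T / temperature               # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Audio i should retrieve frame i, and frame i should retrieve audio i.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

Under this reading, pair_by_language would drive the automatic data curation (each production-quality frame is paired with the HQ SFX whose description it best matches), and contrastive_loss would train projection heads on top of the pretrained encoders so that, at query time, a video frame retrieves its nearest SFX in the joint embedding space.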
Related papers
- Audio-visual training for improved grounding in video-text LLMs [1.9320359360360702]
We propose a model architecture that handles audio-visual inputs explicitly.
We train our model with both audio and visual data from a video instruction-tuning dataset.
For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset.
arXiv Detail & Related papers (2024-07-21T03:59:14Z)
- video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models [27.54879344983513]
Video-SALMONN can understand not only visual frame sequences, audio events and music, but speech as well.
Video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks not handled by other av-LLMs.
arXiv Detail & Related papers (2024-06-22T01:36:11Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Audiovisual Masked Autoencoders [93.22646144125457]
We show that we can achieve significant improvements on audiovisual downstream classification tasks.
We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens.
arXiv Detail & Related papers (2022-12-09T17:34:53Z)
- Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) method.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
arXiv Detail & Related papers (2022-09-01T07:47:01Z)
- An investigation on selecting audio pre-trained models for audio captioning [5.837881923712393]
Pre-trained models are widely used in audio captioning due to the high complexity of the task.
Unless a comprehensive system is re-trained, it is hard to determine how well a pre-trained model contributes to an audio captioning system.
In this paper, a series of pre-trained models are investigated for the correlation between extracted audio features and the performance of audio captioning.
arXiv Detail & Related papers (2022-08-12T06:14:20Z)
- MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound [90.1857707251566]
We introduce MERLOT Reserve, a model that represents videos jointly over time.
We replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
Our objective learns faster than alternatives, and performs well at scale.
arXiv Detail & Related papers (2022-01-07T19:00:21Z)
- Weakly Supervised Construction of ASR Systems with Massive Video Data [18.5050375783871]
We present a weakly supervised framework for constructing ASR systems with massive video data.
We propose an effective approach to extracting high-quality audio aligned with transcripts from videos, based on Optical Character Recognition (OCR).
Our framework can easily produce state-of-the-art results on six public datasets for Mandarin speech recognition.
arXiv Detail & Related papers (2020-08-04T03:11:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.