V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by
Connecting Foundation Models
- URL: http://arxiv.org/abs/2308.09300v4
- Date: Thu, 14 Dec 2023 00:15:14 GMT
- Title: V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by
Connecting Foundation Models
- Authors: Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, Weidong
Cai
- Abstract summary: Building artificial intelligence systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research.
We propose a lightweight solution to the vision-to-audio (V2A) generation problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM.
Our method only requires a quick training of the V2A-Mapper to produce high-fidelity and visually-aligned sound.
- Score: 14.538853403226751
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building artificial intelligence (AI) systems on top of a set of foundation
models (FMs) is becoming a new paradigm in AI research. Their representative
and generative abilities learnt from vast amounts of data can be easily adapted
and transferred to a wide range of downstream tasks without extra training from
scratch. However, leveraging FMs in cross-modal generation remains
under-researched when audio modality is involved. On the other hand,
automatically generating semantically-relevant sound from visual input is an
important problem in cross-modal generation studies. To solve this
vision-to-audio (V2A) generation problem, existing methods tend to design and
build complex systems from scratch using modestly sized datasets. In this
paper, we propose a lightweight solution to this problem by leveraging
foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate
the domain gap between the latent space of the visual CLIP and the auditory
CLAP models. Then we propose a simple yet effective mapper mechanism
(V2A-Mapper) to bridge the domain gap by translating the visual input between
CLIP and CLAP spaces. Conditioned on the translated CLAP embedding, pretrained
audio generative FM AudioLDM is adopted to produce high-fidelity and
visually-aligned sound. Compared to previous approaches, our method only
requires a quick training of the V2A-Mapper. We further analyze and conduct
extensive experiments on the choice of the V2A-Mapper and show that a
generative mapper is better at fidelity and variability (FD) while a regression
mapper is slightly better at relevance (CS). Both objective and subjective
evaluation on two V2A datasets demonstrate the superiority of our proposed
method compared to current state-of-the-art approaches - trained with 86% fewer
parameters but achieving 53% and 19% improvement in FD and CS, respectively.
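
The pipeline described above is simple enough to sketch directly. Below is a minimal PyTorch illustration of a regression-style V2A-Mapper, assuming 512-dimensional CLIP and CLAP embeddings and a two-layer MLP; the dimensions, hidden width, class name, and random placeholder inputs are illustrative assumptions rather than the authors' exact architecture, and the frozen CLIP, CLAP, and AudioLDM models themselves are omitted.

```python
# Minimal sketch of the V2A-Mapper idea described above, assuming
# 512-dimensional CLIP and CLAP embeddings and a two-layer MLP; the
# dimensions, hidden width, and random inputs are illustrative
# assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn


class V2AMapper(nn.Module):
    """Regression-style mapper: translates a (frozen) CLIP image embedding
    into the CLAP embedding space used to condition AudioLDM."""

    def __init__(self, clip_dim: int = 512, clap_dim: int = 512, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, clap_dim),
        )

    def forward(self, clip_emb: torch.Tensor) -> torch.Tensor:
        mapped = self.net(clip_emb)
        # L2-normalise to match the unit-norm convention of CLIP/CLAP embeddings.
        return mapped / mapped.norm(dim=-1, keepdim=True)


if __name__ == "__main__":
    mapper = V2AMapper()
    clip_emb = torch.randn(4, 512)   # stand-in for frozen CLIP image features
    clap_emb = mapper(clip_emb)      # translated embedding in CLAP space
    print(clap_emb.shape)            # torch.Size([4, 512])
    # In the full system this embedding would condition a frozen AudioLDM
    # model to synthesise the sound; that step is omitted here.
```

Under this setup only the mapper's parameters would be trained (for example by regressing onto CLAP audio embeddings of paired sound clips, or with a diffusion-style objective for the generative variant), while CLIP, CLAP, and AudioLDM stay frozen, which is what keeps the approach lightweight.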
Related papers
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks.
Our framework can improve the performance of the reverberator and dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z)
- Aligning Modalities in Vision Large Language Models via Preference Fine-tuning [67.62925151837675]
In this work, we frame the hallucination problem as an alignment issue and tackle it with preference tuning.
Specifically, we propose POVID to generate feedback data with AI models.
We use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data.
In experiments across broad benchmarks, we show that we can not only reduce hallucinations, but improve model performance across standard benchmarks, outperforming prior approaches.
arXiv Detail & Related papers (2024-02-18T00:56:16Z)
- Aligning Large Multimodal Models with Factually Augmented RLHF [176.54751941088819]
Large Multimodal Models (LMMs) are built across modalities, and misalignment between the two modalities can result in "hallucination".
We adapt the Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment.
We propose a new alignment algorithm called Factually Augmented RLHF that augments the reward model with additional factual information.
Our approach achieves a remarkable improvement on the LLaVA-Bench dataset, reaching 94% of the performance level of the text-only GPT-4.
arXiv Detail & Related papers (2023-09-25T20:59:33Z)
- HAVE-Net: Hallucinated Audio-Visual Embeddings for Few-Shot Classification with Unimodal Cues [19.800985243540797]
Challenges such as occlusion, intra-class variance, and lighting variation might arise while training neural networks using unimodal remote sensing (RS) visual input.
We propose a novel few-shot generative framework, Hallucinated Audio-Visual Embeddings-Network (HAVE-Net), to meta-train cross-modal features from limited unimodal data.
arXiv Detail & Related papers (2023-09-23T20:05:00Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We conduct our study for a novel goal where we can merge vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes.
arXiv Detail & Related papers (2023-04-28T15:43:21Z)
- A Light Weight Model for Active Speaker Detection [7.253335671577093]
We construct a lightweight active speaker detection architecture by reducing input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying gated recurrent unit (GRU) with low computational complexity for cross-modal modeling.
Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP performance (94.1% vs. 94.2%).
Our framework also performs well on the Columbia dataset showing good robustness.
arXiv Detail & Related papers (2023-03-08T08:40:56Z)
- Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval [7.459223771397159]
Cross-modal data (e.g. audiovisual) have different distributions and representations that cannot be directly compared.
To bridge the gap between audiovisual modalities, we learn a common subspace for them by utilizing the intrinsic correlation in the natural synchronization of audio-visual data with the aid of annotated labels.
We propose a new AV-CMR model to optimize semantic features by directly predicting labels and then measuring the intrinsic correlation between audio-visual data using a complete cross-triplet loss.
arXiv Detail & Related papers (2022-11-07T10:37:14Z)
- Learning Phone Recognition from Unpaired Audio and Phone Sequences Based on Generative Adversarial Network [58.82343017711883]
This paper investigates how to learn directly from unpaired phone sequences and speech utterances.
GAN training is adopted in the first stage to find the mapping relationship between unpaired speech and phone sequence.
In the second stage, another HMM is introduced and trained on the generator's output, which further boosts performance.
arXiv Detail & Related papers (2022-07-29T09:29:28Z)
- Audio-Visual Decision Fusion for WFST-based and seq2seq Models [3.2771898634434997]
Under noisy conditions, speech recognition systems suffer from high Word Error Rates (WER).
We propose novel methods to fuse information from audio and visual modalities at inference time.
We show that our methods give significant improvements over acoustic-only WER.
arXiv Detail & Related papers (2020-01-29T13:45:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.