Gotta Hear Them All: Sound Source Aware Vision to Audio Generation
- URL: http://arxiv.org/abs/2411.15447v2
- Date: Tue, 26 Nov 2024 03:49:11 GMT
- Title: Gotta Hear Them All: Sound Source Aware Vision to Audio Generation
- Authors: Wei Guo, Heng Wang, Jianbo Ma, Weidong Cai
- Abstract summary: Vision-to-audio (V2A) has broad applications in multimedia.
We propose a Sound Source-Aware V2A (SSV2A) generator.
We show that SSV2A surpasses state-of-the-art methods in both generation fidelity and relevance.
- Score: 13.55717701044619
- Abstract: Vision-to-audio (V2A) synthesis has broad applications in multimedia. Recent advancements of V2A methods have made it possible to generate relevant audios from inputs of videos or still images. However, the immersiveness and expressiveness of the generation are limited. One possible problem is that existing methods solely rely on the global scene and overlook details of local sounding objects (i.e., sound sources). To address this issue, we propose a Sound Source-Aware V2A (SSV2A) generator. SSV2A is able to locally perceive multimodal sound sources from a scene with visual detection and cross-modality translation. It then contrastively learns a Cross-Modal Sound Source (CMSS) Manifold to semantically disambiguate each source. Finally, we attentively mix their CMSS semantics into a rich audio representation, from which a pretrained audio generator outputs the sound. To model the CMSS manifold, we curate a novel single-sound-source visual-audio dataset VGGS3 from VGGSound. We also design a Sound Source Matching Score to measure localized audio relevance. This is to our knowledge the first work to address V2A generation at the sound-source level. Extensive experiments show that SSV2A surpasses state-of-the-art methods in both generation fidelity and relevance. We further demonstrate SSV2A's ability to achieve intuitive V2A control by compositing vision, text, and audio conditions. Our SSV2A generation can be tried and heard at https://ssv2a.github.io/SSV2A-demo .
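The abstract names two mechanisms: contrastive learning of a shared visual-audio embedding space, and attentive mixing of per-source semantics into one representation. The sketch below illustrates both ideas in generic form only. It is not the authors' implementation: InfoNCE is used here as a stand-in for the unspecified contrastive objective, scaled dot-product attention as a stand-in for the unspecified mixing step, and the names `info_nce` and `attentive_mix` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def info_nce(visual, audio, temperature=0.1):
    """Mean InfoNCE loss in the visual -> audio direction.

    Assumption: visual[i] and audio[i] embed the same sound source,
    and every other audio[j] serves as a negative for visual[i].
    """
    losses = []
    for i, v in enumerate(visual):
        logits = [cosine(v, a) / temperature for a in audio]
        probs = softmax(logits)
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)


def attentive_mix(query, sources):
    """Mix per-source embeddings with scaled dot-product attention.

    Assumption: `query` is some global scene embedding and each entry
    of `sources` is one sound-source embedding of the same dimension.
    Returns the attention weights and the weighted sum of the sources.
    """
    dim = len(query)
    scores = [sum(q * s for q, s in zip(query, src)) / math.sqrt(dim)
              for src in sources]
    weights = softmax(scores)
    mixed = [sum(w * src[k] for w, src in zip(weights, sources))
             for k in range(dim)]
    return weights, mixed
```

With matched visual and audio embeddings the loss is small, and it grows when the pairing is shuffled. The mixing weights always sum to one, so the mixed vector stays within the span of the source embeddings.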
Related papers
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos [87.32349247938136]
Existing approaches implicitly assume total correspondence between the video and audio during training.
We propose a novel ambient-aware audio generation model, AV-LDM.
Our approach is the first to focus video-to-audio generation faithfully on the observed visual content.
arXiv Detail & Related papers (2024-06-13T16:10:19Z)
- Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language [77.33458847943528]
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos.
We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision.
arXiv Detail & Related papers (2024-06-09T03:38:21Z)
- AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation [58.72068260933836]
The input and output of the system are multimodal (i.e., audio and visual speech).
We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages.
In contrast to Speech-to-Speech Translation (A2A), which solely translates between audio modalities, the proposed AV2AV directly translates between audio-visual speech.
arXiv Detail & Related papers (2023-12-05T05:36:44Z)
- BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge [43.92428145744478]
We propose a two-stage bootstrapping audio-visual segmentation framework.
In the first stage, we employ a segmentation model to localize potential sounding objects from visual data.
In the second stage, we develop an audio-visual semantic integration strategy (AVIS) to localize the authentic-sounding objects.
arXiv Detail & Related papers (2023-08-20T06:48:08Z)
- Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts input sound into a sound token, like an ordinary word, which can plug and play with existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other text and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2023-06-20T12:50:49Z)
- AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation [55.1650189699753]
Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date.
Current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech.
We present AV-TranSpeech, the first audio-visual speech-to-speech model without relying on intermediate text.
arXiv Detail & Related papers (2023-05-24T17:59:03Z)
- Audio-Visual Grouping Network for Sound Localization from Mixtures [30.756247389435803]
Previous single-source methods mainly used the audio-visual association as clues to localize sounding objects in each image.
We propose a novel audio-visual grouping network, namely AVGN, that can directly learn category-wise semantic features for each source from the input audio mixture and image.
Compared to existing multi-source methods, our new framework can localize a flexible number of sources and disentangle category-aware audio-visual representations for individual sound sources.
arXiv Detail & Related papers (2023-03-29T22:58:55Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)