MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment
- URL: http://arxiv.org/abs/2503.10287v1
- Date: Thu, 13 Mar 2025 11:56:25 GMT
- Title: MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment
- Authors: Hao Zhou, Xiaobao Guo, Yuzhe Zhu, Adams Wai-Kin Kong
- Abstract summary: We propose a method called MACS to conduct multi-source audio-to-image generation. This is the first work that explicitly separates multi-source audio to capture the rich audio components before image generation. The proposed MACS outperforms the current state-of-the-art methods in 17 of the 21 evaluation metrics across all tasks.
- Score: 20.54227825704359
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Propelled by the breakthrough in deep generative models, audio-to-image generation has emerged as a pivotal cross-modal task that converts complex auditory signals into rich visual representations. However, previous works focus only on single-source audio inputs for image generation, ignoring the multi-source character of natural auditory scenes and thus limiting performance in generating comprehensive visual content. To bridge this gap, we propose a method called MACS to conduct multi-source audio-to-image generation. This is the first work that explicitly separates multi-source audio to capture the rich audio components before image generation. MACS is a two-stage method. In the first stage, multi-source audio inputs are separated by a weakly supervised method, where the audio and text labels are semantically aligned by casting them into a common space using the large pre-trained CLAP model. We introduce a ranking loss to account for the contextual significance of the separated audio signals. In the second stage, efficient image generation is achieved by mapping the separated audio signals to the generation condition using only a trainable adapter and an MLP layer. We preprocess the LLP dataset as the first full multi-source audio-to-image generation benchmark. Experiments are conducted on multi-source, mixed-source, and single-source audio-to-image generation tasks. The proposed MACS outperforms the current state-of-the-art methods in 17 of the 21 evaluation metrics across all tasks and delivers superior visual quality. The code will be publicly available.
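To make the two stages concrete, here is a minimal PyTorch sketch pairing a CLAP-style audio-text similarity and a pairwise ranking loss over contextual-significance scores (stage 1) with an adapter-plus-MLP condition mapping (stage 2). All dimensions, names, and the margin value are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def clap_alignment(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarities between K separated-audio embeddings (K, D) and
    K text-label embeddings (K, D) cast into a common CLAP-style space."""
    return F.normalize(audio_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T

def ranking_loss(scores: torch.Tensor, significance: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """Pairwise margin loss: a track deemed more contextually significant
    should receive the higher predicted score. scores, significance: (K,)."""
    gap = scores.unsqueeze(0) - scores.unsqueeze(1)    # gap[i, j] = scores[j] - scores[i]
    want = (significance.unsqueeze(0) > significance.unsqueeze(1)).float()
    return (want * F.relu(margin - gap)).sum() / want.sum().clamp(min=1)

class ConditionAdapter(torch.nn.Module):
    """Stage-2 sketch: map separated audio embeddings to a generation
    condition with a trainable adapter plus a small MLP head (dims assumed)."""
    def __init__(self, d_audio: int = 512, d_cond: int = 768):
        super().__init__()
        self.adapter = torch.nn.Linear(d_audio, d_cond)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(d_cond, d_cond), torch.nn.GELU(),
            torch.nn.Linear(d_cond, d_cond))

    def forward(self, audio_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.adapter(audio_emb))       # (K, d_audio) -> (K, d_cond)
```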
Related papers
- Seeing Soundscapes: Audio-Visual Generation and Separation from Soundscapes Using Audio-Visual Separator [3.082874165638936]
We propose an Audio-Visual Generation and Separation model (AV-GAS) for generating images from soundscapes.
Our contribution is threefold: First, we propose a new challenge in the audio-visual generation task, which is to generate an image given a multi-class audio input.
Second, we introduce a new audio-visual separation task, which involves generating separate images for each class present in a mixed audio input.
arXiv Detail & Related papers (2025-04-25T11:51:04Z)
- UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation [44.21422404659117]
UniForm is a unified multi-task diffusion transformer that jointly generates audio and visual modalities in a shared latent space.
A single diffusion process models both audio and video, capturing the inherent correlations between sound and vision.
By leveraging large language models and a large-scale text-audio-video combined dataset, UniForm achieves greater generative diversity than prior approaches.
arXiv Detail & Related papers (2025-02-06T09:18:30Z)
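A minimal sketch of the shared-latent idea named in the UniForm summary, assuming (not reproducing) its design: one transformer predicts noise for concatenated audio and video tokens in a single pass, so self-attention can model cross-modal correlations; timestep and text conditioning are omitted for brevity.

```python
import torch

class JointDenoiser(torch.nn.Module):
    """One denoising step over audio + video tokens in a shared latent space."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        layer = torch.nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = torch.nn.TransformerEncoder(layer, n_layers)
        self.head = torch.nn.Linear(d_model, d_model)  # per-token noise prediction

    def forward(self, audio_tokens, video_tokens):
        # Concatenate modalities so attention spans both sound and vision.
        x = torch.cat([audio_tokens, video_tokens], dim=1)   # (B, Ta+Tv, D)
        eps = self.head(self.backbone(x))
        return eps[:, :audio_tokens.size(1)], eps[:, audio_tokens.size(1):]

# Predicted noise for each modality from one shared pass (shapes assumed).
eps_a, eps_v = JointDenoiser()(torch.randn(2, 16, 512), torch.randn(2, 64, 512))
```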
- Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models [56.776580717999806]
Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios.
arXiv Detail & Related papers (2024-09-27T12:06:53Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establishing an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
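As a hedged illustration of Auto-ACD's captioning step, the sketch below folds multi-modality clues into one LLM prompt; `llm` is a stand-in for any text-completion call, and the clue fields are illustrative rather than the dataset's actual schema.

```python
from typing import Callable

def build_caption_prompt(clues: dict) -> str:
    """Fold extracted clues (tags, scene, objects, ...) into one instruction."""
    lines = [f"- {name}: {value}" for name, value in clues.items() if value]
    return ("Write one fluent caption for an audio clip that is consistent "
            "with all of these clues:\n" + "\n".join(lines))

def caption_audio(clues: dict, llm: Callable[[str], str]) -> str:
    """Ask the LLM to paraphrase the clues into a congruent caption."""
    return llm(build_caption_prompt(clues)).strip()

# Example clues an audio/vision tagger might produce for one clip.
example = {"audio tags": "dog barking, wind", "scene": "park", "objects": "dog, trees"}
```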
- AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes [6.375996974877916]
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our experiments demonstrate that AudioFormer significantly outperforms prevailing monomodal audio classification models.
arXiv Detail & Related papers (2023-08-14T15:47:25Z)
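A small sketch of the mechanism the AudioFormer summary names: discrete acoustic codes are embedded like word tokens and classified with a transformer encoder. The code vocabulary, dimensions, and the 527-class head (AudioSet-style) are assumptions, not AudioFormer's configuration.

```python
import torch

class CodeClassifier(torch.nn.Module):
    """Audio classification over discrete acoustic codes from a neural codec."""
    def __init__(self, n_codes: int = 1024, d_model: int = 256, n_classes: int = 527):
        super().__init__()
        self.embed = torch.nn.Embedding(n_codes, d_model)   # codes as "words"
        layer = torch.nn.TransformerEncoderLayer(d_model, 4, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(layer, 4)
        self.cls = torch.nn.Linear(d_model, n_classes)

    def forward(self, codes: torch.Tensor) -> torch.Tensor:  # codes: (B, T) ints
        h = self.encoder(self.embed(codes))                  # (B, T, D)
        return self.cls(h.mean(dim=1))                       # mean-pool -> logits

logits = CodeClassifier()(torch.randint(0, 1024, (2, 128)))
```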
- Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts input sound into a sound token, like an ordinary word, which can plug and play with existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other text- and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2023-06-20T12:50:49Z)
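AAI's "sound token" idea invites a short sketch: project an audio embedding into the text-encoder embedding space as a single extra token and splice it into the prompt sequence a T2I model already consumes. Dimensions and the splice position are assumptions, not AAI's implementation.

```python
import torch

class SoundTokenAdapter(torch.nn.Module):
    """Map one audio embedding to one token in the text-embedding space."""
    def __init__(self, d_audio: int = 512, d_text: int = 768):
        super().__init__()
        self.proj = torch.nn.Sequential(
            torch.nn.Linear(d_audio, d_text), torch.nn.GELU(),
            torch.nn.Linear(d_text, d_text))

    def forward(self, audio_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_emb).unsqueeze(1)   # (B, d_audio) -> (B, 1, d_text)

def splice(prompt_emb: torch.Tensor, sound_tok: torch.Tensor, pos: int = 1):
    """Insert the sound token after the BOS embedding, like an ordinary word."""
    return torch.cat([prompt_emb[:, :pos], sound_tok, prompt_emb[:, pos:]], dim=1)
```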
- New Audio Representations Image Gan Generation from BriVL [0.0]
We propose a robust audio representation learning method, WavBriVL, based on Bridging-Vision-and-Language (BriVL).
WavBriVL projects audio, image, and text into a shared embedding space, so that multi-modal applications can be realized.
arXiv Detail & Related papers (2023-03-08T13:58:55Z)
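A compact sketch of the training signal usually behind a shared space like WavBriVL's: a symmetric contrastive loss that pulls matched cross-modal pairs together, CLIP/BriVL-style. The encoders are stubbed out, and the dimensions and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(za: torch.Tensor, zb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of matched embedding pairs (B, D)."""
    za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
    logits = za @ zb.T / temperature          # (B, B) cross-modal similarities
    labels = torch.arange(za.size(0))         # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# e.g. audio and image embeddings from two projection heads (random values here)
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```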
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion is built around a sequential multi-modal U-Net designed for a joint denoising process.
Experiments show superior results in unconditional audio-video generation and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
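The auto-regressive scheme AudioGen's summary describes reduces to sampling one discrete audio token at a time given the text and all previous tokens, as in this sketch; `model` is a hypothetical callable returning next-token logits, and a real system would decode the tokens back to a waveform with a neural codec.

```python
import torch

@torch.no_grad()
def generate(model, text_emb: torch.Tensor, n_steps: int = 256, temperature: float = 1.0):
    """Sample a sequence of discrete audio tokens conditioned on text."""
    codes = torch.empty(1, 0, dtype=torch.long)           # empty token sequence
    for _ in range(n_steps):
        logits = model(text_emb, codes)                   # (1, vocab) next-token logits
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1)                 # sample one audio token
        codes = torch.cat([codes, nxt], dim=1)
    return codes                                          # (1, n_steps) codec indices
```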
- Robust One Shot Audio to Video Generation [10.957973845883162]
OneShotA2V is a novel approach to synthesizing a talking-person video of arbitrary length from two inputs: an audio signal and a single unseen image of the person.
OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person.
arXiv Detail & Related papers (2020-12-14T10:50:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.