DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model
- URL: http://arxiv.org/abs/2502.18952v1
- Date: Wed, 26 Feb 2025 09:01:59 GMT
- Title: DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model
- Authors: Lei Zhao, Sizhou Chen, Linfeng Feng, Xiao-Lei Zhang, Xuelong Li
- Abstract summary: We propose a text-to-spatial-audio (TTSA) generation framework named DualSpec. It first trains variational autoencoders (VAEs) to extract latent acoustic representations from sound event audio. Finally, it trains a diffusion model on the latent acoustic representations and text features for spatial audio generation.
- Score: 48.57556892287629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-audio (TTA) generation, which synthesizes audio signals from textual descriptions, has received substantial attention in recent years. However, recent works have focused only on generating monaural audio from text. Spatial audio provides a more immersive auditory experience than monaural audio, e.g., in virtual reality. To address this gap, we propose a text-to-spatial-audio (TTSA) generation framework named DualSpec. Specifically, it first trains variational autoencoders (VAEs) to extract latent acoustic representations from sound event audio. Then, given text that describes sound events and their directions, the proposed method uses the encoder of a pretrained large language model to transform the text into text features. Finally, it trains a diffusion model on the latent acoustic representations and text features for spatial audio generation. At inference, only the text description is needed to generate spatial audio. In particular, to simultaneously improve the synthesis quality and azimuth accuracy of the generated spatial sound events, we propose to use two kinds of acoustic features: Mel spectrograms, which favor synthesis quality, and short-time Fourier transform (STFT) spectrograms, which favor azimuth accuracy. We provide a pipeline for constructing a spatial audio dataset with text prompts, used to train the VAEs and the diffusion model. We also introduce new spatial-aware evaluation metrics to quantify the azimuth errors of the generated spatial audio recordings. Experimental results demonstrate that the proposed method can generate spatial audio with high directional and event consistency.
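The dual-spectrogram idea above is concrete enough to sketch. The following is a minimal, hypothetical feature-extraction example using torchaudio (the library choice, file name, FFT size, hop length, and the interaural-phase cue are illustrative assumptions, not the paper's exact configuration): it computes a Mel spectrogram and a complex STFT spectrogram side by side for a two-channel spatial clip, with the STFT retaining the inter-channel phase that carries azimuth cues.

```python
import torch
import torchaudio

# Hypothetical two-channel spatial recording; shape: (channels, samples).
waveform, sr = torchaudio.load("spatial_event.wav")

# Mel spectrogram: the feature the paper associates with synthesis quality.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=256, n_mels=64
)
mel = mel_transform(waveform)  # (channels, n_mels, frames)

# Complex STFT: keeps inter-channel phase, the raw material for azimuth cues.
stft = torch.stft(
    waveform, n_fft=1024, hop_length=256,
    window=torch.hann_window(1024), return_complex=True
)  # (channels, n_fft // 2 + 1, frames)

# One simple directional cue: interaural phase difference between channels.
ipd = torch.angle(stft[0]) - torch.angle(stft[1])
```

In the pipeline described above, features of this kind would be compressed by the VAEs into latents on which the text-conditioned diffusion model is trained; the snippet covers only the feature-extraction step.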
Related papers
- ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model [2.2927722373373247]
We introduce ImmerseDiffusion, an end-to-end generative audio model that produces 3D immersive soundscapes conditioned on the spatial, temporal, and environmental conditions of sound objects.
arXiv Detail & Related papers (2024-10-19T02:28:53Z)
- Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation [32.24603883810094]
Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models.
We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions.
By leveraging spatial guidance, our model achieves the objective of generating immersive and controllable spatial audio from text.
arXiv Detail & Related papers (2024-10-14T16:18:29Z)
- Leveraging Language Model Capabilities for Sound Event Detection [10.792576135806623]
We propose an end-to-end framework for understanding audio features while simultaneously generating sound events and their temporal locations.
Specifically, we employ pretrained acoustic models to capture discriminative features across different categories and language models for autoregressive text generation.
arXiv Detail & Related papers (2023-08-22T15:59:06Z)
- Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z)
- Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
They often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z)
- DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment [30.38594416942543]
We propose a novel and personalized text-to-sound generation approach with visual alignment based on latent diffusion models, namely DiffAVA.
Our DiffAVA leverages a multi-head attention transformer to aggregate temporal information from video features, and a dual multi-modal residual network to fuse temporal visual representations with text embeddings.
Experimental results on the AudioCaps dataset demonstrate that the proposed DiffAVA can achieve competitive performance on visual-aligned text-to-audio generation.
arXiv Detail & Related papers (2023-05-22T10:37:27Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.