Auffusion: Leveraging the Power of Diffusion and Large Language Models
for Text-to-Audio Generation
- URL: http://arxiv.org/abs/2401.01044v1
- Date: Tue, 2 Jan 2024 05:42:14 GMT
- Title: Auffusion: Leveraging the Power of Diffusion and Large Language Models
for Text-to-Audio Generation
- Authors: Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li
- Abstract summary: We introduce Auffusion, a Text-to-Image (T2I) system adapting T2I model frameworks to Text-to-Audio (TTA) task.
Our evaluations demonstrate that Auffusion surpasses previous TTA approaches using limited data and computational resource.
Our findings reveal Auffusion's superior capability in generating audios that accurately match textual descriptions.
- Score: 13.626626326590086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in diffusion models and large language models (LLMs) have
significantly propelled the field of AIGC. Text-to-Audio (TTA), a burgeoning
AIGC application designed to generate audio from natural language prompts, is
attracting increasing attention. However, existing TTA studies often struggle
with generation quality and text-audio alignment, especially for complex
textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I)
diffusion models, we introduce Auffusion, a TTA system adapting T2I model
frameworks to TTA task, by effectively leveraging their inherent generative
strengths and precise cross-modal alignment. Our objective and subjective
evaluations demonstrate that Auffusion surpasses previous TTA approaches using
limited data and computational resource. Furthermore, previous studies in T2I
recognizes the significant impact of encoder choice on cross-modal alignment,
like fine-grained details and object bindings, while similar evaluation is
lacking in prior TTA works. Through comprehensive ablation studies and
innovative cross-attention map visualizations, we provide insightful
assessments of text-audio alignment in TTA. Our findings reveal Auffusion's
superior capability in generating audios that accurately match textual
descriptions, which further demonstrated in several related tasks, such as
audio style transfer, inpainting and other manipulations. Our implementation
and demos are available at https://auffusion.github.io.
Related papers
- UALM: Unified Audio Language Model for Understanding, Generation and Reasoning [124.19449187588832]
Unified Audio Language Model (UALM) aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model.<n>We first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models.<n>We present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks.
arXiv Detail & Related papers (2025-10-13T22:55:01Z) - ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling [26.333732366091912]
We recast controllable TTA generation as a multi-task learning problem and introduce a progressive diffusion modeling approach, ControlAudio.<n>Our method adeptly fits distributions conditioned on more fine-grained information, including text, timing, and phoneme features, through a step-by-step strategy.<n>Experiments show that ControlAudio achieves state-of-the-art performance in terms of temporal accuracy and speech clarity, significantly outperforming existing methods on both objective and subjective evaluations.
arXiv Detail & Related papers (2025-10-10T00:19:41Z) - From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training [19.396162898865864]
Text-to-Talk (TtT) is a unified audio-text framework that integrates autoregressive (AR) text generation with non-autoregressive (NAR) audio diffusion in a single Transformer.<n>To support this hybrid generation paradigm, we design a modality-aware attention mechanism that enforces causal decoding for text.<n>During inference, TtT employs block-wise diffusion to synthesize audio in parallel while flexibly handling variable-length outputs.
arXiv Detail & Related papers (2025-09-24T12:44:26Z) - TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models [123.17643568298116]
We present TAViS, a novel framework that textbfcouples the knowledge of multimodal foundation models for cross-modal alignment.<n> effectively combining these models poses two key challenges: the difficulty in transferring the knowledge between SAM2 and ImageBind due to their different feature spaces, and the insufficiency of using only segmentation loss for supervision.<n>Our approach achieves superior performance on single-source, multi-source, semantic datasets, and excels in zero-shot settings.
arXiv Detail & Related papers (2025-06-13T03:19:47Z) - ETTA: Elucidating the Design Space of Text-to-Audio Models [33.831803213869605]
We study the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks.
We propose our best model dubbed Elucidated Text-To-Audio (ETTA)
ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data.
arXiv Detail & Related papers (2024-12-26T21:13:12Z) - Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling [13.757256085713571]
We present a novel two-stage prediction pipeline, named TAP-FM, proposed in this paper.
Specifically, we present a Multi-scale Contrastive Text-audio Pre-training protocol (MC-TAP), which hammers at acquiring richer insights via multi-granularity contrastive pre-training in an unsupervised manner.
Our framework demonstrates the ability to delve deep into both global and local text-audio semantic and acoustic representations.
arXiv Detail & Related papers (2024-04-14T08:56:19Z) - Text-to-Audio Generation Synchronized with Videos [44.848393652233796]
We introduce a groundbreaking benchmark for Text-to-Audio generation that aligns with Videos, named T2AV-Bench.
We also present a simple yet effective video-aligned TTA generation model, namely T2AV.
It employs a temporal multi-head attention transformer to extract and understand temporal nuances from video data, a feat amplified by our Audio-Visual ControlNet.
arXiv Detail & Related papers (2024-03-08T22:27:38Z) - Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z) - Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts input sound into a sound token, like an ordinary word, which can plug and play with existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other text and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2023-06-20T12:50:49Z) - Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
They often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z) - Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of an industry commonly-used streaming model, Transformer-Transducer (T-T)
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion
Models [65.18102159618631]
multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z) - Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource
Parallel Data [15.658471125219224]
Multimodal pre-training for audio-and-text has been proven to be effective and has significantly improved the performance of many downstream speech understanding tasks.
However, these state-of-the-art pre-training audio-text models work well only when provided with large amount of parallel audio-and-text data.
In this paper, we investigate whether it is possible to pre-train an audio-text model with low-resource parallel data.
arXiv Detail & Related papers (2022-04-10T10:25:37Z) - End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z) - CTAL: Pre-training Cross-modal Transformer for Audio-and-Language
Representations [20.239063010740853]
We present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language.
We observe significant improvements across various tasks, such as, emotion classification, sentiment analysis, and speaker verification.
arXiv Detail & Related papers (2021-09-01T04:18:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.