Auffusion: Leveraging the Power of Diffusion and Large Language Models
for Text-to-Audio Generation
- URL: http://arxiv.org/abs/2401.01044v1
- Date: Tue, 2 Jan 2024 05:42:14 GMT
- Title: Auffusion: Leveraging the Power of Diffusion and Large Language Models
for Text-to-Audio Generation
- Authors: Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li
- Abstract summary: We introduce Auffusion, a Text-to-Audio (TTA) system that adapts Text-to-Image (T2I) model frameworks to the TTA task.
Our evaluations demonstrate that Auffusion surpasses previous TTA approaches using limited data and computational resources.
Our findings reveal Auffusion's superior capability in generating audio that accurately matches textual descriptions.
- Score: 13.626626326590086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in diffusion models and large language models (LLMs) have
significantly propelled the field of AIGC. Text-to-Audio (TTA), a burgeoning
AIGC application designed to generate audio from natural language prompts, is
attracting increasing attention. However, existing TTA studies often struggle
with generation quality and text-audio alignment, especially for complex
textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I)
diffusion models, we introduce Auffusion, a TTA system adapting T2I model
frameworks to the TTA task by effectively leveraging their inherent generative
strengths and precise cross-modal alignment. Our objective and subjective
evaluations demonstrate that Auffusion surpasses previous TTA approaches using
limited data and computational resources. Furthermore, previous T2I studies
recognize the significant impact of encoder choice on cross-modal alignment,
such as fine-grained details and object binding, whereas comparable evaluations
are lacking in prior TTA work. Through comprehensive ablation studies and
innovative cross-attention map visualizations, we provide insightful
assessments of text-audio alignment in TTA. Our findings reveal Auffusion's
superior capability in generating audio that accurately matches textual
descriptions, which is further demonstrated in several related tasks such as
audio style transfer, inpainting, and other manipulations. Our implementation
and demos are available at https://auffusion.github.io.
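For a concrete picture of the cross-attention visualizations mentioned in the abstract, the following minimal PyTorch sketch computes head-averaged cross-attention maps between spectrogram latent patches and text tokens. The shapes, random projections, and function names are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of cross-attention between spectrogram latents and text tokens,
# as one might visualize text-audio alignment. Shapes and names are illustrative
# assumptions, not Auffusion's released code.
import torch
import torch.nn.functional as F

def cross_attention_maps(latent_tokens, text_tokens, n_heads=8):
    """latent_tokens: (B, N_latent, D); text_tokens: (B, N_text, D).
    Returns per-head attention maps of shape (B, n_heads, N_latent, N_text)."""
    B, Nq, D = latent_tokens.shape
    Nk = text_tokens.shape[1]
    d_head = D // n_heads
    # In a real UNet these projections are learned; random here for illustration.
    w_q = torch.nn.Linear(D, D, bias=False)
    w_k = torch.nn.Linear(D, D, bias=False)
    q = w_q(latent_tokens).view(B, Nq, n_heads, d_head).transpose(1, 2)
    k = w_k(text_tokens).view(B, Nk, n_heads, d_head).transpose(1, 2)
    attn = F.softmax(q @ k.transpose(-1, -2) / d_head**0.5, dim=-1)
    return attn

# Example: 1 sample, 64 time-frequency latent patches, 12 text tokens, dim 256.
maps = cross_attention_maps(torch.randn(1, 64, 256), torch.randn(1, 12, 256))
token_map = maps.mean(dim=1)[0, :, 3]   # head-averaged map for text token #3
print(token_map.reshape(8, 8).shape)    # reshape back to an 8x8 latent grid
```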
Related papers
- Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling [13.757256085713571]
We present TAP-FM, a novel two-stage prediction pipeline for TTS frontend modeling.
Specifically, we present a Multi-scale Contrastive Text-audio Pre-training protocol (MC-TAP), which aims to acquire richer insights via multi-granularity contrastive pre-training in an unsupervised manner.
Our framework demonstrates the ability to delve deep into both global and local text-audio semantic and acoustic representations.
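A symmetric InfoNCE objective is the usual building block for this kind of contrastive text-audio pre-training; the sketch below shows only the global (utterance-level) case and is an illustration under that assumption, not the TAP-FM/MC-TAP implementation.

```python
# Sketch of a symmetric contrastive (InfoNCE) loss between pooled text and audio
# embeddings; an illustration only, not the TAP-FM/MC-TAP code.
import torch
import torch.nn.functional as F

def clip_style_loss(text_emb, audio_emb, temperature=0.07):
    """text_emb, audio_emb: (B, D) pooled embeddings of paired text/audio."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))            # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = clip_style_loss(torch.randn(16, 512), torch.randn(16, 512))
```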
arXiv Detail & Related papers (2024-04-14T08:56:19Z)
- Text-to-Audio Generation Synchronized with Videos [44.848393652233796]
We introduce a groundbreaking benchmark for video-aligned Text-to-Audio generation, named T2AV-Bench.
We also present a simple yet effective video-aligned TTA generation model, namely T2AV.
It employs a temporal multi-head attention transformer to extract and understand temporal nuances from video data, a feat amplified by our Audio-Visual ControlNet.
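As a rough illustration of temporal multi-head attention over per-frame video features, a minimal sketch follows; the dimensions, names, and use of plain self-attention are assumptions, not the T2AV architecture.

```python
# Rough sketch: temporal self-attention over per-frame video features, producing a
# conditioning sequence for audio generation. Illustrative assumptions only.
import torch
import torch.nn as nn

frames = torch.randn(1, 32, 512)  # (batch, num_frames, feature_dim)
temporal_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
video_context, _ = temporal_attn(frames, frames, frames)  # attend across time
print(video_context.shape)  # torch.Size([1, 32, 512])
```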
arXiv Detail & Related papers (2024-03-08T22:27:38Z)
- Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z)
- Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts the input sound into a sound token that, like an ordinary word, can be plugged into existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other text and sound-guided state-of-the-art methods.
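The "sound token" idea can be sketched as projecting an audio-encoder embedding into the text token-embedding space and splicing it into the prompt sequence; everything below (dimensions, adapter, insertion position) is an illustrative assumption rather than the AAI code.

```python
# Sketch of a "sound token": project an audio clip embedding into the text
# encoder's token-embedding space so an off-the-shelf T2I model can condition on it.
# Names and dimensions are assumptions, not AAI's implementation.
import torch
import torch.nn as nn

audio_emb = torch.randn(1, 1024)          # embedding from a pretrained audio encoder
prompt_tokens = torch.randn(1, 77, 768)   # CLIP-like text token embeddings

adapter = nn.Linear(1024, 768)                 # align audio space to text-token space
sound_token = adapter(audio_emb).unsqueeze(1)  # (1, 1, 768): one pseudo-word

# Splice the sound token in after the first position; the combined sequence then
# feeds the T2I model's cross-attention in place of a plain text prompt.
conditioned = torch.cat([prompt_tokens[:, :1], sound_token, prompt_tokens[:, 1:-1]], dim=1)
print(conditioned.shape)  # torch.Size([1, 77, 768])
```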
arXiv Detail & Related papers (2023-06-20T12:50:49Z)
- Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
They often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
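A generic latent-diffusion training step on audio latents is sketched below; the noise schedule, tensor shapes, and dummy model are assumptions for illustration, not Make-An-Audio 2's implementation.

```python
# Minimal latent-diffusion training step on audio (e.g. mel-spectrogram) latents:
# add noise at a random timestep and train the model to predict that noise.
# A generic illustration of latent-diffusion T2A, not Make-An-Audio 2's code.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(model, latents, text_cond):
    """latents: (B, C, F_bins, frames) audio latents; text_cond: conditioning embeddings."""
    t = torch.randint(0, T, (latents.size(0),))
    noise = torch.randn_like(latents)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise
    return F.mse_loss(model(noisy, t, text_cond), noise)

dummy_model = lambda x, t, c: torch.zeros_like(x)  # stand-in for a conditional UNet
loss = diffusion_loss(dummy_model, torch.randn(4, 8, 16, 64), torch.randn(4, 77, 768))
```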
arXiv Detail & Related papers (2023-05-29T10:41:28Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of a streaming model commonly used in industry, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has achieved milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data [15.658471125219224]
Multimodal pre-training for audio-and-text has been proven to be effective and has significantly improved the performance of many downstream speech understanding tasks.
However, these state-of-the-art pre-trained audio-text models work well only when provided with large amounts of parallel audio-and-text data.
In this paper, we investigate whether it is possible to pre-train an audio-text model with low-resource parallel data.
arXiv Detail & Related papers (2022-04-10T10:25:37Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations [20.239063010740853]
We present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language.
We observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification.
arXiv Detail & Related papers (2021-09-01T04:18:19Z)