Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation
- URL: http://arxiv.org/abs/2305.18474v1
- Date: Mon, 29 May 2023 10:41:28 GMT
- Title: Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation
- Authors: Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen
Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, Zhou Zhao
- Abstract summary: Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
They often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
- Score: 72.7915031238824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large diffusion models have been successful in text-to-audio (T2A) synthesis
tasks, but they often suffer from common issues such as semantic misalignment
and poor temporal consistency due to limited natural language understanding and
data scarcity. Additionally, 2D spatial structures widely used in T2A works
lead to unsatisfactory audio quality when generating variable-length audio
samples since they do not adequately prioritize temporal information. To
address these challenges, we propose Make-an-Audio 2, a latent diffusion-based
T2A method that builds on the success of Make-an-Audio. Our approach includes
several techniques to improve semantic alignment and temporal consistency:
Firstly, we use pre-trained large language models (LLMs) to parse the text into
structured <event & order> pairs for better temporal information capture. We
also introduce another structured-text encoder to aid in learning semantic
alignment during the diffusion denoising process. To improve the performance of
variable length generation and enhance the temporal information extraction, we
design a feed-forward Transformer-based diffusion denoiser. Finally, we use
LLMs to augment and transform a large amount of audio-label data into
audio-text datasets to alleviate the problem of scarcity of temporal data.
Extensive experiments show that our method outperforms baseline models in both
objective and subjective metrics, and achieves significant gains in temporal
information understanding, semantic consistency, and sound quality.
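As a minimal illustration of the first technique, the sketch below shows how a pre-trained LLM might be prompted to parse a free-form caption into structured <event & order> pairs. The prompt wording, the order tags, and the `query_llm` callable are assumptions made for illustration, not the authors' exact implementation.

```python
# Minimal sketch: prompting an LLM to turn a free-form audio caption
# into structured <event & order> pairs, as the abstract describes.
# Prompt text, tags, and the `query_llm` helper are illustrative
# assumptions, not the paper's actual implementation.

PARSE_PROMPT = """Extract the sound events from the audio caption below
and give each one a temporal order tag (start / mid / end / all).
Answer with one "<event>@<order>" pair per line.

Caption: {caption}
Pairs:"""

def parse_caption(caption: str, query_llm) -> list[tuple[str, str]]:
    """Return (event, order) pairs for one caption. `query_llm` is any
    callable that sends a prompt to a pre-trained LLM and returns its
    text completion (a placeholder, not a specific API)."""
    reply = query_llm(PARSE_PROMPT.format(caption=caption))
    pairs = []
    for line in reply.strip().splitlines():
        if "@" in line:
            event, order = line.rsplit("@", 1)
            pairs.append((event.strip(), order.strip()))
    return pairs

# e.g. "a dog barks, then a car drives by" might parse to
# [("a dog barks", "start"), ("a car drives by", "end")]
```

Pairs of this kind can then be serialized back into a structured caption and fed to the structured-text encoder mentioned above.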
Related papers
- OMCAT: Omni Context Aware Transformer [27.674943980306423]
OCTAV is a novel dataset designed to capture event transitions across audio and video.
OMCAT is a powerful model that leverages RoTE to enhance temporal grounding and computational efficiency in time-anchored tasks.
Our model demonstrates state-of-the-art performance on Audio-Visual Question Answering (AVQA) tasks and the OCTAV benchmark, showcasing significant gains in temporal reasoning and cross-modal alignment.
arXiv Detail & Related papers (2024-10-15T23:16:28Z)
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
- Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
To keep the synthetic audio acoustically consistent with the target data, we align the generations of the T2A model with the small-scale dataset using preference optimization.
To obtain diverse and meaningful captions for generation, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z)
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of content creation in the music and film industry.
Our hypothesis is that focusing on aspects such as the presence of the prompted concepts or events and their temporal ordering could improve audio generation performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
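To make this concrete, one record in such a preference dataset could look like the following minimal sketch; the field names and file paths are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class PreferenceExample:
    """One record in a synthetic preference dataset of the kind
    described above: a prompt paired with one preferred ("winner")
    audio and several rejected ("loser") audios. Field names are
    illustrative assumptions."""
    prompt: str                 # text description of the target audio
    winner: str                 # path to the preferred generated audio
    losers: list[str] = field(default_factory=list)  # rejected outputs

# Hypothetical example: the losers could be generations that drop an
# event or get the temporal order wrong relative to the prompt.
example = PreferenceExample(
    prompt="a dog barks, followed by a door slamming",
    winner="clips/0_winner.wav",
    losers=["clips/0_loser_1.wav", "clips/0_loser_2.wav"],
)
```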
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
- Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation [13.626626326590086]
We introduce Auffusion, a Text-to-Audio (TTA) system that adapts Text-to-Image (T2I) model frameworks to the TTA task.
Our evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources.
Our findings reveal Auffusion's superior capability in generating audio that accurately matches textual descriptions.
arXiv Detail & Related papers (2024-01-02T05:42:14Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- Two-Pass Low Latency End-to-End Spoken Language Understanding [36.81762807197944]
We incorporated language models pre-trained on unlabeled text data inside E2E-SLU frameworks to build strong semantic representations.
We developed a 2-pass SLU system that makes a low-latency prediction in the first pass using acoustic information from the first few seconds of the audio.
Our code and models are publicly available as part of the ESPnet-SLU toolkit.
arXiv Detail & Related papers (2022-07-14T05:50:16Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
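The repeated mask-and-predict step is the standard mask-predict decoding loop from non-autoregressive generation; below is a minimal sketch, in which `model`, the sequence length, and the iteration schedule are placeholders rather than TranSpeech's actual code.

```python
MASK = -1  # sentinel id for a masked unit position (illustrative)

def mask_predict(model, length: int, iterations: int = 10):
    """Sketch of iterative mask-predict decoding over discrete speech
    units, the non-autoregressive scheme described above. `model` is a
    placeholder: any callable that takes the partially masked sequence
    and returns one (unit_id, confidence) pair per position."""
    units = [MASK] * length            # start fully masked
    scores = [0.0] * length
    for t in range(iterations):
        preds = model(units)           # parallel prediction of all slots
        for i in range(length):
            if units[i] == MASK:       # keep already-committed units
                units[i], scores[i] = preds[i]
        # Re-mask the lowest-confidence units, fewer on each iteration.
        n_mask = length * (iterations - 1 - t) // iterations
        if n_mask == 0:
            break
        for i in sorted(range(length), key=lambda j: scores[j])[:n_mask]:
            units[i] = MASK
    return units
```

Because all positions are predicted in parallel, the number of model calls is the fixed iteration count rather than the output length, which is the source of the latency gain reported above.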
arXiv Detail & Related papers (2022-05-25T06:34:14Z)