Related papers: Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

URL: http://arxiv.org/abs/2305.18474v1
Date: Mon, 29 May 2023 10:41:28 GMT
Title: Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation
Authors: Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, Zhou Zhao
Abstract summary: Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks. They often suffer from common issues such as semantic misalignment and poor temporal consistency. We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
Score: 72.7915031238824
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks, but they often suffer from common issues such as semantic misalignment and poor temporal consistency due to limited natural language understanding and data scarcity. Additionally, 2D spatial structures widely used in T2A works lead to unsatisfactory audio quality when generating variable-length audio samples since they do not adequately prioritize temporal information. To address these challenges, we propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio. Our approach includes several techniques to improve semantic alignment and temporal consistency: Firstly, we use pre-trained large language models (LLMs) to parse the text into structured <event & order> pairs for better temporal information capture. We also introduce another structured-text encoder to aid in learning semantic alignment during the diffusion denoising process. To improve the performance of variable length generation and enhance the temporal information extraction, we design a feed-forward Transformer-based diffusion denoiser. Finally, we use LLMs to augment and transform a large amount of audio-label data into audio-text datasets to alleviate the problem of scarcity of temporal data. Extensive experiments show that our method outperforms baseline models in both objective and subjective metrics, and achieves significant gains in temporal information understanding, semantic consistency, and sound quality.

Related papers

From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
TA-V2A: Textually Assisted Video-to-Audio Generation [9.957113952852051]
Video-to-audio (V2A) generation has emerged as a key area with promising applications in multimedia editing, augmented reality, and automated content creation. We present TA-V2A, a method that integrates language, audio, and video features to improve semantic representation in latent space.
arXiv Detail & Related papers (2025-03-12T06:43:24Z)
DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model [48.57556892287629]
We propose a text-to-spatial-audio (TTSA) generation framework named DualSpec. It first trains variational autoencoders (VAEs) for extracting the latent acoustic representations from sound event audio. Finally, it trains a diffusion model from the latent acoustic representations and text features for the spatial audio generation.
arXiv Detail & Related papers (2025-02-26T09:01:59Z)
OMCAT: Omni Context Aware Transformer [27.674943980306423]
OCTAV is a novel dataset designed to capture event transitions across audio and video. OMCAT is a powerful model that leverages RoTE to enhance temporal grounding and computational efficiency in time-anchored tasks. Our model demonstrates state-of-the-art performance on Audio-Visual Question Answering (AVQA) tasks and the OCTAV benchmark, showcasing significant gains in temporal reasoning and cross-modal alignment.
arXiv Detail & Related papers (2024-10-15T23:16:28Z)
Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z)
Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of such processes in the music and film industry. Our hypothesis is focusing on how these aspects of audio generation could improve audio generation performance in the presence of limited data. We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation [13.626626326590086]
We introduce Auffusion, a Text-to-Audio (TTA) system that adapts Text-to-Image (T2I) model frameworks to the TTA task. Our evaluations demonstrate that Auffusion surpasses previous TTA approaches using limited data and computational resources. Our findings reveal Auffusion's superior capability in generating audio that accurately matches textual descriptions.
arXiv Detail & Related papers (2024-01-02T05:42:14Z)
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
Two-Pass Low Latency End-to-End Spoken Language Understanding [36.81762807197944]
We incorporated language models pre-trained on unlabeled text data inside E2E-SLU frameworks to build strong semantic representations. We developed a 2-pass SLU system that makes low-latency predictions in the first pass using acoustic information from a few seconds of the audio. Our code and models are publicly available as part of the ESPnet-SLU toolkit.
arXiv Detail & Related papers (2022-07-14T05:50:16Z)
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation. We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices. TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.