On The Open Prompt Challenge In Conditional Audio Generation
- URL: http://arxiv.org/abs/2311.00897v1
- Date: Wed, 1 Nov 2023 23:33:25 GMT
- Title: On The Open Prompt Challenge In Conditional Audio Generation
- Authors: Ernie Chang, Sidd Srinivasan, Mahi Luthra, Pin-Jie Lin, Varun
Nagaraja, Forrest Iandola, Zechun Liu, Zhaoheng Ni, Changsheng Zhao, Yangyang
Shi and Vikas Chandra
- Abstract summary: Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio samples and hand-annotated text.
We treat TTA models as a "blackbox" and address the user prompt challenge with two key insights.
We propose utilizing text-audio alignment as feedback signals via margin ranking learning for audio improvements.
- Score: 25.178010153697976
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-audio generation (TTA) produces audio from a text description,
learning from pairs of audio samples and hand-annotated text. However,
commercializing audio generation is challenging as user-input prompts are often
under-specified when compared to text descriptions used to train TTA models. In
this work, we treat TTA models as a ``blackbox'' and address the user prompt
challenge with two key insights: (1) User prompts are generally
under-specified, leading to a large alignment gap between user prompts and
training prompts. (2) There is a distribution of audio descriptions for which
TTA models are better at generating higher quality audio, which we refer to as
"audionese". To this end, we rewrite prompts with instruction-tuned models
and propose utilizing text-audio alignment as feedback signals via margin
ranking learning for audio improvements. On both objective and subjective human
evaluations, we observed marked improvements in both text-audio alignment and
music audio quality.
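For concreteness, the two ideas can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the authors' released code: rewrite_prompt stands in for a call to an instruction-tuned LLM, clap_score stands in for any CLAP-style text-audio alignment model, and the ranking loss simply prefers audio generated from rewritten prompts over audio from raw prompts.

```python
import torch
import torch.nn.functional as F

def rewrite_prompt(user_prompt: str) -> str:
    # Placeholder: in practice, ask an instruction-tuned LLM to expand
    # an under-specified user prompt into detailed "audionese".
    return user_prompt + ", high-quality recording, rich in detail"

def clap_score(text: str, audio: torch.Tensor) -> torch.Tensor:
    # Placeholder: cosine similarity between CLAP text and audio
    # embeddings, returning a scalar alignment score.
    raise NotImplementedError

def alignment_ranking_loss(score_rewritten: torch.Tensor,
                           score_raw: torch.Tensor,
                           margin: float = 0.1) -> torch.Tensor:
    # Margin ranking: require the alignment score of audio generated
    # from the rewritten prompt to exceed that of the raw prompt by
    # at least `margin`. target=1 means "first input ranks higher".
    target = torch.ones_like(score_rewritten)
    return F.margin_ranking_loss(score_rewritten, score_raw, target,
                                 margin=margin)

# Dummy scores for illustration; in practice they come from clap_score.
loss = alignment_ranking_loss(torch.tensor([0.62]), torch.tensor([0.41]))
print(loss.item())  # 0.0, since the 0.21 gap already exceeds the margin
```

The margin hyperparameter and the choice of alignment scorer are assumptions here; the paper's contribution is the feedback loop itself, not any particular scorer.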
Related papers
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generating audio from text prompts is an important part of content creation in the music and film industries.
We hypothesize that focusing on these aspects of audio generation can improve performance when training data is limited.
We synthetically create a preference dataset where each prompt has a winner audio output and several loser audio outputs for the diffusion model to learn from; a generic sketch of this preference loss appears after the list below.
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
- Retrieval-Augmented Text-to-Audio Generation [36.328134891428085]
We show that state-of-the-art models such as AudioLDM are biased in their generation performance.
We propose a simple retrieval-augmented approach for TTA models.
We show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types.
arXiv Detail & Related papers (2023-09-14T22:35:39Z)
- IteraTTA: An interface for exploring both text prompts and audio priors in generating music with text-to-audio models [40.798454815430034]
IteraTTA is designed to aid users in refining text prompts and selecting favorable audio priors from the generated audios.
Our implementation and discussions highlight design considerations that are specifically required for text-to-audio models.
arXiv Detail & Related papers (2023-07-24T11:00:01Z)
- Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
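The Tango 2 entry above relies on Direct Preference Optimization (DPO) over (winner, loser) audio pairs. Below is a minimal, generic DPO loss sketch in PyTorch, under the assumption that approximate log-likelihoods are available for each audio sample; Tango 2's actual diffusion formulation substitutes denoising losses for these terms, so treat this purely as an illustration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Generic DPO loss over (winner, loser) pairs.

    logp_*     : log-likelihoods under the model being tuned
    ref_logp_* : log-likelihoods under the frozen reference model
    """
    # Implicit rewards: how much more the tuned model prefers each
    # sample than the frozen reference does.
    reward_w = logp_w - ref_logp_w
    reward_l = logp_l - ref_logp_l
    # Maximize the log-probability that the winner out-scores the loser.
    return -F.logsigmoid(beta * (reward_w - reward_l)).mean()

# Dummy batch of two preference pairs for illustration.
loss = dpo_loss(torch.tensor([-1.0, -2.0]), torch.tensor([-3.0, -2.5]),
                torch.tensor([-1.5, -2.2]), torch.tensor([-2.8, -2.4]))
print(loss.item())
```

The beta temperature controls how far the tuned model may drift from the reference; its value here is an arbitrary placeholder.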