IteraTTA: An interface for exploring both text prompts and audio priors in generating music with text-to-audio models
- URL: http://arxiv.org/abs/2307.13005v1
- Date: Mon, 24 Jul 2023 11:00:01 GMT
- Title: IteraTTA: An interface for exploring both text prompts and audio priors in generating music with text-to-audio models
- Authors: Hiromu Yakura and Masataka Goto
- Abstract summary: IteraTTA is designed to aid users in refining text prompts and selecting favorable audio priors from the generated audios.
Our implementation and discussions highlight design considerations that are specifically required for text-to-audio models.
- Score: 40.798454815430034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent text-to-audio generation techniques have the potential to allow novice
users to freely generate music audio. Even if they do not have musical
knowledge, such as about chord progressions and instruments, users can try
various text prompts to generate audio. However, compared to the image domain,
gaining a clear understanding of the space of possible music audios is
difficult because users cannot listen to the variations of the generated audios
simultaneously. We therefore facilitate users in exploring not only text
prompts but also audio priors that constrain the text-to-audio music generation
process. This dual-sided exploration enables users to discern the impact of
different text prompts and audio priors on the generation results through
iterative comparison of them. Our developed interface, IteraTTA, is
specifically designed to aid users in refining text prompts and selecting
favorable audio priors from the generated audios. With this, users can
progressively reach their loosely-specified goals while understanding and
exploring the space of possible results. Our implementation and discussions
highlight design considerations that are specifically required for
text-to-audio models and how interaction techniques can contribute to their
effectiveness.
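As an illustration of this dual-sided exploration, the sketch below runs rounds of generation, lets a (stubbed) user pick a favorite clip, and feeds that clip back as the audio prior constraining the next round. The model class, function names, and selection step are hypothetical stand-ins for readability, not the actual IteraTTA implementation.

```python
# Minimal, hypothetical sketch of IteraTTA-style dual-sided exploration:
# generate several clips per text prompt, pick a favorite, and reuse it as
# the audio prior for the next round. The model below is a stand-in stub.
from dataclasses import dataclass
from typing import List, Optional
import random


@dataclass
class Clip:
    prompt: str
    audio: List[float]               # placeholder waveform
    prior: Optional[List[float]]     # audio prior used for this clip, if any


class StubTextToAudioModel:
    """Stand-in for a text-to-audio model that accepts an optional audio prior."""
    def generate(self, prompt: str, audio_prior: Optional[List[float]] = None) -> List[float]:
        base = audio_prior or [0.0] * 8
        return [x + random.uniform(-0.1, 0.1) for x in base]


def explore(model, prompts: List[str], rounds: int = 3, variations: int = 4) -> List[Clip]:
    prior: Optional[List[float]] = None
    history: List[Clip] = []
    for _ in range(rounds):
        candidates = [
            Clip(p, model.generate(p, audio_prior=prior), prior)
            for p in prompts
            for _ in range(variations)
        ]
        history.extend(candidates)
        # IteraTTA lets the user listen and compare; a random pick stands in here.
        favorite = random.choice(candidates)
        prior = favorite.audio       # the chosen clip constrains the next round
    return history


clips = explore(StubTextToAudioModel(), ["calm piano", "calm piano with strings"])
print(len(clips))  # 3 rounds x 2 prompts x 4 variations = 24 clips
```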
Related papers
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
Generation of audio from text prompts is an important aspect of content-creation processes in the music and film industry.
Our hypothesis is that focusing on these aspects of audio generation could improve generation performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
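As a hedged illustration of this preference setup (not Tango 2's released code), the sketch below builds a winner/loser record and evaluates the generic DPO objective on per-example log-likelihoods; Tango 2 adapts this idea to diffusion-based audio generation, whose derivation is omitted here, and all names are placeholders.

```python
# Hypothetical sketch of a preference dataset record and the generic DPO
# objective on (winner, loser) log-likelihoods; diffusion-specific details
# used by Tango 2 are omitted, and all names are placeholders.
import math
from dataclasses import dataclass
from typing import List


@dataclass
class PreferenceExample:
    prompt: str
    winner_audio: str          # path or id of the preferred generation
    loser_audios: List[str]    # less-preferred generations for the same prompt


def dpo_loss(logp_w_policy: float, logp_l_policy: float,
             logp_w_ref: float, logp_l_ref: float, beta: float = 0.1) -> float:
    """-log sigmoid(beta * [(logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref)])."""
    margin = (logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))


example = PreferenceExample(
    prompt="a dog barking followed by thunder",
    winner_audio="gen_003.wav",
    loser_audios=["gen_001.wav", "gen_002.wav"],
)
print(dpo_loss(-10.2, -11.5, -10.4, -11.0))  # smaller when the policy favors the winner
```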
arXiv Detail & Related papers (2024-04-15T17:31:22Z)
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models [98.34889301515412]
We develop the Qwen-Audio model and address the limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types.
Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning.
We further develop Qwen-Audio-Chat, which accepts various audio and text inputs, enabling multi-turn dialogues and supporting various audio-centric scenarios.
arXiv Detail & Related papers (2023-11-14T05:34:50Z)
- On The Open Prompt Challenge In Conditional Audio Generation [25.178010153697976]
Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio samples and hand-annotated text.
We treat TTA models as a "black box" and address the user prompt challenge with two key insights.
We propose utilizing text-audio alignment as feedback signals via margin ranking learning for audio improvements.
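The margin-ranking idea can be sketched with PyTorch's MarginRankingLoss; text-audio alignment scores are assumed here to come from a CLAP-like dual encoder, and this is an illustrative reading of the abstract rather than the paper's exact training recipe.

```python
# Hedged sketch: using text-audio alignment scores as a ranking feedback signal.
# Scores are example values (e.g. cosine similarities from a CLAP-like encoder);
# this illustrates margin ranking learning, not the paper's exact pipeline.
import torch

ranking_loss = torch.nn.MarginRankingLoss(margin=0.1)

# Example alignment scores between a text prompt and two audios generated for
# it: one judged better aligned, one judged worse.
score_better = torch.tensor([0.62, 0.55, 0.71], requires_grad=True)
score_worse = torch.tensor([0.48, 0.57, 0.60], requires_grad=True)

# target = 1 means the first input should be ranked higher than the second.
target = torch.ones_like(score_better)
loss = ranking_loss(score_better, score_worse, target)
loss.backward()
print(loss.item())
```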
arXiv Detail & Related papers (2023-11-01T23:33:25Z)
- WavJourney: Compositional Audio Creation with Large Language Models [38.39551216587242]
We present WavJourney, a novel framework that leverages Large Language Models to connect various audio models for audio creation.
WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions.
We show that WavJourney is capable of synthesizing realistic audio aligned with textually-described semantic, spatial and temporal conditions.
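The compositional idea can be pictured as a structured audio script, such as an LLM might derive from a story description, whose items are dispatched to type-specific generators. The sketch below is a hypothetical illustration of that pattern with stubbed generators, not WavJourney's actual pipeline or prompt format.

```python
# Hypothetical sketch of compositional audio creation: a structured "audio
# script" is dispatched to type-specific generators. Stubs only; not
# WavJourney's actual pipeline.
from dataclasses import dataclass
from typing import List


@dataclass
class ScriptItem:
    kind: str          # "speech", "music", or "sfx"
    description: str
    duration_s: float


def generate_speech(text: str, duration_s: float) -> str:
    return f"<speech:{text}:{duration_s}s>"


def generate_music(text: str, duration_s: float) -> str:
    return f"<music:{text}:{duration_s}s>"


def generate_sfx(text: str, duration_s: float) -> str:
    return f"<sfx:{text}:{duration_s}s>"


GENERATORS = {"speech": generate_speech, "music": generate_music, "sfx": generate_sfx}


def render(script: List[ScriptItem]) -> List[str]:
    """Dispatch each item to the matching generator, preserving temporal order."""
    return [GENERATORS[item.kind](item.description, item.duration_s) for item in script]


# A script an LLM might derive from "a calm night interrupted by thunder".
script = [
    ScriptItem("music", "soft ambient pad", 10.0),
    ScriptItem("sfx", "distant thunder rumble", 3.0),
    ScriptItem("speech", "Narrator: the storm was getting closer.", 4.0),
]
print(render(script))
```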
arXiv Detail & Related papers (2023-07-26T17:54:04Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- Contrastive Audio-Language Learning for Music [13.699088044513562]
MusCALL is a framework for Music Contrastive Audio-Language Learning.
Our approach consists of a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences.
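The dual-encoder alignment described for MusCALL can be illustrated with a standard CLIP/CLAP-style contrastive (InfoNCE) loss over a batch of paired audio and text embeddings; the code below is a generic sketch under that assumption, not MusCALL's released implementation.

```python
# Generic sketch of a dual-encoder contrastive (InfoNCE) objective over a
# batch of paired audio/text embeddings, in the spirit of audio-language
# alignment; not the paper's actual implementation.
import torch
import torch.nn.functional as F


def contrastive_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so dot products are cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature   # (batch, batch)
    targets = torch.arange(logits.size(0))            # matching pairs on the diagonal
    # Symmetric cross-entropy over audio-to-text and text-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


audio_emb = torch.randn(8, 512)   # batch of 8 audio clip embeddings
text_emb = torch.randn(8, 512)    # embeddings of their paired descriptions
print(contrastive_loss(audio_emb, text_emb).item())
```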
arXiv Detail & Related papers (2022-08-25T16:55:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.