Audio Generation with Multiple Conditional Diffusion Model
- URL: http://arxiv.org/abs/2308.11940v4
- Date: Thu, 28 Dec 2023 10:59:54 GMT
- Title: Audio Generation with Multiple Conditional Diffusion Model
- Authors: Zhifang Guo, Jianguo Mao, Rui Tao, Long Yan, Kazushige Ouchi, Hong
Liu, Xiangdong Wang
- Abstract summary: We propose a novel model that enhances the controllability of existing pre-trained text-to-audio models.
This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio.
- Score: 15.250081484817324
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text-based audio generation models have limitations as they cannot encompass
all the information in audio, leading to restricted controllability when
relying solely on text. To address this issue, we propose a novel model that
enhances the controllability of existing pre-trained text-to-audio models by
incorporating additional conditions including content (timestamp) and style
(pitch contour and energy contour) as supplements to the text. This approach
achieves fine-grained control over the temporal order, pitch, and energy of
generated audio. To preserve the diversity of generation, we employ a trainable
control condition encoder that is enhanced by a large language model and a
trainable Fusion-Net to encode and fuse the additional conditions while keeping
the weights of the pre-trained text-to-audio model frozen. Due to the lack of
suitable datasets and evaluation metrics, we consolidate existing datasets into
a new dataset comprising the audio and corresponding conditions and use a
series of evaluation metrics to evaluate the controllability performance.
Experimental results demonstrate that our model successfully achieves
fine-grained control to accomplish controllable audio generation. Audio samples
and our dataset are publicly available at
https://conditionaudiogen.github.io/conditionaudiogen/
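The adapter-style architecture described in the abstract (frozen pre-trained text-to-audio model, trainable condition encoder, trainable Fusion-Net) can be sketched roughly as below. This is a minimal PyTorch-style illustration under assumed module names, shapes, and a hypothetical loader (`ConditionEncoder`, `FusionNet`, `load_pretrained_t2a`); it is not the authors' released code.

```python
# Sketch of the conditioning scheme: the pre-trained text-to-audio model stays
# frozen, while a trainable condition encoder and Fusion-Net inject timestamp,
# pitch, and energy conditions. All names and dimensions are illustrative.
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Encodes content (timestamps) and style (pitch/energy contours)."""
    def __init__(self, d_model=512):
        super().__init__()
        self.timestamp_proj = nn.Linear(2, d_model)  # (onset, offset) per event
        self.pitch_proj = nn.Linear(1, d_model)      # pitch contour, per frame
        self.energy_proj = nn.Linear(1, d_model)     # energy contour, per frame

    def forward(self, timestamps, pitch, energy):
        # Concatenate the encoded conditions along the sequence axis.
        return torch.cat([
            self.timestamp_proj(timestamps),
            self.pitch_proj(pitch.unsqueeze(-1)),
            self.energy_proj(energy.unsqueeze(-1)),
        ], dim=1)

class FusionNet(nn.Module):
    """Fuses condition features into the frozen backbone's features."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, backbone_feats, cond_feats):
        fused, _ = self.attn(backbone_feats, cond_feats, cond_feats)
        return backbone_feats + fused  # residual injection preserves the prior

# Usage idea: freeze the pre-trained text-to-audio model, train only the adapters.
# pretrained_t2a = load_pretrained_t2a(...)   # hypothetical loader
# for p in pretrained_t2a.parameters():
#     p.requires_grad = False
```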
Related papers
- Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization.
To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z)
- Read, Watch and Scream! Sound Generation from Text and Video [23.990569918960315]
We propose a novel video-and-text-to-sound generation method called ReWaS.
Our method estimates the structural information of audio from the video while receiving key content cues from a user prompt.
By separating the generative components of audio, it becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences.
arXiv Detail & Related papers (2024-07-08T01:59:17Z)
- ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec [50.273832905535485]
We present ControlSpeech, a text-to-speech (TTS) system capable of fully mimicking the speaker's voice and enabling arbitrary control and adjustment of speaking style.
Prior zero-shot TTS models and controllable TTS models could either only mimic a speaker's voice without further control and adjustment, or control speaking style without speaker-specific voice generation.
arXiv Detail & Related papers (2024-06-03T11:15:16Z)
- Audiobox: Unified Audio Generation with Natural Language Prompts [37.39834044113061]
This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities.
We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms.
Audiobox sets new benchmarks on speech and sound generation and unlocks new methods for generating audio with novel vocal and acoustic styles.
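For context on the flow-matching objective this summary mentions, here is a generic conditional flow-matching training step; the model interface, tensor shapes, and linear interpolation path are illustrative assumptions, not the Audiobox implementation.

```python
# Generic conditional flow-matching loss: regress the model's predicted velocity
# onto the velocity of a straight-line path between noise and data.
import torch

def flow_matching_loss(model, x1, cond):
    """x1: clean audio features (B, T, D); cond: prompt embedding (assumed)."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.size(0), 1, 1)               # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                     # point on the linear path
    target_velocity = x1 - x0                      # velocity of that path
    pred_velocity = model(xt, t.squeeze(), cond)   # model predicts the velocity
    return torch.mean((pred_velocity - target_velocity) ** 2)
```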
arXiv Detail & Related papers (2023-12-25T22:24:49Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Text-Driven Foley Sound Generation With Latent Diffusion Model [33.4636070590045]
Foley sound generation aims to synthesise the background sound for multimedia content.
We propose a diffusion model based system for Foley sound generation with text conditions.
arXiv Detail & Related papers (2023-06-17T14:16:24Z)
- AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation [89.63430567887718]
We propose a novel method utilizing latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings.
Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered as an adaptation layer between the audio and text representations.
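The adaptation-layer idea in this summary can be sketched generically: project a frozen audio encoder's embedding into the text-token embedding space and splice it into the prompt sequence. The module names and dimensions below are assumptions for illustration, not the paper's code.

```python
# Illustrative adapter: map an audio embedding to a single "audio token" that
# lives in the same embedding space as the text prompt tokens.
import torch
import torch.nn as nn

class AudioTokenAdapter(nn.Module):
    def __init__(self, audio_dim=768, text_embed_dim=768):
        super().__init__()
        # Small trainable projector; the audio encoder and the diffusion
        # backbone are assumed to stay frozen.
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, text_embed_dim),
            nn.GELU(),
            nn.Linear(text_embed_dim, text_embed_dim),
        )

    def forward(self, audio_embedding, prompt_token_embeddings):
        # audio_embedding: (B, audio_dim); prompt_token_embeddings: (B, L, D)
        audio_token = self.proj(audio_embedding).unsqueeze(1)   # (B, 1, D)
        # Append the audio token to the prompt token sequence.
        return torch.cat([prompt_token_embeddings, audio_token], dim=1)
```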
arXiv Detail & Related papers (2023-05-22T14:02:44Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
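As a rough illustration of the auto-regressive paradigm mentioned here, the sketch below samples discrete audio tokens one step at a time conditioned on a text embedding; the model interface, vocabulary, and sampling strategy are placeholders rather than the AudioGen implementation.

```python
# Generic auto-regressive sampling loop over discrete audio tokens.
import torch

@torch.no_grad()
def sample_audio_tokens(model, text_embedding, max_len=500, bos_id=0):
    tokens = torch.tensor([[bos_id]])                  # start-of-sequence token
    for _ in range(max_len):
        logits = model(tokens, text_embedding)         # (1, T, vocab), assumed
        probs = torch.softmax(logits[:, -1], dim=-1)   # next-token distribution
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens[:, 1:]                               # discrete audio codes
```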
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- Mix and Match: Learning-free Controllable Text Generation using Energy Language Models [33.97800741890231]
We propose Mix and Match LM, a global score-based alternative for controllable text generation.
We interpret the task of controllable generation as drawing samples from an energy-based model.
We use a Metropolis-Hastings sampling scheme to sample from this energy-based model.
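The Metropolis-Hastings scheme mentioned in this summary can be illustrated with a compact generic sampler; the energy function and the single-token replacement proposal below are placeholders, not the paper's exact proposal distribution.

```python
# Metropolis-Hastings sampling from an energy-based sequence model
# (lower energy = higher probability), with a symmetric token-swap proposal.
import math
import random

def metropolis_hastings(energy, init_tokens, vocab_size, steps=1000):
    x = list(init_tokens)
    e_x = energy(x)
    for _ in range(steps):
        proposal = list(x)
        pos = random.randrange(len(proposal))
        proposal[pos] = random.randrange(vocab_size)   # propose a token swap
        e_prop = energy(proposal)
        # Accept with probability min(1, exp(E(x) - E(x'))).
        if random.random() < math.exp(min(0.0, e_x - e_prop)):
            x, e_x = proposal, e_prop
    return x
```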
arXiv Detail & Related papers (2022-03-24T18:52:09Z)
- EdiTTS: Score-based Editing for Controllable Text-to-Speech [9.34612743192798]
EdiTTS is an off-the-shelf speech editing methodology based on score-based generative modeling for text-to-speech synthesis.
We apply coarse yet deliberate perturbations in the Gaussian prior space to induce desired behavior from the diffusion model.
Listening tests demonstrate that EdiTTS is capable of reliably generating natural-sounding audio that satisfies user-imposed requirements.
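The prior-perturbation idea summarized here can be sketched as follows, assuming a generic reverse-diffusion sampler interface and mel-spectrogram latents; this is an illustration of the mechanism, not the released EdiTTS code.

```python
# Perturb the Gaussian prior only inside the region selected for editing,
# then denoise with the unmodified reverse diffusion process.
import torch

def edit_with_prior_perturbation(reverse_diffusion, text_cond, edit_mask, scale=1.0):
    # edit_mask: (T,) boolean tensor marking frames the user wants to change.
    z = torch.randn(1, edit_mask.numel(), 80)          # prior latent (mel frames)
    perturbation = scale * torch.randn_like(z)         # coarse, deliberate nudge
    z = torch.where(edit_mask.view(1, -1, 1), z + perturbation, z)
    return reverse_diffusion(z, text_cond)             # denoise as usual
```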
arXiv Detail & Related papers (2021-10-06T08:51:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.