Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model
- URL: http://arxiv.org/abs/2304.13731v2
- Date: Mon, 29 May 2023 12:09:08 GMT
- Title: Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model
- Authors: Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Soujanya Poria
- Abstract summary: Large language models (LLMs) enable interesting capabilities such as instruction- and chain-of-thought-based fine-tuning.
We adopt the instruction-tuned LLM Flan-T5 as the text encoder for text-to-audio (TTA) generation.
Our approach, TANGO, outperforms the state-of-the-art AudioLDM on most metrics and remains comparable on the rest on the AudioCaps test set.
- Score: 23.058939018350603
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The immense scale of recent large language models (LLMs) enables
many interesting properties, such as instruction- and chain-of-thought-based
fine-tuning, which have significantly improved zero- and few-shot performance
on many natural language processing (NLP) tasks. Inspired by these successes,
we adopt the instruction-tuned LLM Flan-T5 as the text encoder for
text-to-audio (TTA) generation, a task whose goal is to generate audio from a
textual description. Prior works on TTA either pre-trained a joint text-audio
encoder or used a non-instruction-tuned model such as T5. Consequently, our
latent diffusion model (LDM)-based approach, TANGO, outperforms the
state-of-the-art AudioLDM on most metrics and remains comparable on the rest
on the AudioCaps test set, despite training the LDM on a 63-times-smaller
dataset and keeping the text encoder frozen. This improvement may also be
attributed to the adoption of audio pressure level-based sound mixing for
training-set augmentation, whereas prior methods use random mixing.
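Two of the abstract's design choices are concrete enough to sketch. First, the frozen instruction-tuned text encoder: a minimal illustration (not the authors' released code) of extracting frozen Flan-T5 embeddings with the Hugging Face transformers library, which would then condition the LDM via cross-attention:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Load only the encoder stack of Flan-T5; TANGO keeps it frozen.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-large")
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False  # no gradients flow into the text encoder

@torch.no_grad()
def encode_caption(caption: str) -> torch.Tensor:
    """Per-token text embeddings for conditioning the diffusion model."""
    batch = tokenizer(caption, return_tensors="pt")
    return encoder(**batch).last_hidden_state  # (1, seq_len, d_model)

emb = encode_caption("a dog barks while rain falls on a tin roof")
```

Second, the pressure level-based mixing augmentation. The sketch below is one standard formulation, a relative-gain weight computed from the two clips' pressure levels as in between-class learning; the paper's exact recipe may differ:

```python
import numpy as np

def pressure_db(x: np.ndarray) -> float:
    # Root-mean-square pressure level of a waveform, in dB.
    return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

def pressure_aware_mix(x1: np.ndarray, x2: np.ndarray,
                       rng: np.random.Generator) -> np.ndarray:
    """Mix two clips with a weight derived from their relative pressure
    levels, instead of the uniform random gain used by prior methods."""
    n = min(len(x1), len(x2))
    x1, x2 = x1[:n], x2[:n]
    r = rng.uniform(1e-3, 1.0 - 1e-3)        # random mixing proportion
    g = pressure_db(x1) - pressure_db(x2)    # pressure gap in dB
    p = 1.0 / (1.0 + 10.0 ** (g / 20.0) * (1.0 - r) / r)
    # Energy-normalized sum so the mixture keeps a sensible level.
    return (p * x1 + (1.0 - p) * x2) / np.sqrt(p ** 2 + (1.0 - p) ** 2)
```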
Related papers
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio.
(arXiv, 2024-10-04)
- Chain-of-Thought Prompting for Speech Translation [33.77037760225061]
Large language models (LLMs) have demonstrated remarkable advancements in language understanding and generation.
Recent research has adapted these models to use speech embeddings for prompting, resulting in Speech-LLM models that exhibit strong performance.
We propose a novel approach that leverages ASR transcripts as prompts for automatic speech translation (AST) in a Speech-LLM built on an encoder-decoder text LLM (a prompt sketch follows this entry).
(arXiv, 2024-09-17)
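The entry above feeds the model's own ASR transcript back as a prompt before translating, in chain-of-thought style. A hypothetical prompt template (the wording and layout are illustrative assumptions, not the paper's exact format):

```python
def build_ast_prompt(asr_transcript: str, target_lang: str = "German") -> str:
    """Hypothetical two-step prompt: surface the transcript first,
    then request the translation, mimicking a chain of thought."""
    return (
        f"Step 1 - transcript of the audio: {asr_transcript}\n"
        f"Step 2 - translate the transcript into {target_lang}:"
    )

print(build_ast_prompt("the weather is lovely today"))
```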
- Advancing Multi-talker ASR Performance with Large Language Models [48.52252970956368]
Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR).
In this paper, we propose an LLM-based serialized output training (SOT) approach for multi-talker ASR, leveraging a pre-trained speech encoder and an LLM (a sketch of SOT target construction follows this entry).
Our approach surpasses traditional attention-based encoder-decoder (AED) methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world AMI dataset.
(arXiv, 2024-08-30)
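Serialized output training flattens overlapping speakers' transcripts into a single target sequence separated by a speaker-change token, which the model then learns to emit. A minimal sketch (the `<sc>` token name and first-in-first-out ordering are common conventions, assumed here):

```python
def build_sot_target(utterances: list[tuple[float, str]],
                     sc_token: str = "<sc>") -> str:
    """Serialize multi-talker transcripts into one SOT target string.
    `utterances` holds (start_time_seconds, transcript) pairs."""
    ordered = sorted(utterances, key=lambda u: u[0])  # FIFO by start time
    return f" {sc_token} ".join(text for _, text in ordered)

# Overlapping utterances become one serialized training target:
print(build_sot_target([(1.2, "how are you"), (0.4, "hello there")]))
# -> "hello there <sc> how are you"
```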
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework combining the three tasks of video-to-audio, audio-to-text, and text-to-audio generation.
C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities.
Our method combines audio understanding, video-to-audio generation, and text-to-audio generation in one unified model.
(arXiv, 2024-05-25)
- Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations [14.437646262239612]
A self-supervised voice conversion (VC) architecture can be used to encode transitory features, such as content, separately from stationary ones, such as speaker identity or recording conditions, creating speaker-disentangled representations.
Using speaker-disentangled codes to train LLMs for text-to-speech (TTS) allows the LLM to generate the content and style of the speech from text alone, much as humans do, while the speaker identity is provided by the decoder of the VC model.
Results show that LLMs trained over speaker-disentangled self-supervised representations provide an improvement of 4.7 percentage points.
(arXiv, 2024-02-05)
- Speech Translation with Large Language Models: An Industrial Practice [64.5419534101104]
We introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained large language model (LLM).
By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations.
Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST.
(arXiv, 2023-12-21)
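The entry above describes LLM-ST as emitting timestamped transcriptions and translations under multi-task instruction tuning. A hypothetical shape for one training example (field names and timestamp-tag syntax are assumptions for illustration, not the paper's specification):

```python
# One multi-task instruction-tuning example, sketched as a dict.
example = {
    "instruction": "Transcribe the audio with timestamps, "
                   "then translate it into Chinese.",
    "target": "<0.00> How are you today? <1.35>\n"
              "<0.00> 你今天好吗？ <1.35>",
}
```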
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
(arXiv, 2023-07-08)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a contrastive language-image pretraining (CLIP) model (a conditioning sketch follows this entry).
(arXiv, 2023-06-16)
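CLIPSonic's bridge works because CLIP places images and text in a shared embedding space: the diffusion model is trained on frame embeddings, and at inference a caption's text embedding is swapped in. A rough sketch of the two conditioning paths with Hugging Face's CLIP (the diffusion model itself is omitted):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def frame_condition(frame: Image.Image) -> torch.Tensor:
    # Training time: condition audio generation on a video frame.
    inputs = processor(images=frame, return_tensors="pt")
    return clip.get_image_features(**inputs)

@torch.no_grad()
def text_condition(caption: str) -> torch.Tensor:
    # Inference time: swap in the text tower of the same shared space.
    inputs = processor(text=[caption], return_tensors="pt", padding=True)
    return clip.get_text_features(**inputs)
```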
- AudioLDM: Text-to-Audio Generation with Latent Diffusion Models [35.703877904270726]
Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general audio from text descriptions.
In this study, we propose AudioLDM, a TTA system built on a latent space that learns continuous audio representations from contrastive language-audio pretraining (CLAP) latents (an embedding sketch follows this entry).
Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics.
(arXiv, 2023-01-29)
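AudioLDM's conditioning signal comes from CLAP, which aligns audio and text embeddings; at inference only the text tower is needed. A sketch using the transformers CLAP implementation (the public `laion/clap-htsat-unfused` checkpoint stands in here for AudioLDM's own CLAP training):

```python
import torch
from transformers import ClapModel, ClapProcessor

clap = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

@torch.no_grad()
def clap_text_embedding(caption: str) -> torch.Tensor:
    """Text-side CLAP embedding that conditions the latent diffusion
    model at inference; audio-side embeddings pair with it in training."""
    inputs = processor(text=[caption], return_tensors="pt", padding=True)
    return clap.get_text_features(**inputs)

cond = clap_text_embedding("heavy rain with distant thunder")
```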
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.