Related papers: Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

URL: http://arxiv.org/abs/2304.13731v2
Date: Mon, 29 May 2023 12:09:08 GMT
Title: Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model
Authors: Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Soujanya Poria
Abstract summary: Large language models (LLM) allow many interesting properties, such as, instruction- and chain-of-thought-based fine-tuning. We adopt such an instruction-tuned LLM Flan-T5 as the text encoder for text-to-audio (TTA) generation. Our approach TANGO outperforms the state-of-the-art AudioLDM on most metrics and stays comparable on the rest on AudioCaps test set.
Score: 23.058939018350603
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The immense scale of the recent large language models (LLM) allows many interesting properties, such as, instruction- and chain-of-thought-based fine-tuning, that has significantly improved zero- and few-shot performance in many natural language processing (NLP) tasks. Inspired by such successes, we adopt such an instruction-tuned LLM Flan-T5 as the text encoder for text-to-audio (TTA) generation -- a task where the goal is to generate an audio from its textual description. The prior works on TTA either pre-trained a joint text-audio encoder or used a non-instruction-tuned model, such as, T5. Consequently, our latent diffusion model (LDM)-based approach TANGO outperforms the state-of-the-art AudioLDM on most metrics and stays comparable on the rest on AudioCaps test set, despite training the LDM on a 63 times smaller dataset and keeping the text encoder frozen. This improvement might also be attributed to the adoption of audio pressure level-based sound mixing for training set augmentation, whereas the prior methods take a random mix.

Related papers

DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment [94.0709779805955]
We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM)<n>It is designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning.<n>DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks.
arXiv Detail & Related papers (2025-07-03T16:28:25Z)
From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs.<n>These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks.<n>We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
Make Some Noise: Towards LLM audio reasoning and generation using sound tokens [19.48089933713418]
We introduce a novel approach that combines Variational Quantization with Flow Matching to convert audio into ultra-low discrete tokens of 0.23kpbs. Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events.
arXiv Detail & Related papers (2025-03-28T09:43:47Z)
Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
Chain-of-Thought Prompting for Speech Translation [33.77037760225061]
Large language models (LLMs) have demonstrated remarkable advancements in language understanding and generation. Recent research has adapted these models to use speech embeddings for prompting, resulting in Speech-LLM models that exhibit strong performance. We propose a novel approach to leverage ASR transcripts as prompts for AST in a Speech-LLM built on an encoder-decoder text LLM.
arXiv Detail & Related papers (2024-09-17T20:16:43Z)
Advancing Multi-talker ASR Performance with Large Language Models [48.52252970956368]
Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problem for automatic speech recognition (ASR) In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging pre-trained speech encoder and LLM. Our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI.
arXiv Detail & Related papers (2024-08-30T17:29:25Z)
C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio together. C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities. Our method combines the previous tasks of audio understanding, video-to-audio generation, and text-to-audio generation together into one unified model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z)
Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations [14.437646262239612]
Self-supervised Voice Conversion (VC) architecture can be used to learn to encode transitory features, such as content, separately from stationary ones, such as speaker ID or recording conditions, creating speaker-disentangled representations. Using speaker-disentangled codes to train LLMs for text-to-speech (TTS) allows the LLM to generate the content and the style of the speech only from the text, similarly to humans, while the speaker identity is provided by the decoder of the VC model. Results show that LLMs trained over speaker-disentangled self-supervised representations provide an improvement of 4.7pp
arXiv Detail & Related papers (2024-02-05T15:08:19Z)
Speech Translation with Large Language Models: An Industrial Practice [64.5419534101104]
We introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained large language model (LLM) By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations. Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST.
arXiv Detail & Related papers (2023-12-21T05:32:49Z)
On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge. We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models [35.703877904270726]
Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining latents. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics.
arXiv Detail & Related papers (2023-01-29T17:48:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.