Related papers: Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation

URL: http://arxiv.org/abs/2410.20336v1
Date: Sun, 27 Oct 2024 04:28:57 GMT
Title: Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation
Authors: Maohao Shen, Shun Zhang, Jilong Wu, Zhiping Xiu, Ehab AlBadawy, Yiting Lu, Mike Seltzer, Qing He
Abstract summary: Large language models (LLMs) have revolutionized natural language processing (NLP). We introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance. We further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-expert architecture.
Score: 14.746190461312036
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have revolutionized natural language processing (NLP) with impressive performance across various text-based tasks. However, the extension of text-dominant LLMs to speech generation tasks remains under-explored. In this work, we introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance. Building on TTS-Llama, we further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-expert architecture. Extensive empirical results demonstrate MoLE-Llama's competitive performance on both text-only question-answering (QA) and TTS tasks, mitigating the catastrophic forgetting issue in either modality. Finally, we further explore MoLE-Llama in text-in-speech-out QA tasks, demonstrating its great potential as a multimodal dialog system capable of speech generation.

Related papers

FastSLM: Hierarchical Frame Q-Former for Effective Speech Modality Adaptation [3.8125534288516683]
FastSLM is a lightweight yet efficient speech-language model (SLM) designed for effective understanding and reasoning over long-form speech. We present a novel three-stage training strategy that enhances generalization across a wide range of speech-related tasks. Experimental results demonstrate that FastSLM achieves competitive performance compared to existing state-of-the-art models.
arXiv Detail & Related papers (2026-01-08T07:46:03Z)
Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving [36.246791887458194]
Large language models (LLMs) have shown remarkable generalization across tasks. LLMs typically use supervised fine-tuning to align speech with text-based LLMs. We propose a novel multi-task 'behavior imitation' method with speech-text interleaving.
arXiv Detail & Related papers (2025-05-24T11:09:13Z)
KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025 [56.61209412965054]
We present the Karlsruhe Institute of Technology's submissions for the Offline ST and Instruction Following (IF) tracks. We propose a pipeline that employs multiple automatic speech recognition systems, whose outputs are fused using an LLM with document-level context. For the IF track, we develop an end-to-end model that integrates a speech encoder with an LLM to perform a wide range of instruction-following tasks.
arXiv Detail & Related papers (2025-05-19T12:21:29Z)
TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling [46.60911294356232]
We introduce Text-Aligned Speech Tokenization and Embedding (TASTE). TASTE is a method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage. We conduct extensive experiments and show that TASTE can preserve essential paralinguistic information while dramatically reducing the token sequence length.
arXiv Detail & Related papers (2025-04-09T17:14:33Z)
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM [35.443850239910866]
We propose a lightweight, autoregressive streaming TTS system that generates high-quality speech with low latency. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score.
arXiv Detail & Related papers (2025-03-06T18:59:38Z)
Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM [44.59026505152727]
This paper proposes a novel speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be easily connected to a textual LLM. In addition, we also designed a method to achieve duplex dialogue ability through multi-task training.
arXiv Detail & Related papers (2024-11-01T17:59:51Z)
VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach on the low-rank adaptation (LoRA) of the large language model (LLM) backbone. Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs). We present a simple yet effective automatic process for creating speech-text pair data. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments. Our approach utilizes the WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing. We reformulate speech processing tasks into speech-to-unit generation tasks. We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5 [25.644228338604815]
We propose the BESTOW architecture to bring the BESt features from TwO Worlds into a single model that is highly efficient and has strong multitask capabilities. We reformulate streamable SpeechLLM as a read-write policy problem and unify the offline and streaming research with the BESTOW architecture.
arXiv Detail & Related papers (2024-06-28T14:40:03Z)
Speech Translation with Large Language Models: An Industrial Practice [64.5419534101104]
We introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained large language model (LLM). By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations. Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST.
arXiv Detail & Related papers (2023-12-21T05:32:49Z)
AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs [27.122094554340194]
We extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities. The resulting end-to-end model, named AudioChatLlama, can utilize audio prompts as a replacement for text and sustain a conversation.
arXiv Detail & Related papers (2023-11-12T06:56:14Z)
SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen. The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
arXiv Detail & Related papers (2023-06-03T22:35:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.