PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
- URL: http://arxiv.org/abs/2406.12428v2
- Date: Thu, 03 Oct 2024 05:17:25 GMT
- Title: PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
- Authors: Kentaro Mitsui, Koh Mitsuda, Toshiaki Wakatsuki, Yukiya Hono, Kei Sawada
- Abstract summary: Multimodal language models that process both text and speech have a potential for applications in spoken dialogue systems.
However, generating a spoken response requires the prior generation of a written response, and speech sequences are significantly longer than text sequences.
This study addresses these issues by extending the input and output sequences of the language model to support the parallel generation of text and speech.
- Score: 7.326036800127981
- License:
- Abstract: Multimodal language models that process both text and speech have a potential for applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response requires the prior generation of a written response, and (2) speech sequences are significantly longer than text sequences. This study addresses these issues by extending the input and output sequences of the language model to support the parallel generation of text and speech. Our experiments on spoken question answering tasks demonstrate that our approach improves latency while maintaining the quality of response content. Additionally, we show that latency can be further reduced by generating speech in multiple sequences. Demo samples are available at https://rinnakk.github.io/research/publications/PSLM.
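The two latency sources named in the abstract can be illustrated with a toy step-count model. This is only a sketch under stated assumptions, not the authors' implementation: it assumes one decoding step emits one token per stream, ignores per-step compute cost and waveform synthesis, and the function names and token counts are hypothetical.

```python
import math


def steps_sequential(num_text_tokens: int, num_speech_tokens: int) -> tuple[int, int]:
    """Text-then-speech decoding: the full written response comes first.

    Returns (step at which the first speech token appears, total steps).
    """
    first_speech_step = num_text_tokens + 1
    total_steps = num_text_tokens + num_speech_tokens
    return first_speech_step, total_steps


def steps_parallel(
    num_text_tokens: int, num_speech_tokens: int, num_speech_streams: int = 1
) -> tuple[int, int]:
    """Parallel decoding: every step emits one text token and one token per speech stream.

    Assumption: speech tokens are split evenly across the speech streams.
    Returns (step at which the first speech token appears, total steps).
    """
    speech_steps = math.ceil(num_speech_tokens / num_speech_streams)
    first_speech_step = 1
    total_steps = max(num_text_tokens, speech_steps)
    return first_speech_step, total_steps


# Hypothetical response: 20 text tokens and 200 speech tokens.
print(steps_sequential(20, 200))                      # (21, 220)
print(steps_parallel(20, 200))                        # (1, 200)
print(steps_parallel(20, 200, num_speech_streams=2))  # (1, 100)
```

In this toy model, parallel generation removes the wait for the written response, and splitting speech across more streams shortens the total number of steps, which mirrors the two improvements the abstract claims.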
Related papers
- Long-Form Speech Generation with Spoken Language Models [64.29591880693468]
SpeechSSM learns from and samples long-form spoken audio in a single decoding session without text intermediates.
New embedding-based and LLM-judged metrics; quality measurements over length and time; and a new benchmark for long-form speech processing and generation, LibriSpeech-Long.
arXiv Detail & Related papers (2024-12-24T18:56:46Z)
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach on the low-rank adaptation (LoRA) of the large language model (LLM) backbone.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities.
Our novel architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences.
We construct a multi-turn speech-to-speech dialogue dataset named method-500k which includes nearly 500k turns of speech-to-speech dialogues.
arXiv Detail & Related papers (2024-10-09T05:04:31Z)
- Recent Advances in Speech Language Models: A Survey [45.968078636811356]
Speech Language Models (SpeechLMs) are end-to-end models that generate speech without converting from text.
This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs.
arXiv Detail & Related papers (2024-10-01T21:48:12Z)
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components, such as voice activity detection and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z)
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
- Speak While You Think: Streaming Speech Synthesis During Text Generation [13.964169328257233]
Large Language Models (LLMs) demonstrate impressive capabilities, yet interaction with these models is mostly facilitated through text.
We propose LLM2Speech, an architecture that synthesizes speech while text is being generated by an LLM, which yields a significant latency reduction.
arXiv Detail & Related papers (2023-09-20T11:00:15Z)
- SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen.
The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
arXiv Detail & Related papers (2023-06-03T22:35:27Z)