LLaMA-Omni: Seamless Speech Interaction with Large Language Models
- URL: http://arxiv.org/abs/2409.06666v1
- Date: Tue, 10 Sep 2024 17:34:34 GMT
- Title: LLaMA-Omni: Seamless Speech Interaction with Large Language Models
- Authors: Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng
- Abstract summary: LLaMA-Omni is a novel model architecture designed for low-latency and high-quality speech interaction with large language models.
It integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder.
It provides better responses in both content and style, with a response latency as low as 226ms.
- Score: 43.28912243888652
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Models like GPT-4o enable real-time interaction with large language models (LLMs) through speech, significantly enhancing user experience compared to traditional text-based interaction. However, there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency and high-quality speech interaction with LLMs. LLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. It eliminates the need for speech transcription, and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model based on the latest Llama-3.1-8B-Instruct model. To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which includes 200K speech instructions and corresponding speech responses. Experimental results show that compared to previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226ms. Additionally, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of speech-language models in the future.
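For orientation, the pipeline the abstract describes (speech encoder, speech adaptor, LLM, streaming speech decoder, with text and speech emitted together) can be sketched as below. All class and method names here are hypothetical placeholders chosen for illustration, not the authors' code; the stubs only mimic the stated data flow.

```python
# Minimal, hypothetical sketch of the data flow described in the abstract.
# Names, shapes, and stub behaviors are illustrative, not the authors' implementation.

from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class LlamaOmniSketch:
    chunk_size: int = 4  # number of text tokens per streaming speech chunk

    def encode_speech(self, waveform: List[float]) -> List[List[float]]:
        """Pretrained speech encoder: waveform -> frame-level features (stub)."""
        return [[x] for x in waveform]

    def adapt(self, features: List[List[float]]) -> List[List[float]]:
        """Speech adaptor: map/downsample speech features into the LLM embedding space (stub)."""
        return features[::2]  # e.g., 2x downsampling

    def llm_generate(self, prefix: List[List[float]]) -> Iterator[str]:
        """LLM: consume speech-derived embeddings directly, stream out text tokens (stub)."""
        for token in ["Sure,", "here", "is", "the", "answer", "."]:
            yield token

    def speech_decoder(self, text_tokens: List[str]) -> List[float]:
        """Streaming speech decoder: turn a chunk of text tokens into audio samples (stub)."""
        return [0.0] * (10 * len(text_tokens))

    def respond(self, waveform: List[float]):
        """Generate text and speech simultaneously, with no ASR transcription step."""
        embeddings = self.adapt(self.encode_speech(waveform))
        text, buffer = [], []
        for token in self.llm_generate(embeddings):
            text.append(token)
            buffer.append(token)
            if len(buffer) == self.chunk_size:   # emit audio as soon as a chunk is ready
                yield " ".join(text), self.speech_decoder(buffer)
                buffer = []
        if buffer:                               # flush the final partial chunk
            yield " ".join(text), self.speech_decoder(buffer)


# Usage: each yielded item pairs the text so far with a newly synthesized audio chunk.
for partial_text, audio_chunk in LlamaOmniSketch().respond([0.1, -0.2, 0.05, 0.3]):
    print(partial_text, len(audio_chunk))
```

Because audio chunks are emitted while text generation is still in progress, playback can start well before the full response is finished, which is the mechanism behind the low response latency the abstract reports.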
Related papers
- MinMo: A Multimodal Large Language Model for Seamless Voice Interaction [73.39573341265027]
We introduce MinMo, a Multimodal Large Language Model for seamless voice interaction.
We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment.
After this multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation.
arXiv Detail & Related papers (2025-01-10T15:55:27Z)
- Long-Form Speech Generation with Spoken Language Models [64.29591880693468]
SpeechSSM learns from and samples long-form spoken audio in a single decoding session without text intermediates.
The work also contributes new embedding-based and LLM-judged metrics, quality measurements over length and time, and a new benchmark for long-form speech processing and generation, LibriSpeech-Long.
arXiv Detail & Related papers (2024-12-24T18:56:46Z)
- CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [74.80386066714229]
We present an improved streaming speech synthesis model, CosyVoice 2.
Specifically, we introduce finite-scalar quantization to improve codebook utilization of speech tokens.
We develop a chunk-aware causal flow matching model to support various synthesis scenarios.
arXiv Detail & Related papers (2024-12-13T12:59:39Z)
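Finite-scalar quantization, mentioned above, is usually formulated as squashing each latent dimension into a bounded range and rounding it to a small, fixed number of levels, so every combination of levels is a valid code and the implicit codebook cannot collapse. The sketch below follows that generic formulation with made-up level counts; it is not CosyVoice 2's actual speech tokenizer.

```python
import numpy as np


def fsq_quantize(z: np.ndarray, levels=(7, 5, 5, 5)) -> np.ndarray:
    """Generic finite-scalar quantization sketch (not CosyVoice 2's exact code).

    Each of the len(levels) dimensions is squashed into a bounded range and
    rounded to its own small number of levels, so the implicit codebook
    (7*5*5*5 = 875 entries here) is fully utilized by construction.
    """
    levels = np.asarray(levels, dtype=np.float64)   # levels per dimension
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half                     # squash each dim into [-half, half]
    return np.round(bounded)                        # snap to the nearest level


def fsq_code_index(q: np.ndarray, levels=(7, 5, 5, 5)) -> int:
    """Map a quantized vector to a single integer token id (mixed-radix encoding)."""
    levels = np.asarray(levels)
    digits = (q + (levels - 1) / 2.0).astype(int)   # shift each dim to 0..L-1
    index = 0
    for digit, base in zip(digits, levels):
        index = index * int(base) + int(digit)
    return index


# Example: one 4-dim latent frame becomes one discrete speech token id.
frame = np.array([0.7, -1.2, 0.1, 2.3])
token = fsq_code_index(fsq_quantize(frame))
print(token)
```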
- Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM [44.59026505152727]
This paper proposes a novel speech-text multimodal LLM architecture called Freeze-Omni.
Our main contribution is that the speech input and output modalities can be easily connected to a textual LLM while keeping the LLM's parameters frozen.
In addition, we also designed a method to achieve duplex dialogue ability through multi-task training.
arXiv Detail & Related papers (2024-11-01T17:59:51Z)
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities.
Our novel architecture, GroupFormer, reduces speech sequences to lengths comparable to text sequences.
We construct a multi-turn speech-to-speech dialogue dataset comprising nearly 500k turns of speech-to-speech dialogues.
arXiv Detail & Related papers (2024-10-09T05:04:31Z)
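The sequence-shortening idea behind such grouping can be illustrated generically: adjacent speech frames are merged into single group vectors, so the sequence the LLM attends over becomes several times shorter. The function below is a schematic of that general trick with arbitrary shapes, not the GroupFormer architecture itself.

```python
import numpy as np


def group_speech_frames(frames: np.ndarray, group_size: int = 4) -> np.ndarray:
    """Schematic of sequence shortening by grouping (not the actual GroupFormer).

    frames: (T, d) speech features. Adjacent groups of `group_size` frames are
    concatenated into single vectors, giving a (ceil(T / group_size), d * group_size)
    sequence that is roughly `group_size` times shorter for the LLM to process.
    """
    T, d = frames.shape
    pad = (-T) % group_size                     # right-pad so T divides evenly
    padded = np.pad(frames, ((0, pad), (0, 0)))
    return padded.reshape(-1, group_size * d)   # (T', group_size * d)


# Example: 250 speech frames become a 63-step sequence, closer to text length.
speech = np.random.randn(250, 256)
grouped = group_speech_frames(speech, group_size=4)
print(grouped.shape)  # (63, 1024)
```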
- Speak While You Think: Streaming Speech Synthesis During Text Generation [13.964169328257233]
Large Language Models (LLMs) demonstrate impressive capabilities, yet interaction with these models is mostly facilitated through text.
We propose LLM2Speech, an architecture that synthesizes speech while text is being generated by an LLM, yielding a significant latency reduction.
arXiv Detail & Related papers (2023-09-20T11:00:15Z)
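The latency benefit of synthesizing speech while text is still being generated follows a simple producer/consumer pattern: buffer tokens only until a phrase boundary, then hand that phrase to the synthesizer. The sketch below assumes a hypothetical token stream and a stand-in synthesizer; it is not LLM2Speech's model.

```python
from typing import Callable, Iterator, List


def speak_while_generating(
    token_stream: Iterator[str],
    synthesize: Callable[[str], bytes],
    flush_on: str = ".,;!?",
) -> Iterator[bytes]:
    """Generic sketch of streaming TTS during LLM decoding (not LLM2Speech itself).

    Instead of waiting for the full response, buffered text is synthesized as soon
    as a phrase boundary appears, so audio playback starts after the first phrase.
    """
    buffer: List[str] = []
    for token in token_stream:
        buffer.append(token)
        if token and token[-1] in flush_on:     # phrase boundary reached
            yield synthesize(" ".join(buffer))  # emit audio early
            buffer = []
    if buffer:                                  # flush any trailing words
        yield synthesize(" ".join(buffer))


# Usage with stand-in components: a fake token stream and a dummy synthesizer.
tokens = iter("The answer is 42 . It was computed quickly .".split())
for audio in speak_while_generating(tokens, synthesize=lambda text: text.encode()):
    print(audio)
```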
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.