LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
- URL: http://arxiv.org/abs/2505.02625v1
- Date: Mon, 05 May 2025 12:53:09 GMT
- Title: LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
- Authors: Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng
- Abstract summary: We introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder.
- Score: 43.533849239738394
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs like GLM-4-Voice, which was trained on millions of hours of speech data.
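The architecture described in the abstract (a speech encoder feeding an LLM whose streamed text tokens drive an autoregressive streaming speech decoder) can be sketched as a toy pipeline. Every class, method, and value below is a hypothetical stand-in for illustration, not the actual LLaMA-Omni 2 implementation:

```python
# Toy sketch of a SpeechLM pipeline: encoder -> LLM -> streaming decoder.
# All components are hypothetical stubs, not the paper's code.
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class SpeechEncoder:
    """Maps raw audio frames to speech representations (stub: pass-through)."""
    def encode(self, audio_frames: List[float]) -> List[float]:
        return audio_frames


@dataclass
class TextLLM:
    """Stands in for the LLM backbone: yields response text tokens one by one."""
    response: str = "hello world"

    def stream_tokens(self, speech_features: List[float]) -> Iterator[str]:
        for token in self.response.split():
            yield token


@dataclass
class StreamingSpeechDecoder:
    """Autoregressively emits speech units as soon as text tokens arrive."""
    def decode(self, text_tokens: Iterator[str]) -> Iterator[str]:
        for token in text_tokens:
            # One placeholder unit per text token; a real decoder would emit
            # discrete acoustic units for a vocoder to turn into waveform.
            yield f"<unit:{token}>"


def run_pipeline(audio: List[float]) -> List[str]:
    encoder, llm, decoder = SpeechEncoder(), TextLLM(), StreamingSpeechDecoder()
    features = encoder.encode(audio)
    return list(decoder.decode(llm.stream_tokens(features)))


print(run_pipeline([0.1, 0.2]))  # → ['<unit:hello>', '<unit:world>']
```

The point of the streaming arrangement is that speech units can be produced while the LLM is still generating text, which is what enables low-latency interaction; the stub generators above mimic that token-by-token flow.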
Related papers
- MinMo: A Multimodal Large Language Model for Seamless Voice Interaction [73.39573341265027]
We introduce MinMo, a Multimodal Large Language Model for seamless voice interaction. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment. After multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation.
arXiv Detail & Related papers (2025-01-10T15:55:27Z) - SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation [56.683846056788326]
We propose SLIDE (SLM and LLM Integration for spontaneous spoken Dialogue gEneration). We convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme. Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.
arXiv Detail & Related papers (2025-01-01T11:11:07Z) - GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot [30.866548518233433]
We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions.
arXiv Detail & Related papers (2024-12-03T17:41:24Z) - Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM [44.59026505152727]
This paper proposes a novel speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be easily connected to a textual LLM while keeping its parameters frozen. In addition, we designed a method to achieve duplex dialogue ability through multi-task training.
arXiv Detail & Related papers (2024-11-01T17:59:51Z) - IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities.
Our novel architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences.
We construct a multi-turn speech-to-speech dialogue dataset containing nearly 500k turns of speech-to-speech dialogues.
arXiv Detail & Related papers (2024-10-09T05:04:31Z) - Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components such as voice activity detection and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z) - LLaMA-Omni: Seamless Speech Interaction with Large Language Models [43.28912243888652]
LLaMA-Omni is a novel model architecture designed for low-latency and high-quality speech interaction with large language models. It integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. It provides better responses in both content and style, with a response latency as low as 226 ms.
arXiv Detail & Related papers (2024-09-10T17:34:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.