SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation
- URL: http://arxiv.org/abs/2505.17060v1
- Date: Sat, 17 May 2025 08:13:59 GMT
- Title: SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation
- Authors: Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang
- Abstract summary: SALMONN-omni is the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening. SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation and context-dependent barge-in.
- Score: 17.56310064245171
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In order to enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech rather than text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least 30% relative performance improvement over existing open-source full-duplex models and performs highly competitively with half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation and context-dependent barge-in, with further improvements achieved through reinforcement learning. Some demo conversations between users and SALMONN-omni are provided in the following repository: https://github.com/bytedance/SALMONN.
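The abstract describes the dynamic thinking mechanism only at a high level. Purely as an illustration, the sketch below shows how a full-duplex loop might switch between listening and speaking based on a per-step state decision; the `State` enum, the `predict_transition` heuristic, and its thresholds are hypothetical stand-ins for the learned behavior, not details from the paper.

```python
# Illustrative sketch only: SALMONN-omni's actual dynamic thinking
# mechanism is not specified at this level of detail in the abstract.
# The state names and the decision heuristic below are assumptions.
from enum import Enum
import random

class State(Enum):
    LISTEN = "listen"
    SPEAK = "speak"

def predict_transition(state: State, audio_frame: float) -> State:
    """Stub for the backbone's per-step state decision.

    In the real model this would be a learned prediction; here we fake
    it with energy thresholds: loud input while speaking models a
    barge-in, silence while listening models end of the user's turn.
    """
    if state is State.SPEAK and audio_frame > 0.8:   # user barges in
        return State.LISTEN
    if state is State.LISTEN and audio_frame < 0.1:  # user went quiet
        return State.SPEAK
    return state

def full_duplex_loop(audio_stream):
    """Run one listen/speak state machine over a stream of frames."""
    state = State.LISTEN
    for t, frame in enumerate(audio_stream):
        new_state = predict_transition(state, frame)
        if new_state is not state:
            print(f"t={t}: transition {state.value} -> {new_state.value}")
            state = new_state
        if state is State.SPEAK:
            # A real system would generate speech here while still
            # consuming the incoming audio (the full-duplex property).
            pass

if __name__ == "__main__":
    random.seed(0)
    full_duplex_loop(random.random() for _ in range(20))
```

The key property the sketch tries to convey is that input is consumed on every step, even while speaking, so decisions such as barge-in handling can be made without a separate voice activity detector module.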
Related papers
- Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model [85.72664004969182]
We introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction. Our post-training approach employs interleaved token-output of text and audio to enhance semantic coherence.
arXiv Detail & Related papers (2025-06-10T16:37:39Z) - From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z) - Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples [55.2480439325792]
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. These models often hallucinate non-existent sound events, reducing their reliability in real-world applications. We propose LISTEN, a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds.
arXiv Detail & Related papers (2025-05-20T15:44:01Z) - LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM [35.443850239910866]
We propose a lightweight, autoregressive streaming TTS system that generates high-quality speech with low latency. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score.
arXiv Detail & Related papers (2025-03-06T18:59:38Z) - SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation [17.56310064245171]
SALMONN-omni is a speech understanding and generation model capable of simultaneously listening to its own generated speech sounds while speaking. SALMONN-omni excels at managing turn-taking, barge-in, and echo cancellation scenarios, establishing its potential as a robust prototype for full-duplex conversational AI systems.
arXiv Detail & Related papers (2024-11-27T08:38:57Z) - OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
OmniFlatten is an end-to-end GPT-based model capable of effectively modeling the complex behaviors inherent in natural conversations with low latency. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z) - VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach using low-rank adaptation (LoRA) of the large language model (LLM) backbone (a minimal LoRA sketch is given after this list). Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z) - Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments. We use WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. Experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z) - Language Model Can Listen While Speaking [17.584201137311286]
Listen-while-speaking language model (LSLM) is an end-to-end system equipped with both listening and speaking channels.
Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems.
arXiv Detail & Related papers (2024-08-05T16:47:22Z) - Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models [66.24055500785657]
Traditional turn-based chat systems prevent users from verbally interacting with the system while it is generating responses.
To overcome these limitations, we adapt existing LLMs to listen to users while generating output and to provide users with instant feedback.
We build a dataset consisting of alternating time slices of queries and responses as well as covering typical feedback types in instantaneous interactions.
arXiv Detail & Related papers (2024-06-22T03:20:10Z) - A Full-duplex Speech Dialogue Scheme Based On Large Language Models [23.994130020644842]
We present a generative dialogue system capable of operating in a full-duplex manner, allowing for seamless interaction.
The system generates tokens for inquiry responses and makes autonomous decisions to respond to, wait for, or interrupt the user.
arXiv Detail & Related papers (2024-05-29T20:05:46Z) - AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs [27.122094554340194]
We extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities.
The resulting end-to-end model, named AudioChatLlama, can utilize audio prompts as a replacement for text and sustain a conversation.
arXiv Detail & Related papers (2023-11-12T06:56:14Z)