SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation
- URL: http://arxiv.org/abs/2411.18138v1
- Date: Wed, 27 Nov 2024 08:38:57 GMT
- Title: SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation
- Authors: Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang
- Abstract summary: SALMONN-omni is a speech understanding and generation model capable of simultaneously listening to its own generated speech and background sounds while speaking.
SALMONN-omni excels at managing turn-taking, barge-in, and echo cancellation scenarios, establishing its potential as a robust prototype for full-duplex conversational AI systems.
- Score: 17.56310064245171
- Abstract: Full-duplex multimodal large language models (LLMs) provide a unified framework for addressing diverse speech understanding and generation tasks, enabling more natural and seamless human-machine conversations. Unlike traditional modularised conversational AI systems, which separate speech recognition, understanding, and text-to-speech generation into distinct components, multimodal LLMs operate as single end-to-end models. This streamlined design eliminates error propagation across components and fully leverages the rich non-verbal information embedded in input speech signals. We introduce SALMONN-omni, a codec-free, full-duplex speech understanding and generation model capable of simultaneously listening to its own generated speech and background sounds while speaking. To support this capability, we propose a novel duplex spoken dialogue framework incorporating a "thinking" mechanism that facilitates asynchronous text and speech generation relying on embeddings instead of codecs (quantized speech and audio tokens). Experimental results demonstrate SALMONN-omni's versatility across a broad range of streaming speech tasks, including speech recognition, speech enhancement, and spoken question answering. Additionally, SALMONN-omni excels at managing turn-taking, barge-in, and echo cancellation scenarios, establishing its potential as a robust prototype for full-duplex conversational AI systems. To the best of our knowledge, SALMONN-omni is the first codec-free model of its kind. A full technical report along with model checkpoints will be released soon.
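The abstract's duplex design can be pictured as a frame-by-frame loop: the model continuously encodes both the user's audio and its own playback, and at each step either takes a silent "thinking" step or emits output whose speech is synthesized directly from embeddings rather than codec tokens. Below is a minimal Python sketch of that reading; every name in it (the encoder, LLM, and vocoder modules, the thinking-token id) is an assumption for illustration, not the released SALMONN-omni implementation.

```python
# Hypothetical sketch of a codec-free full-duplex loop in the spirit of
# SALMONN-omni; module names and the "thinking" token are illustrative.
import torch

class DuplexAgent:
    def __init__(self, encoder, llm, vocoder, think_id):
        self.encoder = encoder    # streams mic + own playback into embeddings
        self.llm = llm            # full-duplex LLM carrying a streaming state
        self.vocoder = vocoder    # maps output embeddings straight to waveform
        self.think_id = think_id  # special token: keep listening, stay silent
        self.state = None

    @torch.no_grad()
    def step(self, mic_chunk, playback_chunk):
        # Hearing its own playback is what makes echo handling and
        # barge-in detection learnable end-to-end.
        feats = self.encoder(mic_chunk, playback_chunk)
        token, speech_emb, self.state = self.llm(feats, self.state)
        if token == self.think_id:
            return None                      # "thinking": silent this frame
        return self.vocoder(speech_emb)      # codec-free: no quantized tokens
```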
Related papers
- OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis [68.73476738779628]
We propose OpenOmni, a two-stage training method combining omnimodal alignment and speech generation.
Experiments demonstrate that OpenOmni consistently improves across omnimodal, vision-language, and speech-language evaluations.
arXiv Detail & Related papers (2025-01-08T15:18:09Z)
- CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [74.80386066714229]
We present an improved streaming speech synthesis model, CosyVoice 2.
Specifically, we introduce finite-scalar quantization to improve codebook utilization of speech tokens.
We develop a chunk-aware causal flow matching model to support various synthesis scenarios.
arXiv Detail & Related papers (2024-12-13T12:59:39Z)
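Finite-scalar quantization, mentioned above for improving codebook utilization, replaces a learned codebook with per-channel rounding onto a small fixed grid, so every code is reachable by construction. A minimal sketch of the general technique follows; the level counts are placeholders, not CosyVoice 2's actual configuration.

```python
# Generic finite-scalar quantization (FSQ) sketch; level counts are illustrative.
import torch

def fsq(z, levels=(5, 5, 5, 5)):
    """Quantize each channel z[..., d] to `levels[d]` fixed values in [-1, 1],
    with a straight-through estimator for the encoder gradient.
    Odd level counts keep the rounding grid symmetric around zero."""
    lv = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (lv - 1) / 2
    z = torch.tanh(z) * half      # bound channel d to (-half[d], half[d])
    zq = torch.round(z)           # snap to one of `levels[d]` integers
    zq = z + (zq - z).detach()    # straight-through: identity gradient
    return zq / half              # rescale the codes back into [-1, 1]
```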
- OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
OmniFlatten is an end-to-end GPT-based model capable of effectively modeling the complex behaviors inherent in natural conversations with low latency.
Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z)
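The "flatten" in OmniFlatten points at the core modeling trick: the dialogue's parallel streams are interleaved into one token sequence that a standard GPT models autoregressively. The sketch below illustrates that reading with two streams and a made-up chunk size; the real interleaving scheme may differ.

```python
# Illustrative stream flattening; chunk size and ordering are assumptions.
def flatten_streams(listen_tokens, speak_tokens, chunk=5):
    """Interleave fixed-size chunks of the listening and speaking streams so
    a single autoregressive GPT sees both sides of the dialogue in time order."""
    flat = []
    for i in range(0, max(len(listen_tokens), len(speak_tokens)), chunk):
        flat += listen_tokens[i:i + chunk]   # what the model just heard
        flat += speak_tokens[i:i + chunk]    # what it says over the same span
    return flat
```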
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components, such as voice activity detection and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z)
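Part of what makes Moshi-style models real-time is that they do not finish text before starting speech: the backbone predicts text and audio tokens for the same frame in parallel streams. The two-head module below is a deliberate simplification of that idea (Moshi itself models several audio codebook streams); it is not the paper's architecture.

```python
# Simplified parallel text/audio prediction head; not Moshi's actual design.
import torch.nn as nn

class ParallelStreamHead(nn.Module):
    def __init__(self, d_model, text_vocab, audio_vocab):
        super().__init__()
        self.text_head = nn.Linear(d_model, text_vocab)
        self.audio_head = nn.Linear(d_model, audio_vocab)

    def forward(self, h):
        # One backbone state per frame yields both modalities at once.
        return self.text_head(h), self.audio_head(h)
```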
- Language Model Can Listen While Speaking [17.584201137311286]
The listening-while-speaking language model (LSLM) is an end-to-end system equipped with both listening and speaking channels.
Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems.
arXiv Detail & Related papers (2024-08-05T16:47:22Z)
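The listening and speaking channels have to meet somewhere: at each decoding step, features from the listening channel are fused into the generation state so the model can react mid-utterance, for example to a barge-in. The additive fusion below is one simple choice for illustration, not necessarily the fusion LSLM settles on.

```python
# One illustrative fusion of listening features into a speaking decoder step.
import torch.nn as nn

class FusedDecoderStep(nn.Module):
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.listen_proj = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, speak_state, listen_feat):
        # Inject what is currently being heard into the generation state;
        # the output vocabulary can include a "stop speaking" decision.
        fused = speak_state + self.listen_proj(listen_feat)
        return self.out(fused)
```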
- Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer [39.31849739010572]
We introduce the Generative Pre-trained Speech Transformer (GPST).
GPST is a hierarchical transformer designed for efficient speech language modeling.
arXiv Detail & Related papers (2024-06-03T04:16:30Z)
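A hierarchical speech LM typically splits the work between a large model that runs once per frame over the long sequence and a small model that expands each frame into its fine acoustic detail, which is where the efficiency comes from. The sketch below shows that general pattern with placeholder sizes and no attention masks; it is not GPST's exact architecture.

```python
# Generic hierarchical (global/local) speech LM pattern; sizes are placeholders
# and causal masking is omitted for brevity.
import torch.nn as nn

class HierarchicalSpeechLM(nn.Module):
    def __init__(self, d_model=512, vocab=1024, n_fine=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.global_tf = nn.TransformerEncoder(layer(), num_layers=6)  # big, coarse
        self.local_tf = nn.TransformerEncoder(layer(), num_layers=2)   # small, fine
        self.fine_heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_fine))

    def forward(self, frame_embs):                 # (batch, frames, d_model)
        ctx = self.global_tf(frame_embs)           # long-range context per frame
        fine = self.local_tf(ctx)                  # cheap refinement per frame
        return [head(fine) for head in self.fine_heads]  # fine-token logits
```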
- SALMONN: Towards Generic Hearing Abilities for Large Language Models [24.73033723114979]
We propose SALMONN, a speech audio language music open neural network.
It is built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model.
It is the first model of its type and can be regarded as a step towards AI with generic hearing abilities.
arXiv Detail & Related papers (2023-10-20T05:41:57Z)
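The integration hinges on a connector between the frozen encoders and the LLM; SALMONN uses a window-level Q-Former, and the sketch below captures that family of connectors in simplified form: a few learned queries cross-attend into encoder features and are projected to the LLM width. Dimensions and the single attention layer are placeholders.

```python
# Simplified Q-Former-style audio connector; dimensions are placeholders.
import torch
import torch.nn as nn

class AudioConnector(nn.Module):
    def __init__(self, d_audio=1024, d_llm=4096, n_queries=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_audio))
        self.attn = nn.MultiheadAttention(d_audio, num_heads=8, batch_first=True)
        self.proj = nn.Linear(d_audio, d_llm)

    def forward(self, audio_feats):                  # (batch, frames, d_audio)
        q = self.queries.expand(audio_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, audio_feats, audio_feats)  # queries read audio
        return self.proj(pooled)   # a few tokens the LLM consumes next to text
```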
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
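One decoder covering zero-shot TTS, noise suppression, target speaker extraction, speech removal, and editing implies some switch between tasks; in SpeechX that switch is task-dependent prompting of a neural codec language model. The sketch below shows the prompting pattern with invented token names.

```python
# Illustrative task-conditioned prompt assembly; all token names are made up.
def build_prompt(task, text_tokens, audio_tokens):
    task_tags = {"tts": ["<tts>"], "denoise": ["<denoise>"],
                 "extract": ["<tse>"], "edit": ["<edit>"]}
    # The same codec LM sees a different prefix per task and continues
    # the acoustic token stream accordingly.
    return task_tags[task] + text_tokens + ["<sep>"] + audio_tokens
```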
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
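In the decoder-only pattern, acoustic information enters the LLM as a prefix: encoder features are projected into the LLM's embedding space and concatenated in front of the text embeddings, so one decoder attends over speech and text jointly. The sketch below is minimal on purpose; Speech-LLaMA additionally compresses the acoustic sequence before the adapter, which is omitted here, and the linear adapter itself is an assumption.

```python
# Minimal acoustic-prefix sketch for a decoder-only LLM; the adapter is assumed.
import torch
import torch.nn as nn

def prepend_speech(llm_embed: nn.Embedding, adapter: nn.Linear,
                   speech_feats: torch.Tensor, text_ids: torch.Tensor):
    speech_embs = adapter(speech_feats)   # (frames, d_llm): speech as soft tokens
    text_embs = llm_embed(text_ids)       # (tokens, d_llm)
    return torch.cat([speech_embs, text_embs], dim=0)  # one joint input sequence
```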