Related papers: SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

URL: http://arxiv.org/abs/2505.15670v4
Date: Fri, 25 Jul 2025 15:07:10 GMT
Title: SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model
Authors: Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg,
Abstract summary: We propose a novel duplex speech to speech (S2S) architecture featuring continuous user inputs and agent outputs with channel fusion.<n>Using a pretrained streaming for user input enables the first duplex S2S model without requiring speech pretrain.<n> Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities.
Score: 28.42203609938444
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech to speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model without requiring speech pretrain. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretrain is skipped, which markedly simplifies the process of building a duplex S2S model from any LLMs. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.

Related papers

TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models [2.176487921193175]
TalkingMachines is an efficient framework that transforms pretrained video generation models into real-time, audio-driven character animators.<n>TalkingMachines enables natural conversational experiences by integrating an audio large language model (LLM) with our video generation foundation model.
arXiv Detail & Related papers (2025-06-03T17:29:28Z)
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model [70.25062476543091]
VITA-Audio is an end-to-end large speech model with fast audio-text token generation.<n>MCTP module efficiently generates multiple audio tokens within a single model forward pass.<n>Four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality.
arXiv Detail & Related papers (2025-05-06T17:59:53Z)
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [74.80386066714229]
We present an improved streaming speech synthesis model, CosyVoice 2.<n>Specifically, we introduce finite-scalar quantization to improve codebook utilization of speech tokens.<n>We develop a chunk-aware causal flow matching model to support various synthesis scenarios.
arXiv Detail & Related papers (2024-12-13T12:59:39Z)
Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model. We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
arXiv Detail & Related papers (2024-06-11T14:17:12Z)
Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
Cross-MixSpeaker. Network addresses limitations of SIMO models by aggregating cross-speaker representations. Network is integrated with SOT to leverage both the advantages of SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization [48.35495352015281]
End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model. Due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and output unnatural sentences. We propose for the first time to integrate a pre-trained language model (LM) into the E2E SSum decoder via transfer learning.
arXiv Detail & Related papers (2023-06-07T08:23:58Z)
Duplex Diffusion Models Improve Speech-to-Speech Translation [1.4649095013539173]
Speech-to-speech translation is a sequence-to-sequence learning task that naturally has two directions. We propose a duplex diffusion model that applies diffusion probabilistic models to both sides of a reversible duplex Conformer. Our model enables reversible speech translation by simply flipping the input and output ends.
arXiv Detail & Related papers (2023-05-22T01:39:40Z)
LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure. We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model. Our model achieves better recognition performance on CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model. The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses. A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.