Related papers: OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

URL: http://arxiv.org/abs/2507.05177v2
Date: Tue, 08 Jul 2025 14:28:55 GMT
Title: OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model
Authors: Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang
Abstract summary: We present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S employs a streaming interleaved decoding architecture to achieve low-latency speech generation. By leveraging large language models to generate empathetic content and controllable text-to-speech systems, we construct a scalable training corpus with rich paralinguistic diversity.
Score: 47.84522683404745
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S

Related papers

ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation [30.006550552714938]
Empathetic speech dialogue requires not only understanding linguistic content but also perceiving rich paralinguistic information. Existing speech-to-speech large language models either rely on ASR transcription or use encoders to extract latent representations. We propose ES4R, a framework for speech-based empathetic response generation.
arXiv Detail & Related papers (2026-01-16T10:26:50Z)
Empathy Omni: Enabling Empathetic Speech Response Generation through Large Language Models [38.5764934392601]
We propose Emotion Omni, a model that understands emotional content in user speech and generates empathetic responses. Emotion Omni achieves comparable instruction-following ability without large-scale pretraining, while surpassing existing models in speech quality.
arXiv Detail & Related papers (2025-08-26T03:54:39Z)
What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study [58.55905182336196]
Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. We investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. We introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens.
arXiv Detail & Related papers (2025-06-14T15:26:31Z)
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [74.80386066714229]
We present an improved streaming speech synthesis model, CosyVoice 2. Specifically, we introduce finite-scalar quantization to improve codebook utilization of speech tokens. We develop a chunk-aware causal flow matching model to support various synthesis scenarios.
arXiv Detail & Related papers (2024-12-13T12:59:39Z)
Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech [29.847183061204436]
This work studies the capabilities of a large language model (LLM) to understand paralinguistic aspects of speech without fine-tuning its weights. We utilize an end-to-end system with a speech encoder, which is trained to produce token embeddings such that the LLM's response to an expressive speech prompt is aligned with its response to a semantically matching text prompt.
arXiv Detail & Related papers (2024-10-02T01:32:47Z)
Language Model Can Listen While Speaking [17.584201137311286]
Listen-while-speaking language model (LSLM) is an end-to-end system equipped with both listening and speaking channels. Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems.
arXiv Detail & Related papers (2024-08-05T16:47:22Z)
BLSP-Emo: Towards Empathetic Large Speech-Language Models [34.62210186235263]
We present BLSP-Emo, a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech. Our experiments demonstrate that the BLSP-Emo model excels in comprehending speech and delivering empathetic responses.
arXiv Detail & Related papers (2024-06-06T09:02:31Z)
On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts. A recent Text-to-Speech architecture is designed for multiple code generation and monotonic alignment. We show that the recent Text-to-Speech architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU). We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.