BoSS: Beyond-Semantic Speech
- URL: http://arxiv.org/abs/2507.17563v1
- Date: Wed, 23 Jul 2025 14:53:50 GMT
- Title: BoSS: Beyond-Semantic Speech
- Authors: Qing Wang, Zehan Li, Hang Lv, Hongjie Chen, Yaodong Song, Jian Kang, Jie Lian, Jie Li, Yongxiang Li, Zhongjiang He, Xuelong Li
- Abstract summary: Beyond-Semantic Speech (BoSS) refers to the set of information in speech communication that encompasses but transcends explicit semantics. We present a formalized framework for BoSS, leveraging cognitive relevance theories and machine learning models to analyze temporal and contextual speech dynamics. These findings highlight the need for advancing BoSS research to enable richer, more context-aware human-machine communication.
- Score: 43.96461266560891
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human communication involves more than explicit semantics, with implicit signals and contextual cues playing a critical role in shaping meaning. However, modern speech technologies, such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS), often fail to capture these beyond-semantic dimensions. To better characterize and benchmark the progression of speech intelligence, we introduce Spoken Interaction System Capability Levels (L1-L5), a hierarchical framework illustrating the evolution of spoken dialogue systems from basic command recognition to human-like social interaction. To support these advanced capabilities, we propose Beyond-Semantic Speech (BoSS), which refers to the set of information in speech communication that encompasses but transcends explicit semantics. It conveys emotions, contexts, and modifies or extends meanings through multidimensional features such as affective cues, contextual dynamics, and implicit semantics, thereby enhancing the understanding of communicative intentions and scenarios. We present a formalized framework for BoSS, leveraging cognitive relevance theories and machine learning models to analyze temporal and contextual speech dynamics. We evaluate BoSS-related attributes across five different dimensions, revealing that current spoken language models (SLMs) struggle to fully interpret beyond-semantic signals. These findings highlight the need for advancing BoSS research to enable richer, more context-aware human-machine communication.
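The abstract names affective cues, contextual dynamics, and implicit semantics as the feature families that carry beyond-semantic information. As a minimal illustrative sketch only (the paper's actual schema is not given here, and every class and field name below is a hypothetical placeholder), the Python fragment pairs an ASR transcript with such BoSS attributes:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BoSSAnnotation:
    """Hypothetical container pairing explicit semantics (the transcript)
    with the beyond-semantic feature families named in the abstract."""
    transcript: str                                           # explicit semantic content, e.g. ASR output
    affective_cues: dict = field(default_factory=dict)        # e.g. {"emotion": "frustration", "arousal": 0.7}
    contextual_dynamics: dict = field(default_factory=dict)   # e.g. {"turn_index": 3, "topic_shift": False}
    implicit_semantics: Optional[str] = None                  # inferred intent beyond the literal words


# Toy usage: a nominally positive utterance whose intended meaning
# depends on affect and context rather than on the words alone.
utterance = BoSSAnnotation(
    transcript="great, another meeting",
    affective_cues={"emotion": "frustration"},
    contextual_dynamics={"prior_turn": "scheduling request"},
    implicit_semantics="speaker is reluctant despite positive wording",
)
print(utterance.implicit_semantics)
```

A spoken dialogue system at the higher levels of the L1-L5 hierarchy would need to infer fields like these from the signal and the dialogue context, rather than receiving them as annotations.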
Related papers
- GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness [43.67571101152883]
We introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker characteristic awareness. GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization. We show that GOAT-SLM achieves well-balanced performance across both semantic and non-semantic tasks, and outperforms existing open-source models in handling emotion, dialectal variation, and age-sensitive interactions.
arXiv Detail & Related papers (2025-07-24T06:10:29Z) - MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark [42.58439306999647]
MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. We ground our benchmark in linguistic theory, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. MMSU establishes a new standard for comprehensive assessment of spoken language understanding.
arXiv Detail & Related papers (2025-06-05T09:09:36Z) - Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data [33.85748258158527]
Empathetic dialogue is crucial for natural human-computer interaction. Large language models (LLMs) have revolutionized dialogue generation by harnessing their powerful capabilities. We propose a novel approach that circumvents the need for question-answering data.
arXiv Detail & Related papers (2025-01-19T04:10:53Z) - PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control [20.873353104077857]
We introduce an approach centered on prompt-based emotion control. The proposed architecture incorporates emotion and intensity control across multiple speakers. We leverage large language models (LLMs) to manipulate speech prosody while preserving linguistic content.
arXiv Detail & Related papers (2025-01-10T12:10:30Z) - WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z) - Roadmap towards Superhuman Speech Understanding using Large Language Models [60.57947401837938]
Large language models (LLMs) integrate speech and audio data.
Recent advances, such as GPT-4o, highlight the potential for end-to-end speech LLMs.
We propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models.
arXiv Detail & Related papers (2024-10-17T06:44:06Z) - Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling [40.32021786228235]
Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the appropriate style within a conversational setting.
We propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS.
To address data scarcity, we create emphasis intensity annotations on the existing conversational dataset (DailyTalk).
arXiv Detail & Related papers (2024-10-12T13:02:31Z) - SIFToM: Robust Spoken Instruction Following through Theory of Mind [51.326266354164716]
We present a cognitively inspired model, Speech Instruction Following through Theory of Mind (SIFToM), to enable robots to pragmatically follow human instructions under diverse speech conditions.
Results show that the SIFToM model outperforms state-of-the-art speech and language models, approaching human-level accuracy on challenging speech instruction following tasks.
arXiv Detail & Related papers (2024-09-17T02:36:10Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines in four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z) - CMSBERT-CLR: Context-driven Modality Shifting BERT with Contrastive Learning for linguistic, visual, acoustic Representations [0.7081604594416336]
We present a Context-driven Modality Shifting BERT with Contrastive Learning for linguistic, visual, acoustic Representations (CMSBERT-CLR).
CMSBERT-CLR incorporates the whole context's non-verbal and verbal information and aligns modalities more effectively through contrastive learning (see the alignment-loss sketch after this list).
In our experiments, we demonstrate that our approach achieves state-of-the-art results.
arXiv Detail & Related papers (2022-08-21T08:21:43Z)
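The CMSBERT-CLR entry above aligns verbal and non-verbal representations through contrastive learning. Its exact objective is not given in this listing, so the following is a generic sketch of a symmetric InfoNCE-style alignment loss between paired linguistic and acoustic embeddings, written in PyTorch; the function name, tensor shapes, and temperature value are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(linguistic_emb: torch.Tensor,
                               acoustic_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss: matched (linguistic, acoustic) pairs in a
    batch are pulled together, mismatched pairs are pushed apart. A generic
    sketch of contrastive modality alignment, not CMSBERT-CLR's exact loss."""
    # L2-normalize so the dot product becomes cosine similarity.
    lin = F.normalize(linguistic_emb, dim=-1)
    aco = F.normalize(acoustic_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares utterance i's
    # linguistic embedding with utterance j's acoustic embedding.
    logits = lin @ aco.t() / temperature

    # Diagonal entries are the true (paired) matches.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (text -> audio and audio -> text).
    loss_t2a = F.cross_entropy(logits, targets)
    loss_a2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2a + loss_a2t)


# Toy usage with random embeddings standing in for encoder outputs.
batch, dim = 8, 256
text_vecs = torch.randn(batch, dim)
audio_vecs = torch.randn(batch, dim)
print(contrastive_alignment_loss(text_vecs, audio_vecs))
```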