Related papers: SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement

URL: http://arxiv.org/abs/2508.20916v2
Date: Mon, 10 Nov 2025 07:45:48 GMT
Title: SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement
Authors: Yuan Ge, Junxiang Zhang, Xiaoqian Liu, Bei Li, Xiangnan Ma, Chenglong Wang, Kaiyang Ye, Yangfan Du, Linfeng Zhang, Yuxin Huang, Tong Xiao, Zhengtao Yu, JingBo Zhu
Abstract summary: Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive S2S LLMs evaluation.
Score: 74.51476422119457
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive S2S LLMs evaluation. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.

Related papers

Session-Level Spoken Language Assessment with a Multimodal Foundation Model via Multi-Target Learning [8.717610965852037]
Spoken Language Assessment (SLA) estimates a learner's oral proficiency from spontaneous speech. This paper introduces a novel multimodal foundation model approach that performs session-level evaluation in a single pass.
arXiv Detail & Related papers (2025-09-19T14:33:05Z)
MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols [25.171718066659775]
MTalk-Bench is a multi-turn S2S benchmark covering three core dimensions: Semantic Information, Paralinguistic Information, and Ambient Sound. Our dual-method evaluation framework combines Arena-style evaluation (pairwise comparison) and Rubrics-based evaluation (absolute scoring) for relative and absolute assessment. Results highlight current limitations in S2S evaluation and the need for more robust, speech-aware assessment frameworks.
arXiv Detail & Related papers (2025-08-22T12:14:17Z)
A Unified Speech LLM for Diarization and Speech Recognition in Multilingual Conversations [25.58593495281218]
We propose a unified speech LLM that jointly performs diarization and ASR in an end-to-end manner. By reformulating the training data format and modifying the inference procedure, our model addresses the ambiguity inherent in pre-segmented audio.
arXiv Detail & Related papers (2025-06-26T01:54:02Z)
Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models [79.90523648823522]
Multi-stage continual learning can lead to catastrophic forgetting. This paper evaluates three mitigation strategies: model merging, discounting the LoRA scaling factor, and experience replay. Results show that experience replay is the most effective, with further gains achieved by combining it with other methods.
arXiv Detail & Related papers (2025-05-23T05:50:14Z)
If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs [55.8331366739144]
We introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in large language models (LLMs). Our fact checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches.
arXiv Detail & Related papers (2025-03-30T16:50:57Z)
Audio Large Language Models Can Be Descriptive Speech Quality Evaluators [46.765203628127345]
We introduce the first natural language-based speech evaluation corpus, generated from authentic human ratings. This corpus offers detailed analysis across multiple dimensions and identifies causes of quality degradation. We propose an alignment approach with LLM distillation (ALLD) to guide the audio LLM in extracting relevant information from raw speech.
arXiv Detail & Related papers (2025-01-27T22:47:51Z)
Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs). Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation.
arXiv Detail & Related papers (2024-11-04T06:07:53Z)
Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL). ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers. Experiments on the LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions. Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text. We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.