An Effective Strategy for Modeling Score Ordinality and Non-uniform Intervals in Automated Speaking Assessment
- URL: http://arxiv.org/abs/2509.03372v2
- Date: Sun, 21 Sep 2025 05:40:47 GMT
- Title: An Effective Strategy for Modeling Score Ordinality and Non-uniform Intervals in Automated Speaking Assessment
- Authors: Tien-Hong Lo, Szu-Yu Chen, Yao-Ting Sung, Berlin Chen
- Abstract summary: Self-supervised learning representations capture rich acoustic and linguistic patterns in non-native speech without underlying assumptions of feature curation. Most prior studies treat proficiency levels as nominal classes, ignoring their ordinal structure and the non-uniform intervals between proficiency labels. We propose an effective ASA approach combining SSL with handcrafted indicator features via a novel modeling paradigm.
- Score: 14.003981407136072
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A recent line of research on automated speaking assessment (ASA) has benefited from self-supervised learning (SSL) representations, which capture rich acoustic and linguistic patterns in non-native speech without underlying assumptions of feature curation. However, speech-based SSL models capture acoustic-related traits but overlook linguistic content, while text-based SSL models rely on ASR output and fail to encode prosodic nuances. Moreover, most prior studies treat proficiency levels as nominal classes, ignoring their ordinal structure and the non-uniform intervals between proficiency labels. To address these limitations, we propose an effective ASA approach combining SSL with handcrafted indicator features via a novel modeling paradigm. We further introduce a multi-margin ordinal loss that jointly models both the score ordinality and non-uniform intervals of proficiency labels. Extensive experiments on the TEEMI corpus show that our method consistently outperforms strong baselines and generalizes well to unseen prompts.
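The abstract names a multi-margin ordinal loss but does not spell out its form in this listing. As a minimal sketch of one plausible reading, the PyTorch snippet below implements an all-threshold ordinal hinge loss in which each of the K-1 level boundaries carries its own margin, so adjacent proficiency labels can sit at non-uniform distances; the class name, margin values, and four-level setup are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class MultiMarginOrdinalLoss(nn.Module):
    """Sketch of an ordinal loss with per-boundary margins.

    A score in {0, ..., K-1} is decomposed into K-1 ordered binary
    decisions ("is the proficiency above boundary k?"). Giving each
    boundary its own margin lets the gaps between adjacent levels be
    non-uniform. The margin values here are hypothetical.
    """

    def __init__(self, margins):
        super().__init__()
        # margins[k]: required margin at boundary k (K-1 values).
        self.register_buffer(
            "margins", torch.as_tensor(margins, dtype=torch.float32))

    def forward(self, logits, labels):
        # logits: (batch, K-1) boundary scores; labels: (batch,) in {0..K-1}.
        k = torch.arange(self.margins.numel(), device=labels.device)
        # sign[i, k] = +1 if sample i's true level lies above boundary k.
        sign = (labels.unsqueeze(1) > k).float() * 2.0 - 1.0
        # Hinge penalty whenever a signed logit misses its boundary margin.
        return torch.relu(self.margins - sign * logits).sum(dim=1).mean()

# Toy usage: four proficiency levels with wider margins at higher levels.
loss_fn = MultiMarginOrdinalLoss(margins=[0.5, 1.0, 2.0])
logits = torch.randn(8, 3, requires_grad=True)
labels = torch.randint(0, 4, (8,))
loss = loss_fn(logits, labels)
loss.backward()
```

A model trained this way predicts K-1 monotone boundary scores, and the final level can be read off at inference as the number of boundaries the logits clear.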
Related papers
- Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback [82.70507055599093]
We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models. Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness.
arXiv Detail & Related papers (2026-01-27T00:55:14Z)
- Semi-Supervised Disease Detection from Speech Dialogues with Multi-Level Data Modeling [27.224093715611534]
We propose a novel framework for learning to detect medical conditions from speech acoustics. Our end-to-end approach dynamically aggregates multi-granularity features and generates high-quality pseudo-labels. This work provides a principled approach to learning from weak, far-end supervision in medical speech analysis.
arXiv Detail & Related papers (2026-01-08T09:10:16Z)
- WESR: Scaling and Evaluating Word-level Event-Speech Recognition [59.21814194620928]
Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. We develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol.
arXiv Detail & Related papers (2026-01-08T02:23:21Z)
- New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR [30.00166986946003]
We take a new perspective, regarding alignment and matching as a detection problem: the goal is to identify meaningful correspondences with high precision and recall, ensuring full coverage of linguistic tokens. We propose an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries.
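The entry names an unbalanced optimal-transport alignment without giving equations. As a hedged sketch, the NumPy code below runs generalized Sinkhorn iterations for entropy-regularized unbalanced OT with KL-relaxed marginals, softly aligning acoustic frames to linguistic tokens; the cosine-distance cost, uniform masses, and regularization weights are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def unbalanced_sinkhorn(a, b, cost, reg=0.05, reg_m=1.0, n_iter=200):
    """Entropy-regularized unbalanced OT with KL marginal relaxation.

    a: (n,) mass over acoustic frames; b: (m,) mass over linguistic tokens.
    cost: (n, m) pairwise cost (e.g., cosine distance of embeddings).
    reg: entropic regularization; reg_m: marginal relaxation strength.
    Returns a soft alignment plan of shape (n, m).
    """
    K = np.exp(-cost / reg)               # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    fi = reg_m / (reg_m + reg)            # exponent from the KL relaxation
    for _ in range(n_iter):
        u = (a / (K @ v + 1e-16)) ** fi
        v = (b / (K.T @ u + 1e-16)) ** fi
    return u[:, None] * K * v[None, :]

# Toy example: 6 frames vs. 3 tokens with random embeddings.
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 16))
tokens = rng.normal(size=(3, 16))
fn = frames / np.linalg.norm(frames, axis=1, keepdims=True)
tn = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
cost = 1.0 - fn @ tn.T                    # cosine distance
plan = unbalanced_sinkhorn(np.full(6, 1 / 6), np.full(3, 1 / 3), cost)
print(plan.round(3))                      # rows: frames, columns: tokens
```

Because reg_m only softly penalizes marginal deviations, the plan can leave mass on unmatched frames or tokens, which is what makes an unbalanced formulation a natural fit for the detection framing above.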
arXiv Detail & Related papers (2025-09-06T05:58:52Z)
- A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions [8.717610965852037]
We propose a novel training paradigm to generate diverse responses at a given proficiency level. We convert the responses into synthesized speech via speaker-aware text-to-speech synthesis. A multimodal large language model integrates aligned textual features with speech signals to predict proficiency scores directly.
arXiv Detail & Related papers (2025-06-04T15:42:53Z)
- Refining Sentence Embedding Model through Ranking Sentences Generation with Large Language Models [60.00178316095646]
Sentence embedding is essential for many NLP tasks, with contrastive learning methods achieving strong performance using datasets like NLI. Recent studies leverage large language models (LLMs) to generate sentence pairs, reducing annotation dependency. We propose a method for controlling the generation direction of LLMs in the latent space. Unlike unconstrained generation, the controlled approach ensures meaningful semantic divergence. Experiments on multiple benchmarks demonstrate that our method achieves new SOTA performance with a modest cost in ranking sentence synthesis.
arXiv Detail & Related papers (2025-02-19T12:07:53Z)
- MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models [59.80042864360884]
Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately. This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions.
arXiv Detail & Related papers (2024-11-27T09:01:08Z)
- Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs). Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation.
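Align-SLM's preference-data step, generating several continuations per prompt and ranking them with a semantic metric before DPO, can be sketched as below; generate_continuations and semantic_score are hypothetical stand-ins for the SLM sampler and the semantic metric, not the authors' interfaces.

```python
# Illustrative sketch: turn N sampled continuations per prompt into
# DPO preference pairs using a semantic score. `generate_continuations`
# and `semantic_score` are hypothetical stand-ins, not Align-SLM's API.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # continuation preferred by the semantic metric
    rejected: str    # continuation dispreferred by the metric

def build_dpo_pairs(prompts, generate_continuations, semantic_score,
                    n_samples=4):
    pairs = []
    for prompt in prompts:
        continuations = generate_continuations(prompt, n_samples)
        # Rank continuations by semantic quality, ascending.
        ranked = sorted(continuations,
                        key=lambda c: semantic_score(prompt, c))
        # Pair the best-scoring continuation against the worst.
        pairs.append(PreferencePair(prompt,
                                    chosen=ranked[-1],
                                    rejected=ranked[0]))
    return pairs
```

Pairing best against worst per prompt is one simple policy; the actual framework may form pairs differently, but any such pairs can be fed directly to a standard DPO objective.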
arXiv Detail & Related papers (2024-11-04T06:07:53Z)
- An Effective Automated Speaking Assessment Approach to Mitigating Data Scarcity and Imbalanced Distribution [5.1660803395535835]
Self-supervised learning (SSL) has shown stellar performance compared to traditional methods. However, SSL-based ASA systems are faced with at least three data-related challenges: limited annotated data, uneven distribution of learner proficiency levels, and non-uniform score intervals between different CEFR proficiency levels.
arXiv Detail & Related papers (2024-04-11T09:06:49Z)
- Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations [30.293081541301746]
Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition.
We argue that their weaker performance on utterance-level tasks is caused by the lack of disentangled representations and of an utterance-level learning objective.
Our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.
arXiv Detail & Related papers (2023-05-14T08:26:24Z)
- On Data-Augmentation and Consistency-Based Semi-Supervised Learning [77.57285768500225]
Recently proposed consistency-based Semi-Supervised Learning (SSL) methods have advanced the state of the art in several SSL tasks.
Despite these advances, the understanding of these methods is still relatively limited.
arXiv Detail & Related papers (2021-01-18T10:12:31Z)