Zero-Shot Speech LLMs for Multi-Aspect Evaluation of L2 Speech: Challenges and Opportunities
- URL: http://arxiv.org/abs/2601.16230v1
- Date: Tue, 20 Jan 2026 15:48:38 GMT
- Title: Zero-Shot Speech LLMs for Multi-Aspect Evaluation of L2 Speech: Challenges and Opportunities
- Authors: Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik
- Abstract summary: This paper evaluates the zero-shot performance of Qwen2-Audio-7B-Instruct, an instruction-tuned speech-LLM, on 5,000 Speechocean762 utterances. The model generates scores for accuracy, fluency, prosody, and completeness, showing strong agreement with human ratings within ±2 tolerance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An accurate assessment of L2 English pronunciation is crucial for language learning, as it provides personalized feedback and ensures a fair evaluation of individual progress. However, automated scoring remains challenging due to the complexity of sentence-level fluency, prosody, and completeness. This paper evaluates the zero-shot performance of Qwen2-Audio-7B-Instruct, an instruction-tuned speech-LLM, on 5,000 Speechocean762 utterances. The model generates rubric-aligned scores for accuracy, fluency, prosody, and completeness, showing strong agreement with human ratings within ±2 tolerance, especially for high-quality speech. However, it tends to overpredict low-quality speech scores and lacks precision in error detection. These findings demonstrate the strong potential of speech LLMs in scalable pronunciation assessment and suggest future improvements through enhanced prompting, calibration, and phonetic integration to advance Computer-Assisted Pronunciation Training.
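The agreement-within-tolerance metric mentioned in the abstract can be sketched in a few lines of Python. This is a minimal illustration only: the `human` and `model` score lists are made-up values on a hypothetical 0-10 rubric scale, not data from the paper.

```python
# Hypothetical rubric scores on a 0-10 sentence-level scale; these numbers
# are illustrative, not taken from the paper.
human = [8, 9, 7, 3, 10, 6]   # expert ratings
model = [9, 9, 8, 6, 10, 7]   # speech-LLM predictions

# Agreement within +-2: fraction of utterances where |model - human| <= 2
within_tol = sum(abs(m - h) <= 2 for m, h in zip(model, human)) / len(human)
print(f"Agreement within +-2: {within_tol:.2f}")  # 5 of 6 utterances -> 0.83
```

The fourth pair (3 vs. 6) illustrates the overprediction on low-quality speech the abstract reports: the model score exceeds the human score by more than the tolerance.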
Related papers
- SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models [76.07833875692722]
Speech-based Intelligence Quotient (SIQ) is a new form of human cognition-inspired evaluation pipeline for voice understanding large language models. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks.
arXiv Detail & Related papers (2025-07-25T15:12:06Z)
- Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice [52.747242157396315]
Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry. We introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities.
arXiv Detail & Related papers (2025-07-23T14:07:41Z) - Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models [49.1574468325115]
Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. We introduce Speech-IFEval, an evaluation framework designed to assess instruction-following capabilities. Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs.
arXiv Detail & Related papers (2025-05-25T08:37:55Z) - Automatic Proficiency Assessment in L2 English Learners [51.652753736780205]
Second language (L2) proficiency in English is usually perceptually evaluated by English teachers or expert evaluators. This paper explores deep learning techniques for comprehensive L2 proficiency assessment, addressing both the speech signal and its corresponding transcription.
arXiv Detail & Related papers (2025-05-05T12:36:03Z) - Audio Large Language Models Can Be Descriptive Speech Quality Evaluators [46.765203628127345]
We introduce the first natural language-based speech evaluation corpus, generated from authentic human ratings. This corpus offers detailed analysis across multiple dimensions and identifies causes of quality degradation. We propose an alignment approach with LLM distillation (ALLD) to guide the audio LLM in extracting relevant information from raw speech.
arXiv Detail & Related papers (2025-01-27T22:47:51Z) - Speak & Improve Corpus 2025: an L2 English Speech Corpus for Language Assessment and Feedback [28.53752312060031]
Speak & Improve Corpus 2025 is a dataset of L2 learner English data with holistic scores and language error annotation. The aim of the corpus release is to address a major challenge in developing L2 spoken language processing systems. It is being made available for non-commercial use on the ELiT website.
arXiv Detail & Related papers (2024-12-16T17:07:26Z) - High-Quality Automatic Voice Over with Accurate Alignment: Supervision
through Self-Supervised Discrete Speech Units [69.06657692891447]
We propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction.
Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality.
arXiv Detail & Related papers (2023-06-29T15:02:22Z) - Incorporating L2 Phonemes Using Articulatory Features for Robust Speech
Recognition [2.8360662552057323]
This study focuses on the efficient incorporation of L2 phonemes, which in this work refer to Korean phonemes, through articulatory feature analysis.
We employ the lattice-free maximum mutual information (LF-MMI) objective in an end-to-end manner, to train the acoustic model to align and predict one of multiple pronunciation candidates.
Experimental results show that the proposed method improves ASR accuracy for Korean L2 speech by training solely on L1 speech data.
arXiv Detail & Related papers (2023-06-05T01:55:33Z) - SpeechLMScore: Evaluating speech generation using speech language model [43.20067175503602]
We propose SpeechLMScore, an unsupervised metric to evaluate generated speech using a speech-language model.
It does not require human annotation and is a highly scalable framework.
Evaluation results demonstrate that the proposed metric shows a promising correlation with human evaluation scores on different speech generation tasks.
arXiv Detail & Related papers (2022-12-08T21:00:15Z)
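The SpeechLMScore idea summarized above can be sketched as scoring an utterance by the average log-likelihood that a speech language model assigns to its discretized speech units. The sketch below is an illustration under that assumption; `token_logprobs` is a hypothetical stand-in for real per-unit log-probabilities from a speech LM.

```python
def speech_lm_score(token_logprobs):
    """Average per-token log-likelihood of an utterance's discrete speech
    units under a speech language model; higher suggests more natural speech."""
    return sum(token_logprobs) / len(token_logprobs)

# Hypothetical log-probabilities emitted by a speech LM for four speech units
print(speech_lm_score([-1.2, -0.8, -2.0, -1.5]))  # -1.375
```

Because the score is derived entirely from a pretrained speech LM, no human annotation is needed, which is what makes the metric scalable.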
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.