Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment
- URL: http://arxiv.org/abs/2510.16387v1
- Date: Sat, 18 Oct 2025 08:10:24 GMT
- Title: Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment
- Authors: Fu-An Chao, Bi-Cheng Yan, Berlin Chen
- Abstract summary: We explore the untapped potential of Whisper, a well-established automatic speech recognition (ASR) foundation model. Our approach goes a step further to probe its latent capabilities by extracting acoustic and linguistic features from hidden representations. We conduct an in-depth analysis of Whisper's embeddings, which reveals that, even without task-specific fine-tuning, the model intrinsically encodes both ordinal proficiency patterns and semantic aspects of speech.
- Score: 17.656808708384435
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we explore the untapped potential of Whisper, a well-established automatic speech recognition (ASR) foundation model, in the context of L2 spoken language assessment (SLA). Unlike prior studies that extrinsically analyze transcriptions produced by Whisper, our approach goes a step further to probe its latent capabilities by extracting acoustic and linguistic features from hidden representations. With only a lightweight classifier being trained on top of Whisper's intermediate and final outputs, our method achieves strong performance on the GEPT picture-description dataset, outperforming existing cutting-edge baselines, including a multimodal approach. Furthermore, by incorporating image and text-prompt information as auxiliary relevance cues, we demonstrate additional performance gains. Finally, we conduct an in-depth analysis of Whisper's embeddings, which reveals that, even without task-specific fine-tuning, the model intrinsically encodes both ordinal proficiency patterns and semantic aspects of speech, highlighting its potential as a powerful foundation for SLA and other spoken language understanding tasks.
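As a concrete illustration of the probing recipe described in the abstract, the sketch below trains a lightweight linear classifier on pooled Whisper encoder states. It is a minimal sketch, not the authors' released code: the whisper-small checkpoint, mean pooling, the probed layer, and the three proficiency bands are all illustrative assumptions.

```python
# Minimal probing sketch (assumptions: whisper-small, mean pooling over time,
# a single encoder layer, 3 proficiency bands; not the authors' exact setup).
import torch
from transformers import WhisperProcessor, WhisperModel

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder
encoder.eval()  # the foundation model stays frozen; only the probe is trained

@torch.no_grad()
def pooled_features(waveform, sr=16_000, layer=-1):
    """Mean-pool one encoder layer's hidden states into a fixed-size vector."""
    inputs = processor(waveform, sampling_rate=sr, return_tensors="pt")
    out = encoder(inputs.input_features, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)  # (d_model,)

# Lightweight probe: one linear layer over the pooled embedding.
probe = torch.nn.Linear(encoder.config.d_model, 3)  # e.g., 3 proficiency bands
```

Training then reduces to fitting `probe` with a standard classification objective, which is what makes the approach attractive: the heavy model serves purely as a frozen feature extractor.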
Related papers
- Harmonizing the Arabic Audio Space with Data Scheduling [15.84874997729878]
This paper presents the first systematic study of multi-task instruction tuning for an Arabic-centric audio LLM. We fine-tune Qwen2.5-Omni (7B) and propose Task-Progressive Curriculum (TPC) along with Aligner-Based Diverse Sampling (ADS). Our results reveal a critical efficiency-robustness trade-off: while ADS accelerates initial convergence, its inherent gradient volatility can destabilize generative decoding under prolonged training.
arXiv Detail & Related papers (2026-01-18T17:08:31Z)
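The abstract gives few details of TPC, so the following is only a hedged sketch of the general idea behind a task-progressive curriculum: tasks are unlocked in stages rather than mixed uniformly from the start. The task names, stage ordering, and stage length are invented for illustration and are not taken from the paper.

```python
# Hypothetical task-progressive curriculum: later stages widen the task mixture.
# Task names and stage lengths are illustrative, not the paper's schedule.
import random

STAGES = [
    ["asr"],                                      # stage 1: core transcription
    ["asr", "speech_translation"],                # stage 2: add harder tasks
    ["asr", "speech_translation", "dialect_id"],  # stage 3: full mixture
]

def sample_task(step: int, steps_per_stage: int = 10_000) -> str:
    """Pick a training task for this step according to the current stage."""
    stage = min(step // steps_per_stage, len(STAGES) - 1)
    return random.choice(STAGES[stage])
```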
- Session-Level Spoken Language Assessment with a Multimodal Foundation Model via Multi-Target Learning [8.717610965852037]
Spoken Language Assessment (SLA) estimates a learner's oral proficiency from spontaneous speech. This paper introduces a novel multimodal foundation model approach that performs session-level evaluation in a single pass.
arXiv Detail & Related papers (2025-09-19T14:33:05Z)
- TICL: Text-Embedding KNN For Speech In-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models [27.013776992438086]
We propose Text-Embedding KNN for SICL (TICL) to enhance off-the-shelf large multimodal models' speech recognition ability without fine-tuning. Our method enables models to surpass zero-shot performance with up to 84.7% relative WER reduction.
arXiv Detail & Related papers (2025-09-16T17:07:23Z)
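A plausible reading of TICL's selection step is a kNN search in a text-embedding space: pick the support examples whose transcripts are closest to a first-pass hypothesis of the query utterance. The sketch below assumes a sentence-transformers encoder and cosine similarity; both are illustrative choices, not necessarily the paper's.

```python
# Hedged sketch of text-embedding kNN example selection for in-context ASR.
# The embedding model and the use of a first-pass hypothesis are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def select_examples(query_hypothesis: str, support_transcripts: list[str], k: int = 4):
    """Return indices of the k support examples nearest to the query in text space."""
    q = embedder.encode([query_hypothesis], normalize_embeddings=True)
    s = embedder.encode(support_transcripts, normalize_embeddings=True)
    sims = (s @ q.T).ravel()      # cosine similarity (embeddings are normalized)
    return np.argsort(-sims)[:k]  # indices of the k most similar transcripts
```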
- Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction [87.49303116989708]
We explore the potential of pre-trained speech-language models (PSLMs) and pre-trained language models (PLMs) as auxiliary knowledge sources for audio-visual target speech extraction (AV-TSE). In this study, we propose incorporating linguistic constraints from PSLMs or PLMs as additional supervision signals for the AV-TSE model. Without any extra computational cost during inference, the proposed approach consistently improves speech quality and intelligibility.
arXiv Detail & Related papers (2025-06-11T14:36:26Z)
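One common way to realize "linguistic constraints as additional supervision" is an auxiliary distillation-style loss from a frozen speech-language model; the sketch below shows that pattern. The `pslm` callable, the L1 signal loss, the KL distance, and the weight `alpha` are all assumptions for illustration, not the paper's exact objective.

```python
# Hedged sketch: auxiliary linguistic-constraint loss on top of a signal loss.
# `pslm` is a hypothetical frozen speech-language model returning frame logits.
import torch
import torch.nn.functional as F

def tse_loss(extracted, clean, pslm, alpha=0.1):
    signal_loss = F.l1_loss(extracted, clean)       # signal-level objective
    with torch.no_grad():
        teacher = pslm(clean).log_softmax(dim=-1)   # posteriors on clean target
    student = pslm(extracted).log_softmax(dim=-1)   # posteriors on extraction
    linguistic_loss = F.kl_div(student, teacher, log_target=True,
                               reduction="batchmean")
    return signal_loss + alpha * linguistic_loss
```

Because the constraint is only a training-time loss, nothing changes at inference, which matches the abstract's "no extra computational cost during inference" claim.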
- Residual Speech Embeddings for Tone Classification: Removing Linguistic Content to Enhance Paralinguistic Analysis [2.0499240875882]
We introduce a method for disentangling paralinguistic features from linguistic content by regressing speech embeddings onto their corresponding text embeddings. We evaluate this approach across multiple self-supervised speech embeddings, demonstrating that residual embeddings significantly improve tone classification performance. These findings highlight the potential of residual embeddings for applications in sentiment analysis, speaker characterization, and paralinguistic speech processing.
arXiv Detail & Related papers (2025-02-26T18:32:15Z)
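The residual idea in this abstract is concrete enough to sketch directly: fit a map from text embeddings to speech embeddings and keep what the map cannot explain. The ridge penalty and purely linear form below are assumptions for illustration; the paper may use a different regressor.

```python
# Hedged sketch of residual speech embeddings: regress speech embeddings onto
# text embeddings, then subtract the prediction. Ridge is an illustrative choice.
import numpy as np
from sklearn.linear_model import Ridge

def residual_embeddings(speech_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """speech_emb: (n, d_s); text_emb: (n, d_t), paired per utterance."""
    reg = Ridge(alpha=1.0).fit(text_emb, speech_emb)
    predicted = reg.predict(text_emb)   # the part explained by linguistic content
    return speech_emb - predicted       # residual: paralinguistic remainder
```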
- Roadmap towards Superhuman Speech Understanding using Large Language Models [60.57947401837938]
Large language models (LLMs) increasingly integrate speech and audio data.
Recent advances, such as GPT-4o, highlight the potential for end-to-end speech LLMs.
We propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models.
arXiv Detail & Related papers (2024-10-17T06:44:06Z)
- DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546]
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities.
Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks.
These findings highlight the potential to reshape instruction-following SLMs by incorporating descriptive, rich speech captions.
arXiv Detail & Related papers (2024-06-27T03:52:35Z)
- Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts [14.999359332108767]
We propose DistilWhisper to bridge the performance gap in ASR for under-represented languages.
Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2.
Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters.
arXiv Detail & Related papers (2023-11-02T08:37:30Z)
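The distillation half of the DistilWhisper recipe above can be sketched as a standard soft-target objective from whisper-large-v2 to the student. The temperature, mixing weight, and omission of the language-specific expert routing are simplifications for illustration, not the paper's exact loss.

```python
# Hedged sketch of the distillation objective: soften the teacher's token
# distribution and mix the KD term with the usual ASR cross-entropy.
# Temperature T and weight lam are illustrative; expert routing is omitted.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    """student/teacher logits: (batch, seq, vocab); labels: (batch, seq)."""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits.transpose(1, 2), labels,
                         ignore_index=-100)
    return lam * kd + (1.0 - lam) * ce
```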
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
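The core of ILS-SSL, as summarized above, is to apply the SSL loss at intermediate layers as well as at the top; the sketch below shows that shape. The chosen layer set, equal weighting, and the generic `head`/`ssl_loss` callables are assumptions for illustration.

```python
# Hedged sketch of intermediate-layer supervision: the same masked-prediction
# loss is applied at several layers, not only the final one. Layer indices and
# equal weighting are illustrative assumptions.
def ils_ssl_loss(hidden_states, targets, mask, head, ssl_loss, layers=(4, 8, 12)):
    """hidden_states: list of (batch, time, dim) tensors, one per layer;
    mask: boolean (batch, time) marking masked frames; head maps dim -> logits."""
    total = 0.0
    for layer in layers:
        logits = head(hidden_states[layer][mask])  # predict only masked frames
        total = total + ssl_loss(logits, targets[mask])
    return total / len(layers)
```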
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
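As a hedged sketch of "unifying" an acoustic model with a language model end to end, the module below feeds projected wav2vec 2.0 features into BERT as input embeddings. The simple linear bridge is an illustrative assumption; Wav-BERT's actual fusion mechanism is more elaborate.

```python
# Hedged sketch: wav2vec 2.0 features bridged into BERT via a linear projection.
# This is a generic coupling pattern, not Wav-BERT's exact architecture.
import torch
from transformers import Wav2Vec2Model, BertModel

class AcousticLinguisticEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.acoustic = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.linguistic = BertModel.from_pretrained("bert-base-uncased")
        self.bridge = torch.nn.Linear(self.acoustic.config.hidden_size,
                                      self.linguistic.config.hidden_size)

    def forward(self, waveform):  # waveform: (batch, samples) at 16 kHz
        feats = self.acoustic(waveform).last_hidden_state  # (batch, frames, dim)
        return self.linguistic(inputs_embeds=self.bridge(feats)).last_hidden_state
```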
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.