One Whisper to Grade Them All
- URL: http://arxiv.org/abs/2507.17918v1
- Date: Wed, 23 Jul 2025 20:31:40 GMT
- Title: One Whisper to Grade Them All
- Authors: Nhan Phan, Anusha Porwal, Yaroslav Getman, Ekaterina Voskoboinik, Tamás Grósz, Mikko Kurimo,
- Abstract summary: We present an efficient end-to-end approach for holistic Automatic Speaking Assessment (ASA) of multi-part second-language tests.<n>Our system's main novelty is the ability to process all four spoken responses with a single Whisper-small encoder.<n>This architecture removes the need for transcription and per-part models, cuts inference time, and makes ASA practical for large-scale Computer-Assisted Language Learning systems.
- Score: 10.035434464829958
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an efficient end-to-end approach for holistic Automatic Speaking Assessment (ASA) of multi-part second-language tests, developed for the 2025 Speak & Improve Challenge. Our system's main novelty is the ability to process all four spoken responses with a single Whisper-small encoder, combine all information via a lightweight aggregator, and predict the final score. This architecture removes the need for transcription and per-part models, cuts inference time, and makes ASA practical for large-scale Computer-Assisted Language Learning systems. Our system achieved a Root Mean Squared Error (RMSE) of 0.384, outperforming the text-based baseline (0.44) while using at most 168M parameters (about 70% of Whisper-small). Furthermore, we propose a data sampling strategy, allowing the model to train on only 44.8% of the speakers in the corpus and still reach 0.383 RMSE, demonstrating improved performance on imbalanced classes and strong data efficiency.
Related papers
- NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025 [24.056321452209666]
This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I)<n>We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies.
arXiv Detail & Related papers (2025-06-16T10:28:27Z) - Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance.<n>Existing direct preference learning algorithms are originally designed for the single-turn chat task.<n>We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z) - This Paper Had the Smartest Reviewers -- Flattery Detection Utilising an Audio-Textual Transformer-Based Approach [42.27824690168642]
Flattery is an important aspect of human communication that facilitates social bonding, shapes perceptions, and influences behavior through strategic compliments and praise.
We present a novel audio textual dataset comprising 20 hours of speech and train machine learning models for automatic flattery detection.
arXiv Detail & Related papers (2024-06-25T15:57:02Z) - Strategies for improving low resource speech to text translation relying
on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST)
We conducted experiments on both simulated and real-low resource setups, on language pairs English - Portuguese, and Tamasheq - French respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z) - From English to More Languages: Parameter-Efficient Model Reprogramming
for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter
Encoders for Natural Language Understanding Systems [63.713297451300086]
We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B.
Their subsequent distillation into smaller models ranging from 17M-170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system.
arXiv Detail & Related papers (2022-06-15T20:44:23Z) - Low Resource German ASR with Untranscribed Data Spoken by Non-native
Children -- INTERSPEECH 2021 Shared Task SPAPL System [19.435571932141364]
This paper describes the SPAPL system for the INTERSPEECH 2021 Challenge: Shared Task on Automatic Speech Recognition for Non-Native Children's Speech in German.
5 hours of transcribed data and 60 hours of untranscribed data are provided to develop a German ASR system for children.
For the training of the transcribed data, we propose a non-speech state discriminative loss (NSDL) to mitigate the influence of long-duration non-speech segments within speech utterances.
Our system achieves a word error rate (WER) of 39.68% on the evaluation data,
arXiv Detail & Related papers (2021-06-18T07:36:26Z) - Multilingual Speech Translation with Efficient Finetuning of Pretrained
Models [82.22294901727933]
A minimalistic LNA (LayerNorm and Attention) finetuning can achieve zero-shot crosslingual and cross-modality transfer ability.
Our approach demonstrates strong zero-shot performance in a many-to-many multilingual model.
arXiv Detail & Related papers (2020-10-24T08:15:08Z) - You Do Not Need More Data: Improving End-To-End Speech Recognition by
Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z) - Speech2Phone: A Novel and Efficient Method for Training Speaker
Recognition Models [0.697505859813853]
We present an efficient method for training models for speaker recognition using small or under-resourced datasets.
This method requires less data than other SOTA (State-Of-The-Art) methods, while achieving similar results to those methods.
arXiv Detail & Related papers (2020-02-25T22:59:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.