The NTNU System at the S&I Challenge 2025 SLA Open Track
- URL: http://arxiv.org/abs/2506.05121v2
- Date: Thu, 11 Sep 2025 16:44:28 GMT
- Title: The NTNU System at the S&I Challenge 2025 SLA Open Track
- Authors: Hong-Yun Lin, Tien-Hong Lo, Yu-Hsuan Fang, Jhen-Ke Lin, Chung-Chun Wang, Hao-Chien Lu, Berlin Chen
- Abstract summary: We propose a system that integrates W2V with Phi-4 multimodal large language model (MLLM) through a score fusion strategy. The proposed system achieves a root mean square error (RMSE) of 0.375 on the official test set of the Speak & Improve Challenge 2025. For comparison, the RMSEs of the top-ranked, third-ranked, and official baseline systems are 0.364, 0.384, and 0.444, respectively.
- Score: 10.11220261280201
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A recent line of research on spoken language assessment (SLA) employs neural models such as BERT and wav2vec 2.0 (W2V) to evaluate speaking proficiency across linguistic and acoustic modalities. Although both models effectively capture features relevant to oral competence, each exhibits modality-specific limitations. BERT-based methods rely on ASR transcripts, which often fail to capture prosodic and phonetic cues for SLA. In contrast, W2V-based methods excel at modeling acoustic features but lack semantic interpretability. To overcome these limitations, we propose a system that integrates W2V with Phi-4 multimodal large language model (MLLM) through a score fusion strategy. The proposed system achieves a root mean square error (RMSE) of 0.375 on the official test set of the Speak & Improve Challenge 2025, securing second place in the competition. For comparison, the RMSEs of the top-ranked, third-ranked, and official baseline systems are 0.364, 0.384, and 0.444, respectively.
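The abstract names a score fusion strategy and reports results as RMSE, but does not spell out how the fusion is computed. The following is a minimal sketch, assuming a simple weighted average of two per-utterance scores; the function names `fuse_scores` and `rmse`, the `weight` parameter, and all numbers are hypothetical illustrations, not the authors' implementation or data.

```python
import math


def fuse_scores(w2v_scores, mllm_scores, weight=0.5):
    """Weighted-average late fusion (an assumed form, not taken from the paper).

    weight is the share given to the W2V score; (1 - weight) goes to the MLLM score.
    """
    if len(w2v_scores) != len(mllm_scores):
        raise ValueError("score lists must have the same length")
    return [weight * a + (1.0 - weight) * b
            for a, b in zip(w2v_scores, mllm_scores)]


def rmse(predictions, targets):
    """Root mean square error between predicted and reference scores."""
    if not predictions or len(predictions) != len(targets):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Toy, made-up scores for three utterances (not challenge data).
w2v = [3.0, 4.0, 5.0]
mllm = [4.0, 4.0, 3.0]
gold = [3.5, 4.0, 4.5]

fused = fuse_scores(w2v, mllm, weight=0.5)
print("fused scores:", fused)
print("RMSE of W2V alone:", round(rmse(w2v, gold), 3))
print("RMSE of fused:", round(rmse(fused, gold), 3))
```

In practice the fusion weight would be tuned on a held-out development set rather than fixed at 0.5; that choice is also an assumption here.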
Related papers
- An Evaluation Study of Hybrid Methods for Multilingual PII Detection [0.026059379504241156]
We present RECAP, a framework that combines deterministic regular expressions with context-aware large language models (LLMs) for scalable PII detection. Our system outperforms fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score. This work offers a scalable and adaptable solution for efficient PII detection in compliance-focused applications.
arXiv Detail & Related papers (2025-10-08T21:03:59Z)
- The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties [107.57160730151975]
We construct a new test suite that consists of data from 200+ languages, accents, and dialects to evaluate SOTA multilingual speech models. The best-performing submission achieved an absolute improvement in LID accuracy of 23% and a reduction in CER of 18%. On accented and dialectal data, the best submission obtained 30.2% lower CER and 15.7% higher LID accuracy.
arXiv Detail & Related papers (2025-09-08T18:42:36Z)
- SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work [87.9341538630949]
The first Sign Language Production Challenge was held as part of the third SLRTP Workshop at CVPR 2025. The competition's aims are to evaluate architectures that translate from spoken language sentences to a sequence of skeleton poses. This paper presents the challenge design and the winning methodologies.
arXiv Detail & Related papers (2025-08-09T11:57:33Z)
- MLLP-VRAIN UPV system for the IWSLT 2025 Simultaneous Speech Translation Translation task [7.247809853198223]
This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2025 Simultaneous Speech Translation track. Our submission addresses the unique challenges of real-time translation of long-form speech by developing a modular cascade system.
arXiv Detail & Related papers (2025-06-23T16:44:01Z)
- KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization [57.08591486199925]
This paper presents KIT's submissions to the IWSLT 2025 low-resource track. We develop both cascaded systems and end-to-end (E2E) Speech Translation systems. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently.
arXiv Detail & Related papers (2025-05-26T08:38:02Z)
- Automatic Proficiency Assessment in L2 English Learners [51.652753736780205]
Second language proficiency (L2) in English is usually perceptually evaluated by English teachers or expert evaluators. This paper explores deep learning techniques for comprehensive L2 proficiency assessment, addressing both the speech signal and its correspondent transcription.
arXiv Detail & Related papers (2025-05-05T12:36:03Z)
- Non-native Children's Automatic Speech Assessment Challenge (NOCASA) [15.921285405887009]
"NOCASA" is a data competition held as part of the IEEE MLSP 2025 conference. It challenges participants to develop systems that can assess single-word pronunciations of young second language (L2) learners. We provide pseudo-anonymized training data (TeflonNorL2) containing 10,334 recordings from 44 speakers attempting to pronounce 205 distinct Norwegian words.
arXiv Detail & Related papers (2025-04-29T11:59:08Z)
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-regularized spectral basis embedding (SBE) features that exploit a special regularization term to enforce homogeneity of speaker features in adaptation.
Feature-based learning hidden unit contributions (f-LHUC) that are conditioned on VR-LH features that are shown to be insensitive to speaker-level data quantity in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition [71.87998918300806]
This paper explores approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems.
TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models consistently outperform standalone fine-tuned SSL pre-trained models.
Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.
arXiv Detail & Related papers (2024-07-03T08:33:39Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition [57.31233839489528]
This paper explores approaches to integrate domain adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition.
arXiv Detail & Related papers (2023-02-28T13:39:17Z)
- Nonwords Pronunciation Classification in Language Development Tests for Preschool Children [7.224391516694955]
This work aims to automatically evaluate whether the language development of children is age-appropriate.
In this work, the task is to determine whether spoken nonwords have been uttered correctly.
We compare different approaches that are motivated to model specific language structures.
arXiv Detail & Related papers (2022-06-16T10:19:47Z)
- ON-TRAC Consortium Systems for the IWSLT 2022 Dialect and Low-resource Speech Translation Tasks [8.651248939672769]
This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2022: low-resource and dialect speech translation.
We build an end-to-end model as our joint primary submission, and compare it against cascaded models that leverage a large fine-tuned wav2vec 2.0 model for ASR.
Our results highlight that self-supervised models trained on smaller sets of target data are more effective for low-resource end-to-end ST fine-tuning than large off-the-shelf models.
arXiv Detail & Related papers (2022-05-04T10:36:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.