Multi-task Pretraining for Enhancing Interpretable L2 Pronunciation Assessment
- URL: http://arxiv.org/abs/2509.16876v1
- Date: Sun, 21 Sep 2025 02:04:52 GMT
- Title: Multi-task Pretraining for Enhancing Interpretable L2 Pronunciation Assessment
- Authors: Jiun-Ting Li, Bi-Cheng Yan, Yi-Cheng Wang, Berlin Chen
- Abstract summary: Automatic pronunciation assessment (APA) analyzes second-language (L2) learners' speech by providing fine-grained pronunciation feedback. Most existing efforts on APA typically adopt segmental-level features as inputs and predict pronunciation scores at different granularities. We introduce multi-task pretraining (MTP) for APA, a simple yet effective strategy that attempts to capture long-term temporal pronunciation cues.
- Score: 21.12585023191302
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Automatic pronunciation assessment (APA) analyzes second-language (L2) learners' speech by providing fine-grained pronunciation feedback at various linguistic levels. Most existing efforts on APA typically adopt segmental-level features as inputs and predict pronunciation scores at different granularities via hierarchical (or parallel) pronunciation modeling. This, however, inevitably causes assessments across linguistic levels (e.g., phone, word, and utterance) to rely solely on phoneme-level pronunciation features, nearly sidelining supra-segmental pronunciation cues. To address this limitation, we introduce multi-task pretraining (MTP) for APA, a simple yet effective strategy that attempts to capture long-term temporal pronunciation cues while strengthening the intrinsic structures within an utterance via the objective of reconstructing input features. Specifically, for a phoneme-level encoder of an APA model, the proposed MTP strategy randomly masks segmental-level pronunciation features and reconstructs the masked ones based on their surrounding pronunciation context. Furthermore, current APA systems lack integration with automated speaking assessment (ASA), limiting holistic proficiency evaluation. Drawing on empirical studies and prior knowledge in ASA, our framework bridges this gap by incorporating handcrafted features (HCFs), such as fluency (speech rate, silence duration) and stress (pitch accent strength), derived from human-designed formulas via regressors to generate interpretable proficiency scores. Experiments on speechocean762 show improved pronunciation scoring and ASA proficiency correlation, enabling targeted training and comprehensive proficiency assessment.
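To make the abstract's two components concrete, below is a minimal, self-contained PyTorch sketch (not the authors' implementation): the first module masks randomly chosen segmental-level feature vectors and reconstructs them from the surrounding pronunciation context, mirroring the MTP objective; the second regresses an interpretable proficiency score from handcrafted features such as speech rate, silence duration, and pitch-accent strength. All class names, dimensions, and the masking rate are illustrative assumptions.

```python
# Hedged sketch of the two ideas described in the abstract; names and sizes are assumptions.
import torch
import torch.nn as nn

class MaskedReconstructionPretrainer(nn.Module):
    """MTP-style pretraining sketch: randomly mask phone-level feature vectors
    and reconstruct them from the surrounding pronunciation context."""

    def __init__(self, feat_dim=80, d_model=256, n_layers=4, n_heads=4, mask_prob=0.15):
        super().__init__()
        self.mask_prob = mask_prob
        self.input_proj = nn.Linear(feat_dim, d_model)
        self.mask_embed = nn.Parameter(torch.zeros(d_model))  # learnable [MASK] vector
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.reconstruct_head = nn.Linear(d_model, feat_dim)

    def forward(self, segment_feats):
        # segment_feats: (batch, num_phones, feat_dim) segmental-level pronunciation features
        x = self.input_proj(segment_feats)
        mask = torch.rand(x.shape[:2], device=x.device) < self.mask_prob
        x = torch.where(mask.unsqueeze(-1), self.mask_embed.expand_as(x), x)
        hidden = self.encoder(x)                      # contextualized phone-level states
        recon = self.reconstruct_head(hidden)
        # reconstruction loss computed only on the masked positions
        loss = ((recon - segment_feats) ** 2)[mask].mean()
        return loss, hidden

class HCFProficiencyRegressor(nn.Module):
    """Regress an interpretable proficiency score from handcrafted features (HCFs)."""

    def __init__(self, n_hcf=3, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_hcf, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, hcf):
        # hcf: (batch, n_hcf), e.g. [speech_rate, silence_duration, pitch_accent_strength]
        return self.mlp(hcf).squeeze(-1)

# toy usage
feats = torch.randn(2, 30, 80)   # 2 utterances, 30 phone segments each
loss, _ = MaskedReconstructionPretrainer()(feats)
hcf = torch.tensor([[4.2, 0.35, 0.8], [3.1, 0.60, 0.5]])
score = HCFProficiencyRegressor()(hcf)
```

In the paper's framework, an encoder pretrained this way would initialize the phoneme-level encoder of the APA model before fine-tuning on scored data, while the HCF regressors supply the interpretable proficiency estimates mentioned above.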
Related papers
- PRiSM: Benchmarking Phone Realization in Speech Models [70.82595415252682]
Phone recognition (PR) serves as the atomic, language-agnostic interface for cross-lingual speech processing and phonetic analysis. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception.
arXiv Detail & Related papers (2026-01-20T15:00:36Z) - WESR: Scaling and Evaluating Word-level Event-Speech Recognition [59.21814194620928]
Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. We develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol.
arXiv Detail & Related papers (2026-01-08T02:23:21Z) - Multi-granularity Interactive Attention Framework for Residual Hierarchical Pronunciation Assessment [18.97451964522765]
We propose a novel residual hierarchical interactive method, HIA, that enables bidirectional modeling across granularities. We also propose a residual hierarchical structure to alleviate the feature-forgetting problem when modeling acoustic hierarchies. Our model consistently outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2026-01-05T02:43:04Z) - MuFFIN: Multifaceted Pronunciation Feedback Model with Interactive Hierarchical Neural Modeling [14.953695326450001]
We introduce MuFFIN, a Multi-Faceted pronunciation Feedback model with an Interactive hierarchical Neural architecture. To better capture the nuanced distinctions between phonemes in the feature space, a novel phoneme-contrastive ordinal regularization mechanism is put forward. We also design a simple yet effective training objective, specifically tailored to perturb the outputs of a phoneme with phoneme-specific variations.
arXiv Detail & Related papers (2025-10-06T15:54:55Z) - Session-Level Spoken Language Assessment with a Multimodal Foundation Model via Multi-Target Learning [8.717610965852037]
Spoken Language Assessment (SLA) estimates a learner's oral proficiency from spontaneous speech. This paper introduces a novel multimodal foundation model approach that performs session-level evaluation in a single pass.
arXiv Detail & Related papers (2025-09-19T14:33:05Z) - Towards Inclusive Communication: A Unified LLM-Based Framework for Sign Language, Lip Movements, and Audio Understanding [52.859261069569165]
We introduce the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or better than state-of-the-art models specialized for individual tasks.
arXiv Detail & Related papers (2025-08-28T06:51:42Z) - Segmentation-free Goodness of Pronunciation [8.859671910259244]
Mispronunciation detection and diagnosis (MDD) is a significant component of modern computer-aided language learning (CALL) systems. Most systems are based on a form of goodness of pronunciation (GOP), which requires pre-segmentation of speech into phonetic units. In this study, we first propose self-alignment GOP (GOP-SA), which enables the use of CTC-trained ASR models for MDD. Next, we define a more general alignment-free method that takes all possible alignments of the target phoneme into account (GOP-AF).
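For context, the classical GOP score (the standard formulation, not taken from this paper) rates a phone $p$ by averaging frame-level log posteriors over the frames aligned to it, roughly $\mathrm{GOP}(p) = \frac{1}{|\mathcal{T}_p|}\sum_{t \in \mathcal{T}_p} \log P(p \mid \mathbf{o}_t)$; the dependence on the aligned frame set $\mathcal{T}_p$ is the pre-segmentation requirement that GOP-SA and GOP-AF aim to relax.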
arXiv Detail & Related papers (2025-07-18T04:00:58Z) - Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction [87.49303116989708]
We explore the potential of pre-trained speech-language models (PSLMs) and pre-trained language models (PLMs) as auxiliary knowledge sources for AV-TSE. In this study, we propose incorporating the linguistic constraints from PSLMs or PLMs for the AV-TSE model as additional supervision signals. Without any extra computational cost during inference, the proposed approach consistently improves speech quality and intelligibility.
arXiv Detail & Related papers (2025-06-11T14:36:26Z) - Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z) - Hierarchical Pronunciation Assessment with Multi-Aspect Attention [3.6825890616838066]
We propose a Hierarchical Pronunciation Assessment with Multi-aspect Attention (HiPAMA) model.
HiPAMA hierarchically represents the granularity levels to directly capture their linguistic structures and introduces multi-aspect attention.
Remarkable improvements in the experimental results on the speechocean762 dataset demonstrate the robustness of HiPAMA.
arXiv Detail & Related papers (2022-11-15T12:49:35Z) - Prompting Language Models for Linguistic Structure [73.11488464916668]
We present a structured prompting approach for linguistic structured prediction tasks.
We evaluate this approach on part-of-speech tagging, named entity recognition, and sentence chunking.
We find that while PLMs contain significant prior knowledge of task labels due to task leakage into the pretraining corpus, structured prompting can also retrieve linguistic structure with arbitrary labels.
arXiv Detail & Related papers (2022-11-15T01:13:39Z) - Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource
End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.