Related papers: Multi-granularity Interactive Attention Framework for Residual Hierarchical Pronunciation Assessment

Multi-granularity Interactive Attention Framework for Residual Hierarchical Pronunciation Assessment

URL: http://arxiv.org/abs/2601.01745v1
Date: Mon, 05 Jan 2026 02:43:04 GMT
Title: Multi-granularity Interactive Attention Framework for Residual Hierarchical Pronunciation Assessment
Authors: Hong Han, Hao-Chen Pei, Zhao-Zheng Nie, Xin Luo, Xin-Shun Xu,
Abstract summary: We propose a novel residual hierarchical interactive method, HIA, that enables bidirectional modeling across granularities.<n>We also propose a residual hierarchical structure to alleviate the feature forgetting problem when modeling acoustic hierarchies.<n>Our model is comprehensively ahead of the existing state-of-the-art methods.
Score: 18.97451964522765
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automatic pronunciation assessment plays a crucial role in computer-assisted pronunciation training systems. Due to the ability to perform multiple pronunciation tasks simultaneously, multi-aspect multi-granularity pronunciation assessment methods are gradually receiving more attention and achieving better performance than single-level modeling tasks. However, existing methods only consider unidirectional dependencies between adjacent granularity levels, lacking bidirectional interaction among phoneme, word, and utterance levels and thus insufficiently capturing the acoustic structural correlations. To address this issue, we propose a novel residual hierarchical interactive method, HIA for short, that enables bidirectional modeling across granularities. As the core of HIA, the Interactive Attention Module leverages an attention mechanism to achieve dynamic bidirectional interaction, effectively capturing linguistic features at each granularity while integrating correlations between different granularity levels. We also propose a residual hierarchical structure to alleviate the feature forgetting problem when modeling acoustic hierarchies. In addition, we use 1-D convolutional layers to enhance the extraction of local contextual cues at each granularity. Extensive experiments on the speechocean762 dataset show that our model is comprehensively ahead of the existing state-of-the-art methods.

Related papers

U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation [48.6868174403074]
We introduce U-Mind, the first unified system for high-intelligence multimodal dialogue.<n>It supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop.<n>We show that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks.
arXiv Detail & Related papers (2026-02-27T07:07:02Z)
Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback [82.70507055599093]
We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models.<n>Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness.
arXiv Detail & Related papers (2026-01-27T00:55:14Z)
MuFFIN: Multifaceted Pronunciation Feedback Model with Interactive Hierarchical Neural Modeling [14.953695326450001]
We introduce MuFFIN, a Multi-Faceted pronunciation Feedback model with an Interactive hierarchical Neural architecture.<n>To better capture the nuanced distinctions between phonemes in the feature space, a novel phoneme-contrastive ordinal regularization mechanism is put forward.<n>We design a simple yet effective training objective, which is specifically tailored to perturb the outputs of a phoneme with the phoneme-specific variations.
arXiv Detail & Related papers (2025-10-06T15:54:55Z)
Mitigating Attention Hacking in Preference-Based Reward Modeling via Interaction Distillation [62.14692332209628]
"Interaction Distillation" is a novel training framework for more adequate preference modeling through attention-level optimization.<n>It provides more stable and generalizable reward signals compared to state-of-the-art RM optimization methods.
arXiv Detail & Related papers (2025-08-04T17:06:23Z)
Boosting Neural Language Inference via Cascaded Interactive Reasoning [38.125341836302525]
Natural Language Inference (NLI) focuses on ascertaining the logical relationship between a given premise and hypothesis.<n>This task presents significant challenges due to inherent linguistic features such as diverse phrasing, semantic complexity, and contextual nuances.<n>We introduce the Cascaded Interactive Reasoning Network (CIRN), a novel architecture designed for deeper semantic comprehension in NLI.
arXiv Detail & Related papers (2025-05-10T11:37:15Z)
Multi-Modal Self-Supervised Semantic Communication [52.76990720898666]
We propose a multi-modal semantic communication system that leverages multi-modal self-supervised learning to enhance task-agnostic feature extraction.<n>The proposed approach effectively captures both modality-invariant and modality-specific features while minimizing training-related communication overhead.<n>The findings underscore the advantages of multi-modal self-supervised learning in semantic communication, paving the way for more efficient and scalable edge inference systems.
arXiv Detail & Related papers (2025-03-18T06:13:02Z)
Revisiting Conversation Discourse for Dialogue Disentanglement [88.3386821205896]
We propose enhancing dialogue disentanglement by taking full advantage of the dialogue discourse characteristics. We develop a structure-aware framework to integrate the rich structural features for better modeling the conversational semantic context. Our work has great potential to facilitate broader multi-party multi-thread dialogue applications.
arXiv Detail & Related papers (2023-06-06T19:17:47Z)
Hierarchical Pronunciation Assessment with Multi-Aspect Attention [3.6825890616838066]
We propose a Hierarchical Pronunciation Assessment with Multi-aspect Attention (HiPAMA) model. HiPAMA hierarchically represents the granularity levels to directly capture their linguistic structures and introduces multi-aspect attention. Remarkable improvements in the experimental results on the speachocean datasets demonstrate the robustness of HiPAMA.
arXiv Detail & Related papers (2022-11-15T12:49:35Z)
HiCLRE: A Hierarchical Contrastive Learning Framework for Distantly Supervised Relation Extraction [24.853265244512954]
We propose a hierarchical contrastive learning Framework for DistantlySupervised relation extraction (HiCLRE) to reduce noisy sentences. Specifically, we propose a three-level hierarchical learning framework to interact with cross levels, generating the de-noising context-aware representations. Experiments demonstrate that HiCLRE significantly outperforms strong baselines in various mainstream DSRE datasets.
arXiv Detail & Related papers (2022-02-27T12:48:26Z)
Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for temporal sentence grounding task. We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs. We also devise a calibration module following each attention module to refine the alignment knowledge.
arXiv Detail & Related papers (2021-09-14T02:08:23Z)
Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to under-stand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels. We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects. The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z)
Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding. At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network. With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.