Related papers: Hierarchical Pronunciation Assessment with Multi-Aspect Attention

Hierarchical Pronunciation Assessment with Multi-Aspect Attention

URL: http://arxiv.org/abs/2211.08102v2
Date: Fri, 26 May 2023 06:06:53 GMT
Title: Hierarchical Pronunciation Assessment with Multi-Aspect Attention
Authors: Heejin Do, Yunsu Kim, Gary Geunbae Lee
Abstract summary: We propose a Hierarchical Pronunciation Assessment with Multi-aspect Attention (HiPAMA) model. HiPAMA hierarchically represents the granularity levels to directly capture their linguistic structures and introduces multi-aspect attention. Remarkable improvements in the experimental results on the speachocean datasets demonstrate the robustness of HiPAMA.
Score: 3.6825890616838066
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automatic pronunciation assessment is a major component of a computer-assisted pronunciation training system. To provide in-depth feedback, scoring pronunciation at various levels of granularity such as phoneme, word, and utterance, with diverse aspects such as accuracy, fluency, and completeness, is essential. However, existing multi-aspect multi-granularity methods simultaneously predict all aspects at all granularity levels; therefore, they have difficulty in capturing the linguistic hierarchy of phoneme, word, and utterance. This limitation further leads to neglecting intimate cross-aspect relations at the same linguistic unit. In this paper, we propose a Hierarchical Pronunciation Assessment with Multi-aspect Attention (HiPAMA) model, which hierarchically represents the granularity levels to directly capture their linguistic structures and introduces multi-aspect attention that reflects associations across aspects at the same level to create more connotative representations. By obtaining relational information from both the granularity- and aspect-side, HiPAMA can take full advantage of multi-task learning. Remarkable improvements in the experimental results on the speachocean762 datasets demonstrate the robustness of HiPAMA, particularly in the difficult-to-assess aspects.

Related papers

Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework.<n>We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale.<n>Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z)
Multi-granularity Interactive Attention Framework for Residual Hierarchical Pronunciation Assessment [18.97451964522765]
We propose a novel residual hierarchical interactive method, HIA, that enables bidirectional modeling across granularities.<n>We also propose a residual hierarchical structure to alleviate the feature forgetting problem when modeling acoustic hierarchies.<n>Our model is comprehensively ahead of the existing state-of-the-art methods.
arXiv Detail & Related papers (2026-01-05T02:43:04Z)
Multi-task Pretraining for Enhancing Interpretable L2 Pronunciation Assessment [21.12585023191302]
Automatic pronunciation assessment (APA) analyzes second-language (L2) learners' speech by providing fine-grained pronunciation feedback.<n>Most existing efforts on APA typically adopt segmental-level features as inputs and predict pronunciation scores at different granularities.<n>We introduce multi-task pretraining (MTP) for APA, a simple yet effective strategy that attempts to capture long-term temporal pronunciation cues.
arXiv Detail & Related papers (2025-09-21T02:04:52Z)
MultiVox: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions [70.93364531054273]
We introduce MultiVox, the first benchmark to evaluate the ability of voice assistants to integrate spoken and visual cues.<n>Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features.<n>Our evaluation on 10 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
arXiv Detail & Related papers (2025-07-14T23:20:42Z)
Multimodal Conversation Structure Understanding [12.29827265137757]
Large language models' ability to understand fine-grained conversational structure remains underexplored.<n>We present a human annotated dataset of 4,398 annotations for speakers and reply-to relationship, 5,755 addressees, and 3,142 side-participants.<n>We evaluate popular audio-visual LLMs and vision-language models on our dataset, and our experimental results suggest that multimodal conversational structure understanding remains challenging.
arXiv Detail & Related papers (2025-05-23T06:41:54Z)
Contrastive Speaker-Aware Learning for Multi-party Dialogue Generation with LLMs [4.691083532629246]
Multi-party dialogue generation presents significant challenges due to the complex interplay of multiple speakers and interwoven conversational threads. This paper introduces Speaker-Attentive LLM (SA-LLM), a novel generative model that leverages pre-trained Large Language Models (LLMs) and a speaker-aware contrastive learning strategy to address these challenges. SA-LLM incorporates a speaker-attributed input encoding and a contrastive learning objective to implicitly learn contextual coherence and speaker roles without explicit relation annotations.
arXiv Detail & Related papers (2025-03-11T19:28:12Z)
Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information [68.89000132126536]
This work proposes to use inter-utterance linguistic information to improve the performance of prosodic structure prediction (PSP) Our method achieves better F1 scores in predicting prosodic word (PW), prosodic phrase (PPH) and intonational phrase (IPH)
arXiv Detail & Related papers (2023-08-31T09:19:15Z)
Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining? [34.609984453754656]
We aim to elucidate the impact of comprehensive linguistic knowledge, including semantic expression and syntactic structure, on multimodal alignment. Specifically, we design and release the SNARE, the first large-scale multimodal alignment probing benchmark.
arXiv Detail & Related papers (2023-08-24T16:17:40Z)
Revisiting Conversation Discourse for Dialogue Disentanglement [88.3386821205896]
We propose enhancing dialogue disentanglement by taking full advantage of the dialogue discourse characteristics. We develop a structure-aware framework to integrate the rich structural features for better modeling the conversational semantic context. Our work has great potential to facilitate broader multi-party multi-thread dialogue applications.
arXiv Detail & Related papers (2023-06-06T19:17:47Z)
Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains. Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods. This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment [10.809349710149533]
We train a Goodness Of Pronunciation feature-based Transformer (GOPT) with multi-task learning. Experiments show that GOPT achieves the best results on speechocean762 with a public automatic speech recognition (ASR) acoustic model trained on Librispeech.
arXiv Detail & Related papers (2022-05-06T18:07:44Z)
Look\&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement [18.488808141923492]
ADENet is proposed to achieve target speaker detection and speech enhancement with joint learning of audio-visual modeling. Cross-modal relationship between auditory and visual stream is a promising solution for the challenge of audio-visual multi-task learning.
arXiv Detail & Related papers (2022-03-04T09:53:19Z)
MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs are within two feature spaces. We propose Multi-Granularity Alignment architecture for Visual Question Answering task (MGA-VQA) Our model splits alignment into different levels to achieve learning better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
Learning to Select Context in a Hierarchical and Global Perspective for Open-domain Dialogue Generation [15.01710843286394]
We propose a novel model with hierarchical self-attention mechanism and distant supervision to detect relevant words and utterances in short and long distances. Our model significantly outperforms other baselines in terms of fluency, coherence, and informativeness.
arXiv Detail & Related papers (2021-02-18T11:56:42Z)
Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP. This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations. Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z)
Masking Orchestration: Multi-task Pretraining for Multi-role Dialogue Representation Learning [50.5572111079898]
Multi-role dialogue understanding comprises a wide range of diverse tasks such as question answering, act classification, dialogue summarization etc. While dialogue corpora are abundantly available, labeled data, for specific learning tasks, can be highly scarce and expensive. In this work, we investigate dialogue context representation learning with various types unsupervised pretraining tasks.
arXiv Detail & Related papers (2020-02-27T04:36:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.