Preserving Phonemic Distinctions for Ordinal Regression: A Novel Loss
Function for Automatic Pronunciation Assessment
- URL: http://arxiv.org/abs/2310.01839v2
- Date: Wed, 4 Oct 2023 06:51:24 GMT
- Title: Preserving Phonemic Distinctions for Ordinal Regression: A Novel Loss
Function for Automatic Pronunciation Assessment
- Authors: Bi-Cheng Yan, Hsin-Wei Wang, Yi-Cheng Wang, Jiun-Ting Li, Chi-Han Lin,
Berlin Chen
- Abstract summary: We propose a phonemic contrast ordinal (PCO) loss for training regression-based APA models.
Specifically, we introduce a phoneme-distinct regularizer into the MSE loss, which encourages feature representations of different phoneme categories to be far apart.
An extensive set of experiments carried out on the speechocean762 benchmark dataset suggests the feasibility and effectiveness of our model.
- Score: 10.844822448167937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic pronunciation assessment (APA) aims to quantify the
pronunciation proficiency of a second language (L2) learner. Prevailing
approaches to APA typically leverage neural models trained with a regression
loss function, such as the mean-squared error (MSE) loss, for proficiency
level prediction. Although most regression models can effectively capture the
ordinality of proficiency levels in the feature space, they face a primary
obstacle: different phoneme categories with the same proficiency level are
inevitably forced to be close to each other, retaining less
phoneme-discriminative information. To address this, we devise a phonemic
contrast ordinal (PCO) loss for training regression-based APA models, which
aims to better preserve phonemic distinctions between phoneme categories
while also accounting for the ordinal relationships of the regression target
output. Specifically, we introduce a phoneme-distinct regularizer into the
MSE loss, which encourages feature representations of different phoneme
categories to be far apart while simultaneously pulling closer the
representations belonging to the same phoneme category by means of weighted
distances. An extensive set of experiments carried out on the speechocean762
benchmark dataset suggests the feasibility and effectiveness of our model
relative to some existing state-of-the-art models.
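The PCO objective described above augments an MSE regression loss with a regularizer that pushes representations of different phoneme categories apart and pulls same-category representations together via weighted distances. The following is a minimal PyTorch sketch of what such a combined objective could look like; the function name, the margin, the exponential score-based weighting, and the batch layout are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of an MSE loss plus a phoneme-distinct regularizer.
# Margin, weighting scheme, and batch layout are illustrative assumptions.
import torch
import torch.nn.functional as F

def pco_style_loss(features, phoneme_ids, pred_scores, true_scores,
                   margin=1.0, reg_weight=0.1):
    """features:    (B, D) phoneme-level representations
    phoneme_ids: (B,)   integer phoneme category per segment
    pred_scores: (B,)   predicted proficiency scores
    true_scores: (B,)   annotated proficiency scores
    Assumes the batch contains both same- and different-phoneme pairs.
    """
    # Ordinal regression term: plain mean-squared error on the scores.
    mse = F.mse_loss(pred_scores, true_scores)

    # Pairwise Euclidean distances between feature representations.
    dists = torch.cdist(features, features, p=2)                   # (B, B)
    same_phoneme = phoneme_ids.unsqueeze(0) == phoneme_ids.unsqueeze(1)
    eye = torch.eye(len(phoneme_ids), dtype=torch.bool, device=features.device)

    # Pull together pairs from the same phoneme category, weighting each pair
    # by how close their proficiency scores are (closer scores -> stronger pull).
    score_gap = (true_scores.unsqueeze(0) - true_scores.unsqueeze(1)).abs()
    pull_weight = torch.exp(-score_gap)
    pull = (pull_weight * dists)[same_phoneme & ~eye].mean()

    # Push apart pairs from different phoneme categories up to a margin.
    push = F.relu(margin - dists)[~same_phoneme].mean()

    return mse + reg_weight * (pull + push)
```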
Related papers
- Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach [102.0769560460338]
We develop a simple logits retargeting approach (LORT) that requires no prior knowledge of the number of samples per class.
Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z)
- Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method [11.069975459609829]
We propose a low-level Mispronunciation Detection and Diagnosis (MDD) approach based on the detection of speech attribute features.
The proposed method was applied to L2 speech corpora collected from English learners with different native languages.
arXiv Detail & Related papers (2023-11-13T02:41:41Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt and their generative capability, LLMs can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Emphasizing Unseen Words: New Vocabulary Acquisition for End-to-End Speech Recognition [21.61242091927018]
Out-of-vocabulary (OOV) words, such as trending words and new named entities, pose problems for modern ASR systems.
We propose to generate OOV words using text-to-speech systems and to rescale losses to encourage neural networks to pay more attention to OOV words.
arXiv Detail & Related papers (2023-02-20T02:21:30Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- A Brief Study on the Effects of Training Generative Dialogue Models with a Semantic loss [37.8626106992769]
We study the effects of minimizing an alternate training objective that encourages a model to generate an alternate response and scores it on semantic similarity.
We explore this idea on two differently sized datasets for the task of next-utterance generation in goal-oriented dialogues.
arXiv Detail & Related papers (2021-06-20T04:39:29Z)
- General-Purpose Speech Representation Learning through a Self-Supervised Multi-Granularity Framework [114.63823178097402]
This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning.
Specifically, we propose to use generative learning approaches to capture fine-grained information at small time scales and use discriminative learning approaches to distill coarse-grained or semantic information at large time scales.
arXiv Detail & Related papers (2021-02-03T08:13:21Z)
- Deep F-measure Maximization for End-to-End Speech Understanding [52.36496114728355]
We propose a differentiable approximation to the F-measure and train the network with this objective using standard backpropagation (a minimal sketch of one such soft objective appears after this list).
We perform experiments on two standard fairness datasets (Adult, and Communities and Crime), as well as on speech-to-intent detection on the ATIS dataset and speech-to-image concept classification on the Speech-COCO dataset.
In all four of these tasks, the F-measure objective yields improved micro-F1 scores, with absolute improvements of up to 8%, compared to models trained with the cross-entropy loss function.
arXiv Detail & Related papers (2020-08-08T03:02:27Z)
- Analysis of Predictive Coding Models for Phonemic Representation Learning in Small Datasets [0.0]
The present study investigates the behaviour of two predictive coding models, Autoregressive Predictive Coding and Contrastive Predictive Coding, in a phoneme discrimination task.
Our experiments show a strong correlation between the autoregressive loss and the phoneme discrimination scores on the two datasets.
The CPC model shows rapid convergence already after one pass over the training data, and, on average, its representations outperform those of APC on both languages.
arXiv Detail & Related papers (2020-07-08T15:46:13Z)
- Statistical Context-Dependent Units Boundary Correction for Corpus-based Unit-Selection Text-to-Speech [1.4337588659482519]
We present an innovative speaker-adaptation technique for improving segmentation accuracy, with application to unit-selection Text-To-Speech (TTS) systems.
Unlike conventional techniques for speaker adaptation, we aim to use only context-dependent characteristics extrapolated with linguistic analysis techniques.
arXiv Detail & Related papers (2020-03-05T12:42:13Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and subsequent LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
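The Deep F-measure Maximization entry above refers to training with a differentiable approximation of the F-measure via standard backpropagation. Below is a minimal sketch of one common way such a soft F-measure can be written, assuming a binary task with sigmoid outputs; the paper's exact (multi-class) formulation may differ.

```python
# Minimal sketch of a differentiable (soft) F-measure objective for a binary
# task: hard counts are replaced by sums of predicted probabilities.
# The binary setting and smoothing constant are illustrative assumptions.
import torch

def soft_f1_loss(logits, targets, eps=1e-8):
    """logits:  (B,) raw model outputs
    targets: (B,) binary labels in {0, 1}
    Returns 1 - soft-F1, so minimizing the loss maximizes the F-measure.
    """
    probs = torch.sigmoid(logits)
    tp = (probs * targets).sum()          # soft true positives
    fp = (probs * (1 - targets)).sum()    # soft false positives
    fn = ((1 - probs) * targets).sum()    # soft false negatives
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1.0 - soft_f1
```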
This list is automatically generated from the titles and abstracts of the papers on this site.