Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment
- URL: http://arxiv.org/abs/2406.15723v1
- Date: Sat, 22 Jun 2024 03:56:29 GMT
- Title: Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment
- Authors: Heejin Do, Wonjun Lee, Gary Geunbae Lee
- Abstract summary: We propose two Acoustic Feature Mixup strategies to address data scarcity and score-label imbalances.
We integrate fine-grained error-rate features by comparing speech recognition results with the original answer phonemes, giving direct hints for mispronunciation.
- Score: 7.519788903817844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In automated pronunciation assessment, recent work increasingly emphasizes evaluating multiple aspects to provide enriched feedback. However, acquiring multi-aspect score-labeled data for non-native language learners' speech is challenging; moreover, the collected data often have imbalanced score distributions. In this paper, we propose two Acoustic Feature Mixup strategies, linearly and non-linearly interpolating with the in-batch averaged feature, to address data scarcity and score-label imbalances. Primarily using goodness-of-pronunciation as an acoustic feature, we tailor mixup designs to suit pronunciation assessment. Further, we integrate fine-grained error-rate features by comparing speech recognition results with the original answer phonemes, giving direct hints for mispronunciation. Effective mixing of the acoustic features notably enhances overall scoring performance on the speechocean762 dataset, and detailed analysis highlights the method's potential to predict unseen distortions.
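The two ideas in the abstract, interpolating each sample with the in-batch averaged feature and deriving an error-rate feature from recognized versus canonical phonemes, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed mixing weight `lam`, the choice to mix score labels with the same weight, and the use of plain Levenshtein distance for the error rate are all assumptions made for illustration. Only the linear variant is shown, since the abstract does not specify the non-linear one.

```python
def batch_mean(features):
    """Element-wise mean over a batch of equal-length feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def linear_mixup(features, scores, lam):
    """Linearly interpolate each sample with the in-batch average.

    lam = 1.0 keeps the original samples; lam = 0.0 collapses to the mean.
    Mixing the score labels with the same weight is an assumption here.
    """
    mean_feat = batch_mean(features)
    mean_score = sum(scores) / len(scores)
    mixed_feats = [
        [lam * x + (1.0 - lam) * m for x, m in zip(feat, mean_feat)]
        for feat in features
    ]
    mixed_scores = [lam * s + (1.0 - lam) * mean_score for s in scores]
    return mixed_feats, mixed_scores


def phoneme_error_rate(recognized, canonical):
    """Levenshtein distance between phoneme sequences over canonical length.

    A hypothetical stand-in for the paper's error-rate features.
    """
    prev = list(range(len(canonical) + 1))
    for i, r in enumerate(recognized, start=1):
        curr = [i]
        for j, c in enumerate(canonical, start=1):
            cost = 0 if r == c else 1
            curr.append(min(prev[j] + 1,          # extra recognized phoneme
                            curr[j - 1] + 1,      # missing canonical phoneme
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / max(len(canonical), 1)


# Toy batch: two 2-dimensional feature vectors (e.g. GOP values) with scores.
feats, labels = linear_mixup([[1.0, 2.0], [3.0, 4.0]], [0.0, 2.0], lam=0.5)
per = phoneme_error_rate(["K", "AE", "T"], ["K", "AH", "T"])
```

Pulling samples toward the batch average yields synthetic points whose labels lie between the observed scores, which is one plausible way such a mixup could populate under-represented score ranges.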
Related papers
- Score-balanced Loss for Multi-aspect Pronunciation Assessment [3.6825890616838066]
We propose a novel loss function, score-balanced loss, to address the problem caused by uneven data.
As a re-weighting approach, we assign higher costs when the predicted score is of the minority class.
We evaluate our method on the speechocean762 dataset, which has noticeably imbalanced scores for several aspects.
arXiv Detail & Related papers (2023-05-26T06:21:37Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition [14.744220870243932]
We propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing.
We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context.
Experiments using a Conformer Transducer model on the Librispeech dataset show a 4.62% - 9.26% relative WER improvement on different biasing list sizes.
arXiv Detail & Related papers (2023-05-09T08:51:44Z)
- Contrastive-mixup learning for improved speaker verification [17.93491404662201]
This paper proposes a novel formulation of prototypical loss with mixup for speaker verification.
Mixup is a simple yet efficient data augmentation technique that fabricates a weighted combination of random data point and label pairs.
arXiv Detail & Related papers (2022-02-22T05:09:22Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, the synthesis model takes the symbols as input and re-synthesizes each target speech signal.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves lower diarization error rate than the target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- An Approach to Mispronunciation Detection and Diagnosis with Acoustic, Phonetic and Linguistic (APL) Embeddings [18.282632348274756]
Phonetic embeddings, extracted from ASR models trained with a huge amount of word-level annotations, can serve as a good representation of the content of input speech.
We propose to utilize Acoustic, Phonetic and Linguistic (APL) embedding features jointly for building a more powerful MD&D system.
arXiv Detail & Related papers (2021-10-14T11:25:02Z)
- Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing [72.4445825335561]
We propose a simple method to derive 2D representation from detection scores produced by an arbitrary set of binary classifiers.
Based upon rank correlations, our method facilitates a visual comparison of classifiers with arbitrary scores.
While the approach is fully versatile and can be applied to any detection task, we demonstrate the method using scores produced by automatic speaker verification and voice anti-spoofing systems.
arXiv Detail & Related papers (2021-06-11T13:03:33Z)
- Predicting the Humorousness of Tweets Using Gaussian Process Preference Learning [56.18809963342249]
We present a probabilistic approach that learns to rank and rate the humorousness of short texts by exploiting human preference judgments and automatically sourced linguistic annotations.
We report system performance for the campaign's two subtasks, humour detection and funniness score prediction, and discuss some issues arising from the conversion between the numeric scores used in the HAHA@IberLEF 2019 data and the pairwise judgment annotations required for our method.
arXiv Detail & Related papers (2020-08-03T13:05:42Z)
- Statistical Context-Dependent Units Boundary Correction for Corpus-based Unit-Selection Text-to-Speech [1.4337588659482519]
We present an innovative technique for speaker adaptation in order to improve the accuracy of segmentation with application to unit-selection Text-To-Speech (TTS) systems.
Unlike conventional techniques for speaker adaptation, we aim to use only context dependent characteristics extrapolated with linguistic analysis techniques.
arXiv Detail & Related papers (2020-03-05T12:42:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.