Phonetic and Prosody-aware Self-supervised Learning Approach for
Non-native Fluency Scoring
- URL: http://arxiv.org/abs/2305.11438v1
- Date: Fri, 19 May 2023 05:39:41 GMT
- Title: Phonetic and Prosody-aware Self-supervised Learning Approach for
Non-native Fluency Scoring
- Authors: Kaiqi Fu, Shaojun Gao, Shuju Shi, Xiaohai Tian, Wei Li, Zejun Ma
- Abstract summary: Speech fluency/disfluency can be evaluated by analyzing a range of phonetic and prosodic features.
Deep neural networks are commonly trained to map fluency-related features to human scores.
We introduce a self-supervised learning (SSL) approach that incorporates phonetic and prosodic awareness for fluency scoring.
- Score: 13.817385516193445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech fluency/disfluency can be evaluated by analyzing a range of phonetic
and prosodic features. Deep neural networks are commonly trained to map
fluency-related features to human scores. However, the effectiveness of
deep learning-based models is constrained by the limited number of labeled
training samples. To address this, we introduce a self-supervised learning
(SSL) approach that incorporates phonetic and prosodic awareness for
fluency scoring. Specifically, we first pre-train the model using a
reconstruction loss function, by masking phones and their durations jointly on
a large amount of unlabeled speech and text prompts. We then fine-tune the
pre-trained model using human-annotated scoring data. Experiments conducted
on datasets such as Speechocean762 and our non-native datasets show
that our proposed method outperforms the baseline systems in terms of Pearson
correlation coefficients (PCC). Moreover, we conduct an ablation study to
better understand the contribution of phonetic and prosodic factors during the
pre-training stage.
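To make the pre-training objective concrete, the following is a minimal PyTorch sketch of the idea described in the abstract: phones and their durations are masked jointly at the same positions and reconstructed by a shared encoder. This is an illustrative reading of the abstract, not the authors' released code; the vocabulary size, model width, mask rate, and equal loss weighting are assumptions made for the example.

```python
# Hedged sketch of joint phone/duration masking with a reconstruction loss.
# All hyperparameters here are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn

class PhoneDurationEncoder(nn.Module):
    def __init__(self, n_phones=100, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones + 1, d_model)   # last index acts as [MASK]
        self.dur_proj = nn.Linear(1, d_model)                   # scalar duration -> embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.phone_head = nn.Linear(d_model, n_phones)          # reconstruct phone identity
        self.dur_head = nn.Linear(d_model, 1)                   # reconstruct duration

    def forward(self, phones, durations):
        x = self.phone_emb(phones) + self.dur_proj(durations.unsqueeze(-1))
        h = self.encoder(x)
        return self.phone_head(h), self.dur_head(h).squeeze(-1)

def pretrain_step(model, phones, durations, mask_rate=0.15, mask_id=100):
    """One reconstruction step: mask phones and durations at the same positions."""
    mask = torch.rand(durations.shape) < mask_rate
    phone_logits, dur_pred = model(phones.masked_fill(mask, mask_id),
                                   durations.masked_fill(mask, 0.0))
    phone_loss = nn.functional.cross_entropy(phone_logits[mask], phones[mask])
    dur_loss = nn.functional.mse_loss(dur_pred[mask], durations[mask])
    return phone_loss + dur_loss                                 # joint reconstruction loss

# Toy usage with random stand-ins for the unlabeled phone/duration sequences.
model = PhoneDurationEncoder()
phones = torch.randint(0, 100, (8, 50))        # batch of 8 utterances, 50 phones each
durations = torch.rand(8, 50)                  # normalized phone durations
pretrain_step(model, phones, durations).backward()

# After pre-training, a scoring head would be fine-tuned on human-annotated fluency
# scores and evaluated with the Pearson correlation coefficient (PCC), e.g.:
pred = torch.tensor([3.1, 4.0, 2.5, 4.8])
human = torch.tensor([3.0, 4.2, 2.0, 5.0])
pcc = torch.corrcoef(torch.stack([pred, human]))[0, 1]
```

The random tensors above merely stand in for the unlabeled speech and text prompts mentioned in the abstract; the fine-tuning stage itself is not shown beyond the PCC computation at the end.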
Related papers
- Deep Learning for Assessment of Oral Reading Fluency [5.707725771108279]
This work investigates end-to-end modeling on a training dataset of children's audio recordings of story texts labeled by human experts.
We report the performance of a number of system variations on the relevant measures, and probe the learned embeddings for lexical and acoustic-prosodic features known to be important to the perception of reading fluency.
arXiv Detail & Related papers (2024-05-29T18:09:35Z)
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Influence Scores at Scale for Efficient Language Data Sampling [3.072340427031969]
"influence scores" are used to identify important subsets of data.
In this paper, we explore the applicability of influence scores in language classification tasks.
arXiv Detail & Related papers (2023-11-27T20:19:22Z)
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
- Self-Supervised Learning for Audio-Based Emotion Recognition [1.7598252755538808]
Self-supervised learning is a family of methods that can learn despite a scarcity of supervised labels.
We apply self-supervised pre-training to the classification of emotions from the acoustic modality of CMU-MOSEI.
We find that self-supervised learning consistently improves the performance of the model across all metrics.
arXiv Detail & Related papers (2023-07-23T14:40:50Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information by adding an additional SSL loss on the intermediate layers (a minimal sketch of this idea follows after this list).
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- Self-Adaptive Training: Bridging the Supervised and Self-Supervised Learning [16.765461276790944]
Self-adaptive training is a unified training algorithm that dynamically calibrates and enhances the training process using model predictions, without incurring extra computational cost.
We analyze the training dynamics of deep networks on training data corrupted by, e.g., random noise and adversarial examples.
Our analysis shows that model predictions are able to magnify useful underlying information in the data, and this phenomenon occurs broadly even in the absence of any label information.
arXiv Detail & Related papers (2021-01-21T17:17:30Z)
- Open-set Short Utterance Forensic Speaker Verification using Teacher-Student Network with Explicit Inductive Bias [59.788358876316295]
We propose a pipeline solution to improve speaker verification on a small actual forensic field dataset.
By leveraging large-scale out-of-domain datasets, a knowledge distillation based objective function is proposed for teacher-student learning.
We show that the proposed objective function can efficiently improve the performance of teacher-student learning on short utterances.
arXiv Detail & Related papers (2020-09-21T00:58:40Z)
- Automatic Recall Machines: Internal Replay, Continual Learning and the Brain [104.38824285741248]
Replay in neural networks involves training on sequential data with memorized samples, which counteracts forgetting of previous behavior caused by non-stationarity.
We present a method where these auxiliary samples are generated on the fly, given only the model that is being trained for the assessed objective.
Instead, the implicit memory of learned samples within the assessed model itself is exploited.
arXiv Detail & Related papers (2020-06-22T15:07:06Z)
- Embodied Self-supervised Learning by Coordinated Sampling and Training [14.107020105091662]
We propose a novel self-supervised approach to solve inverse problems by employing the corresponding physical forward process.
The proposed approach works in an analysis-by-synthesis manner to learn an inference network by iteratively sampling and training.
We prove the feasibility of the proposed method by tackling the acoustic-to-articulatory inversion problem to infer articulatory information from speech.
arXiv Detail & Related papers (2020-06-20T14:05:47Z)
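As a side note on the ILS-SSL entry above, the following is an illustrative sketch, based only on the one-line summary and not on that paper's code, of attaching an extra self-supervised masked-prediction loss to an intermediate encoder layer in addition to the final layer. The layer count, tap position, vocabulary size, and loss weighting are all assumptions.

```python
# Illustrative only: a top-layer masked-prediction loss plus an extra SSL loss taken
# from an intermediate layer, so lower layers are pushed toward content information.
import torch
import torch.nn as nn

class TappedEncoder(nn.Module):
    def __init__(self, vocab=500, d_model=256, n_layers=6, tap_layer=3):
        super().__init__()
        self.emb = nn.Embedding(vocab + 1, d_model)   # +1 for a [MASK] id
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, 4, batch_first=True) for _ in range(n_layers)])
        self.tap_layer = tap_layer
        self.head_top = nn.Linear(d_model, vocab)     # SSL head on the final layer
        self.head_mid = nn.Linear(d_model, vocab)     # extra SSL head on an intermediate layer

    def forward(self, tokens):
        h, mid = self.emb(tokens), None
        for i, blk in enumerate(self.blocks, start=1):
            h = blk(h)
            if i == self.tap_layer:
                mid = h
        return self.head_top(h), self.head_mid(mid)

def ssl_step(model, tokens, vocab=500, mask_rate=0.15):
    mask = torch.rand(tokens.shape) < mask_rate
    top_logits, mid_logits = model(tokens.masked_fill(mask, vocab))  # index `vocab` = [MASK]
    # Total loss: masked prediction at the top layer plus the intermediate layer.
    return (nn.functional.cross_entropy(top_logits[mask], tokens[mask])
            + nn.functional.cross_entropy(mid_logits[mask], tokens[mask]))

model = TappedEncoder()
tokens = torch.randint(0, 500, (4, 80))   # stand-in for discretized speech targets
ssl_step(model, tokens).backward()
```

In the actual ILS-SSL setting the prediction targets come from clustered acoustic features (as in HuBERT) rather than the random integer tokens used as stand-ins here.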