Weakly-supervised forced alignment of disfluent speech using
phoneme-level modeling
- URL: http://arxiv.org/abs/2306.00996v1
- Date: Tue, 30 May 2023 09:57:36 GMT
- Title: Weakly-supervised forced alignment of disfluent speech using
phoneme-level modeling
- Authors: Theodoros Kouzelis, Georgios Paraskevopoulos, Athanasios Katsamanis,
Vassilis Katsouros
- Abstract summary: We propose a simple and effective modification of alignment graph construction using Weighted Finite State Transducers.
The proposed weakly-supervised approach alleviates the need for verbatim transcription of speech disfluencies for forced alignment.
Our evaluation on a corrupted version of the TIMIT test set and the UCLASS dataset shows significant improvements.
- Score: 10.283092375534311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The study of speech disorders can benefit greatly from time-aligned data.
However, audio-text mismatches in disfluent speech cause rapid performance
degradation for modern speech aligners, hindering the use of automatic
approaches. In this work, we propose a simple and effective modification of
alignment graph construction of CTC-based models using Weighted Finite State
Transducers. The proposed weakly-supervised approach alleviates the need for
verbatim transcription of speech disfluencies for forced alignment. During the
graph construction, we allow the modeling of common speech disfluencies, i.e.
repetitions and omissions. Further, we show that by assessing the degree of
audio-text mismatch through the use of Oracle Error Rate, our method can be
effectively used in the wild. Our evaluation on a corrupted version of the
TIMIT test set and the UCLASS dataset shows significant improvements,
particularly for recall, achieving a 23-25% relative improvement over our
baselines.
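The idea of letting the alignment graph model repetitions and omissions can be illustrated with a toy example. The sketch below is not the authors' WFST construction: it omits CTC blank symbols, uses a plain Python arc list instead of a WFST toolkit, and the function names (`build_arcs`, `viterbi_align`, `toy_frames`) and penalty values are hypothetical choices made for illustration. It only shows how adding "skip" and "back" arcs to a linear phoneme graph lets a Viterbi search follow audio that deviates from the transcript.

```python
import math

def build_arcs(n_tokens, skip_penalty=2.0, repeat_penalty=2.0):
    """Arcs (src, dst, cost) over transcript positions 0..n_tokens-1.
    Penalty values are arbitrary illustrative assumptions."""
    arcs = []
    for i in range(n_tokens):
        arcs.append((i, i, 0.0))                     # stay on the same phoneme
        if i + 1 < n_tokens:
            arcs.append((i, i + 1, 0.0))             # normal advance
        if i + 2 < n_tokens:
            arcs.append((i, i + 2, skip_penalty))    # omission: skip phoneme i+1
        if i - 1 >= 0:
            arcs.append((i, i - 1, repeat_penalty))  # repetition: step back one phoneme
    return arcs

def viterbi_align(log_probs, tokens, arcs):
    """Best transcript position per frame.
    log_probs[t][symbol] is the frame-level log-probability of `symbol`.
    The path must start at the first token and end at the last one."""
    n = len(tokens)
    neg_inf = -math.inf
    score = [neg_inf] * n
    score[0] = log_probs[0][tokens[0]]
    backpointers = []
    for t in range(1, len(log_probs)):
        new_score = [neg_inf] * n
        ptr = [-1] * n
        for src, dst, cost in arcs:
            if score[src] == neg_inf:
                continue
            cand = score[src] - cost + log_probs[t][tokens[dst]]
            if cand > new_score[dst]:
                new_score[dst], ptr[dst] = cand, src
        score = new_score
        backpointers.append(ptr)
    if score[n - 1] == neg_inf:
        raise ValueError("final token is unreachable with this many frames")
    path = [n - 1]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

def toy_frames(symbols, vocab="abc"):
    """Synthetic frame posteriors: 0.9 on the 'spoken' symbol, 0.05 elsewhere."""
    return [{v: math.log(0.9 if v == s else 0.05) for v in vocab} for s in symbols]

tokens = ["a", "b", "c"]              # transcript without disfluencies
arcs = build_arcs(len(tokens))
# Speaker repeats "a b" before finishing: the back arc absorbs the repetition.
repeated = viterbi_align(toy_frames("ababc"), tokens, arcs)
# Speaker omits "b": the skip arc absorbs the omission.
omitted = viterbi_align(toy_frames("aacc"), tokens, arcs)
print(repeated, omitted)
```

With a strictly linear graph (only the stay and advance arcs), the same inputs would force at least one frame onto a mismatched phoneme, which is the kind of audio-text mismatch the abstract describes.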
Related papers
- Augmenting Automatic Speech Recognition Models with Disfluency Detection [12.45703869323415]
Speech disfluency commonly occurs in conversational and spontaneous speech.
Current research mainly focuses on detecting disfluencies within transcripts, overlooking their exact location and duration in the speech.
We present an inference-only approach to augment any ASR model with the ability to detect open-set disfluencies.
arXiv Detail & Related papers (2024-09-16T11:13:14Z)
- Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation [0.0]
A critical barrier to progress is the scarcity of large, annotated disfluent speech datasets.
We present an inclusive ASR design approach, leveraging self-supervised learning on standard speech followed by targeted fine-tuning and data augmentation.
Results show that fine-tuning wav2vec 2.0 with even a relatively small, labeled dataset, alongside data augmentation, can significantly reduce word error rates for disfluent speech.
arXiv Detail & Related papers (2024-06-14T16:56:40Z)
- Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on several recent LLMs show that our approach achieves up to a 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z)
- Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC).
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
arXiv Detail & Related papers (2023-06-10T08:58:53Z)
- DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction [50.51901599433536]
DisfluencyFixer is a tool that performs speech-to-speech disfluency correction in English and Hindi.
Our proposed system removes disfluencies from input speech and returns fluent speech as output.
arXiv Detail & Related papers (2023-05-26T14:13:38Z)
- Text-Aware End-to-end Mispronunciation Detection and Diagnosis [17.286013739453796]
Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems.
In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information.
arXiv Detail & Related papers (2022-06-15T04:08:10Z)
- Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation [60.26511271597065]
Speech distortions are a long-standing problem that degrades the performance of speech processing models trained with supervision.
It is high time that we enhance the robustness of speech processing models to obtain good performance when encountering speech distortions.
arXiv Detail & Related papers (2022-03-30T07:25:52Z)
- Curriculum optimization for low-resource speech recognition [4.803994937990389]
We propose an automated curriculum learning approach to optimize the sequence of training examples.
We introduce a new difficulty measure called compression ratio that can be used as a scoring function for raw audio in various noise conditions.
arXiv Detail & Related papers (2022-02-17T19:47:50Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction.
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.