Improved Robustness to Disfluencies in RNN-Transducer Based Speech Recognition
- URL: http://arxiv.org/abs/2012.06259v1
- Date: Fri, 11 Dec 2020 11:47:13 GMT
- Title: Improved Robustness to Disfluencies in RNN-Transducer Based Speech Recognition
- Authors: Valentin Mendelev, Tina Raissi, Guglielmo Camporese, Manuel Giollo
- Abstract summary: We investigate data selection and preparation choices aiming for improved robustness of RNN-T ASR to speech disfluencies.
We show that after including a small amount of data with disfluencies in the training set, the recognition accuracy on the tests with disfluencies and stuttering improves.
- Score: 1.8702587873591643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic Speech Recognition (ASR) based on Recurrent Neural Network
Transducers (RNN-T) is gaining interest in the speech community. We investigate
data selection and preparation choices aiming for improved robustness of RNN-T
ASR to speech disfluencies with a focus on partial words. For evaluation we use
clean data, data with disfluencies and a separate dataset with speech affected
by stuttering. We show that after including a small amount of data with
disfluencies in the training set the recognition accuracy on the tests with
disfluencies and stuttering improves. Increasing the amount of training data
with disfluencies gives additional gains without degradation on the clean data.
We also show that replacing partial words with a dedicated token helps to get
even better accuracy on utterances with disfluencies and stutter. The
evaluation of our best model shows 22.5% and 16.4% relative WER reduction on
those two evaluation sets.
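The dedicated-token idea is simple to illustrate. Below is a minimal sketch, assuming transcripts mark partial words with a trailing hyphen (e.g. "wan-") and using "<pw>" as our own placeholder token name; the paper's abstract does not specify its exact transcript convention or token symbol:

```python
import re

# Assumed convention: partial words carry a trailing hyphen in the
# transcript (e.g. "i wan- want to go"). The paper maps all partial
# words to one dedicated token; "<pw>" is our placeholder name.
PARTIAL_WORD = re.compile(r"\b\w+-(?=\s|$)")

def replace_partial_words(transcript: str, token: str = "<pw>") -> str:
    """Map every partial word to a single dedicated token."""
    return PARTIAL_WORD.sub(token, transcript)

print(replace_partial_words("i wan- want to go ho- home"))
# -> "i <pw> want to go <pw> home"
```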
Related papers
- Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation [0.0]
A critical barrier to progress is the scarcity of large, annotated disfluent speech datasets.
We present an inclusive ASR design approach, leveraging self-supervised learning on standard speech followed by targeted fine-tuning and data augmentation.
Results show that fine-tuning wav2vec 2.0 with even a relatively small, labeled dataset, alongside data augmentation, can significantly reduce word error rates for disfluent speech.
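As a rough illustration of that recipe, here is a hedged sketch using the Hugging Face transformers API; the checkpoint name, learning rate, and built-in SpecAugment-style masking (standing in for whatever augmentation the authors used) are illustrative choices, not the paper's settings:

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Adapt a pretrained wav2vec 2.0 checkpoint to disfluent speech with a
# small labeled set. All hyperparameters here are illustrative.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.freeze_feature_encoder()          # common practice: keep the CNN front end fixed
model.config.apply_spec_augment = True  # SpecAugment-style time masking as augmentation
model.config.mask_time_prob = 0.05

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(waveform, transcript):
    """One CTC fine-tuning step on a (16 kHz audio, verbatim transcript) pair."""
    inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
    labels = processor(text=transcript, return_tensors="pt").input_ids
    loss = model(input_values=inputs.input_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```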
arXiv Detail & Related papers (2024-06-14T16:56:40Z)
- BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition [72.51848069125822]
We propose BRAVEn, an extension to the RAVEn method, which learns speech representations entirely from raw audio-visual data.
Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods.
Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.
arXiv Detail & Related papers (2024-04-02T16:48:20Z)
- Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC).
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
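Since the technique crucially depends on synthetically generated disfluent data, a toy generator in that spirit is sketched below; the injection probabilities, filler inventory, and keep/delete tag names are our own illustrative choices:

```python
import random

# Inject repetitions and filler words into a fluent sentence and emit
# per-token tags for a sequence-tagging disfluency-correction model.
# Tag "D" = disfluent (delete), "F" = fluent (keep); names are ours.
FILLERS = ["uh", "um", "you know"]

def make_disfluent(tokens, p_repeat=0.15, p_filler=0.1, seed=0):
    rng = random.Random(seed)
    out, tags = [], []
    for tok in tokens:
        if rng.random() < p_filler:
            filler = rng.choice(FILLERS).split()
            out += filler
            tags += ["D"] * len(filler)
        if rng.random() < p_repeat:   # stutter-like repetition of the word
            out.append(tok)
            tags.append("D")
        out.append(tok)
        tags.append("F")
    return out, tags

words, labels = make_disfluent("i want to go home".split())
print(list(zip(words, labels)))
```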
arXiv Detail & Related papers (2023-06-10T08:58:53Z)
- Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling [10.283092375534311]
We propose a simple and effective modification of alignment graph construction using weighted Finite State Transducers.
The proposed weakly-supervised approach alleviates the need for verbatim transcription of speech disfluencies for forced alignment.
Our evaluation on a corrupted version of the TIMIT test set and the UCLASS dataset shows significant improvements.
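The construction can be sketched abstractly: instead of a strict linear graph over the expected phonemes, weighted arcs tolerate disfluencies. The topology and weights below are purely illustrative, not the paper's actual wFST:

```python
# Arcs are (src_state, dst_state, label, weight), weights being
# negative log-probabilities. Besides the free canonical path, each
# phoneme gets a weighted self-loop (repetition/stutter) and a weighted
# epsilon arc (skip/deletion). Values 2.0 and 3.0 are made up.
def alignment_graph(phonemes, w_repeat=2.0, w_skip=3.0):
    """Disfluency-tolerant linear alignment graph over a phoneme sequence."""
    arcs = []
    for i, ph in enumerate(phonemes):
        arcs.append((i, i + 1, ph, 0.0))          # canonical path: free
        arcs.append((i + 1, i + 1, ph, w_repeat))  # phoneme may repeat
        arcs.append((i, i + 1, "<eps>", w_skip))   # phoneme may be skipped
    return arcs

for arc in alignment_graph(["hh", "ax", "l", "ow"]):
    print(arc)
```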
arXiv Detail & Related papers (2023-05-30T09:57:36Z)
- DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction [50.51901599433536]
DisfluencyFixer is a tool that performs speech-to-speech disfluency correction in English and Hindi.
Our proposed system removes disfluencies from input speech and returns fluent speech as output.
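A plausible decomposition of such a speech-to-speech tool is ASR, then text-level disfluency correction, then TTS; the sketch below only shows that composition, with placeholder callables standing in for the actual models:

```python
from typing import Callable

# ASR -> disfluency correction -> TTS, composed into one function.
# All three stages are placeholders for whatever models are plugged in.
def disfluency_fixer(
    asr: Callable[[bytes], str],      # speech -> (possibly disfluent) text
    correct: Callable[[str], str],    # disfluent text -> fluent text
    tts: Callable[[str], bytes],      # fluent text -> speech
) -> Callable[[bytes], bytes]:
    """Compose the stages into a single speech-to-speech function."""
    return lambda audio: tts(correct(asr(audio)))
```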
arXiv Detail & Related papers (2023-05-26T14:13:38Z)
- Advancing Stuttering Detection via Data Augmentation, Class-Balanced Loss and Multi-Contextual Deep Learning [7.42741711946564]
Stuttering is a neuro-developmental speech impairment characterized by uncontrolled utterances and core behaviors.
In this paper, we investigate the effectiveness of data augmentation on top of a multi-branched training scheme to tackle data scarcity.
In addition, we propose a multi-contextual (MC) StutterNet, which exploits different contexts of the stuttered speech.
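Assuming the class-balanced loss in the title follows the common effective-number-of-samples formulation of Cui et al. (2019), a minimal PyTorch sketch looks like this (beta and the class counts are illustrative):

```python
import torch
import torch.nn.functional as F

# Weight each class by (1 - beta) / (1 - beta^n_c), where n_c is the
# number of training samples of class c, so rare stutter types are
# up-weighted relative to abundant fluent speech.
def class_balanced_weights(samples_per_class, beta=0.999):
    n = torch.as_tensor(samples_per_class, dtype=torch.float32)
    w = (1.0 - beta) / (1.0 - beta ** n)
    return w / w.sum() * len(samples_per_class)  # normalize to mean 1

# Example: many "fluent" clips, few "repetition"/"block"/"prolongation" clips.
weights = class_balanced_weights([4000, 300, 120, 80])
logits = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))
loss = F.cross_entropy(logits, targets, weight=weights)
```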
arXiv Detail & Related papers (2023-02-21T14:03:47Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
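The switching step can be sketched as follows; a plain InfoNCE loss stands in for the exact wav2vec 2.0 contrastive objective, so treat this as the idea rather than the implementation:

```python
import torch
import torch.nn.functional as F

# Each utterance is encoded twice (original and noisy view). Besides
# predicting its own quantized targets, each view must also predict the
# *other* view's targets, forcing the contextual representations to
# agree across noise conditions.
def info_nce(context, targets, temperature=0.1):
    # context, targets: (frames, dim); negatives come from other frames
    sim = F.cosine_similarity(context.unsqueeze(1), targets.unsqueeze(0), dim=-1)
    labels = torch.arange(context.size(0))
    return F.cross_entropy(sim / temperature, labels)

def switched_loss(c_orig, c_noisy, q_orig, q_noisy):
    standard = info_nce(c_orig, q_orig) + info_nce(c_noisy, q_noisy)
    switched = info_nce(c_orig, q_noisy) + info_nce(c_noisy, q_orig)
    return standard + switched
```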
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time stamping method can get less than 50ms word timing difference on average.
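The paper's confidence estimator uses several decoding-time features, which is richer than any single statistic; one common baseline, though, is to aggregate subword posteriors into word-level scores. A minimal sketch, assuming SentencePiece-style "▁" word boundaries and a deliberately pessimistic minimum:

```python
import math

def word_confidences(tokens, log_probs, word_boundary="▁"):
    """tokens: decoded subword strings; log_probs: their log posteriors.
    Returns one confidence per word: the minimum subword posterior."""
    words, current = [], []
    for tok, lp in zip(tokens, log_probs):
        if tok.startswith(word_boundary) and current:
            words.append(min(current))   # close the previous word
            current = []
        current.append(math.exp(lp))
    if current:
        words.append(min(current))
    return words

print(word_confidences(["▁he", "llo", "▁world"], [-0.1, -0.9, -0.2]))
# -> [0.406..., 0.818...]
```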
arXiv Detail & Related papers (2021-04-27T23:31:43Z)
- Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning [60.20150317299749]
This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning.
To make full use of the training data, we propose a full data learning method for speech enhancement.
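A minimal TDNN-style enhancer can be written as a stack of dilated 1-D convolutions over feature frames, widening the temporal context layer by layer; the widths and dilations below are illustrative, not the paper's architecture:

```python
import torch
import torch.nn as nn

class TDNNEnhancer(nn.Module):
    """Dilated Conv1d stack mapping noisy feature frames to enhanced ones."""
    def __init__(self, feat_dim=80, hidden=256, dilations=(1, 2, 4, 8)):
        super().__init__()
        layers, in_ch = [], feat_dim
        for d in dilations:
            layers += [nn.Conv1d(in_ch, hidden, kernel_size=3,
                                 dilation=d, padding=d),  # padding keeps length
                       nn.ReLU()]
            in_ch = hidden
        self.tdnn = nn.Sequential(*layers)
        self.out = nn.Conv1d(hidden, feat_dim, kernel_size=1)

    def forward(self, x):            # x: (batch, frames, feat_dim)
        h = self.tdnn(x.transpose(1, 2))
        return self.out(h).transpose(1, 2)

noisy = torch.randn(4, 200, 80)
enhanced = TDNNEnhancer()(noisy)     # same shape as the input
```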
arXiv Detail & Related papers (2020-11-11T06:32:37Z)
- FluentNet: End-to-End Detection of Speech Disfluency with Deep Learning [23.13972240042859]
We propose an end-to-end deep neural network, FluentNet, capable of detecting a number of different disfluency types.
FluentNet consists of a Squeeze-and-Excitation Residual convolutional neural network which facilitates the learning of strong spectral frame-level representations.
We present a disfluency dataset based on the public LibriSpeech dataset with synthesized stutters.
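The squeeze-and-excitation mechanism at the heart of such residual blocks is compact enough to sketch; the reduction ratio below is the usual default, not necessarily FluentNet's setting:

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel recalibration: pool each channel globally, pass the result
    through a small bottleneck MLP, and rescale channels with the gates."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (batch, channels, freq, time)
        squeeze = x.mean(dim=(2, 3))      # per-channel global context
        gates = self.fc(squeeze)          # per-channel importance in [0, 1]
        return x * gates[:, :, None, None]

spec = torch.randn(2, 64, 40, 100)        # batch of spectrogram feature maps
out = SqueezeExcite(64)(spec)
```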
arXiv Detail & Related papers (2020-09-23T21:51:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.