End-to-End Speech Recognition and Disfluency Removal
- URL: http://arxiv.org/abs/2009.10298v3
- Date: Mon, 28 Sep 2020 23:07:21 GMT
- Title: End-to-End Speech Recognition and Disfluency Removal
- Authors: Paria Jamshid Lou and Mark Johnson
- Abstract summary: This paper investigates the task of end-to-end speech recognition and disfluency removal.
We show that end-to-end models do learn to directly generate fluent transcripts.
We propose two new metrics that can be used for evaluating integrated ASR and disfluency models.
- Score: 15.910282983166024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Disfluency detection is usually an intermediate step between an automatic
speech recognition (ASR) system and a downstream task. By contrast, this paper
aims to investigate the task of end-to-end speech recognition and disfluency
removal. We specifically explore whether it is possible to train an ASR model
to directly map disfluent speech into fluent transcripts, without relying on a
separate disfluency detection model. We show that end-to-end models do learn to
directly generate fluent transcripts; however, their performance is slightly
worse than a baseline pipeline approach consisting of an ASR system and a
disfluency detection model. We also propose two new metrics for evaluating
integrated ASR and disfluency models. The findings of this paper can serve as
a benchmark for further research on end-to-end speech recognition and
disfluency removal.
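As a toy illustration of the evaluation setting (not the paper's two proposed metrics), the sketch below scores system output against a *fluent* reference transcript with standard word error rate: an ideal end-to-end model emits the fluent transcript directly, while an ASR system without disfluency removal is penalized for every disfluent token it leaves in. The utterances and outputs are invented for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)

# Disfluent speech: "i want a flight uh to denver" (filler "uh")
fluent_ref   = "i want a flight to denver"
e2e_output   = "i want a flight to denver"     # end-to-end model removed the filler
verbatim_out = "i want a flight uh to denver"  # plain ASR keeps the disfluency

print(wer(fluent_ref, e2e_output))    # 0.0
print(wer(fluent_ref, verbatim_out))  # 1/6, one inserted "uh"
```

Scoring against the fluent reference is what makes the comparison between the integrated end-to-end model and the ASR-plus-disfluency-detection pipeline meaningful.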
Related papers
- It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
Such text-only correction, however, ignores the acoustic evidence. In this work, we aim to overcome that limitation by infusing acoustic information before generating the predicted transcription, through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF).
arXiv Detail & Related papers (2024-02-08T07:21:45Z)
- Automatic Disfluency Detection from Untranscribed Speech [25.534535098405602]
Stuttering is a speech disorder characterized by a high rate of disfluencies.
Automatic disfluency detection may help in treatment planning for individuals who stutter.
We investigate language, acoustic, and multimodal methods for frame-level automatic disfluency detection and categorization.
arXiv Detail & Related papers (2023-11-01T21:36:39Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can use their generative capability to correct even tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction [50.51901599433536]
DisfluencyFixer is a tool that performs speech-to-speech disfluency correction in English and Hindi.
Our proposed system removes disfluencies from input speech and returns fluent speech as output.
arXiv Detail & Related papers (2023-05-26T14:13:38Z)
- Streaming Joint Speech Recognition and Disfluency Detection [30.018034246393725]
We propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection.
Compared to pipeline approaches, the joint models can leverage acoustic information, making disfluency detection robust to recognition errors.
We show that the proposed joint models outperformed a BERT-based pipeline approach in both accuracy and latency.
arXiv Detail & Related papers (2022-11-16T07:34:20Z)
- The Far Side of Failure: Investigating the Impact of Speech Recognition Errors on Subsequent Dementia Classification [8.032686410648274]
Linguistic anomalies detectable in spontaneous speech have shown promise for various clinical applications including screening for dementia and other forms of cognitive impairment.
The impressive performance of self-supervised learning (SSL) automatic speech recognition (ASR) models on curated speech data does not carry over to challenging speech samples from clinical settings.
One of our key findings is that, paradoxically, ASR systems with relatively high error rates can produce transcripts that result in better downstream classification accuracy than classification based on verbatim transcripts.
arXiv Detail & Related papers (2022-11-11T17:06:45Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Improving Distinction between ASR Errors and Speech Disfluencies with Feature Space Interpolation [0.0]
Fine-tuning pretrained language models (LMs) is a popular approach to automatic speech recognition (ASR) error detection during post-processing.
This paper proposes a scheme to improve existing LM-based ASR error detection systems.
arXiv Detail & Related papers (2021-08-04T02:11:37Z)
- Auxiliary Sequence Labeling Tasks for Disfluency Detection [6.460424516393765]
We propose a method utilizing named entity recognition (NER) and part-of-speech (POS) as auxiliary sequence labeling (SL) tasks for disfluency detection.
We show that training a disfluency detection model with auxiliary SL tasks can improve its F-score in disfluency detection.
Experimental results on the widely used English Switchboard dataset show that our method outperforms the previous state-of-the-art in disfluency detection.
arXiv Detail & Related papers (2020-10-24T02:51:17Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.