End-to-End Speech Recognition and Disfluency Removal
- URL: http://arxiv.org/abs/2009.10298v3
- Date: Mon, 28 Sep 2020 23:07:21 GMT
- Title: End-to-End Speech Recognition and Disfluency Removal
- Authors: Paria Jamshid Lou and Mark Johnson
- Abstract summary: This paper investigates the task of end-to-end speech recognition and disfluency removal.
We show that end-to-end models do learn to directly generate fluent transcripts.
We propose two new metrics that can be used for evaluating integrated ASR and disfluency models.
- Score: 15.910282983166024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Disfluency detection is usually an intermediate step between an automatic
speech recognition (ASR) system and a downstream task. By contrast, this paper
aims to investigate the task of end-to-end speech recognition and disfluency
removal. We specifically explore whether it is possible to train an ASR model
to directly map disfluent speech into fluent transcripts, without relying on a
separate disfluency detection model. We show that end-to-end models do learn to
directly generate fluent transcripts; however, their performance is slightly
worse than a baseline pipeline approach consisting of an ASR system and a
disfluency detection model. We also propose two new metrics that can be used
for evaluating integrated ASR and disfluency models. The findings of this paper
can serve as a benchmark for further research on the task of end-to-end speech
recognition and disfluency removal in the future.
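As a rough illustration of how an integrated model can be scored, the sketch below computes an ordinary word error rate against a fluent (disfluency-removed) reference transcript. This is a generic, assumed evaluation for illustration only, not the two metrics proposed in the paper.
```python
# Hypothetical fluent-reference scoring: compare an end-to-end model's output
# against a reference transcript with the disfluencies removed. This is a plain
# word-error-rate computation for illustration; the paper's two proposed
# metrics are defined in the paper itself.

def word_error_rate(hypothesis: str, reference: str) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Disfluent source: "i want a flight to boston uh i mean to denver"
fluent_reference = "i want a flight to denver"
model_output = "i want a flight uh to denver"
print(f"fluent-reference WER: {word_error_rate(model_output, fluent_reference):.2f}")
```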
Related papers
- Augmenting Automatic Speech Recognition Models with Disfluency Detection [12.45703869323415]
Speech disfluency commonly occurs in conversational and spontaneous speech.
Current research mainly focuses on detecting disfluencies within transcripts, overlooking their exact location and duration in the speech.
We present an inference-only approach to augment any ASR model with the ability to detect open-set disfluencies.
arXiv Detail & Related papers (2024-09-16T11:13:14Z)
- Towards interfacing large language models with ASR systems using confidence measures and prompting [54.39667883394458]
This work investigates post-hoc correction of ASR transcripts with large language models (LLMs).
To avoid introducing errors into likely accurate transcripts, we propose a range of confidence-based filtering methods.
Our results indicate that this can improve the performance of less competitive ASR systems.
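A minimal sketch of the confidence-based filtering idea, assuming per-hypothesis confidence scores and a placeholder LLM correction function; the threshold and function names are illustrative, not the paper's actual setup.
```python
# Hypothetical confidence-gated post-hoc correction: hypotheses the ASR system
# is already confident about are left untouched; only low-confidence ones are
# passed to an (unspecified) LLM corrector. The threshold and correct_with_llm
# are placeholders, not the paper's configuration.

from typing import Callable

def filter_and_correct(hypotheses: list[tuple[str, float]],
                       correct_with_llm: Callable[[str], str],
                       threshold: float = 0.9) -> list[str]:
    corrected = []
    for text, confidence in hypotheses:
        if confidence >= threshold:
            corrected.append(text)                    # trust the ASR output as-is
        else:
            corrected.append(correct_with_llm(text))  # defer to the LLM
    return corrected

# Toy usage with a stand-in "LLM" that returns its input unchanged.
outputs = filter_and_correct(
    [("book a flight to denver", 0.97), ("book a fright to denver", 0.62)],
    correct_with_llm=lambda text: text,
)
print(outputs)
```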
arXiv Detail & Related papers (2024-07-31T08:00:41Z)
- Automatic Disfluency Detection from Untranscribed Speech [25.534535098405602]
Stuttering is a speech disorder characterized by a high rate of disfluencies.
Automatic disfluency detection may help in treatment planning for individuals who stutter.
We investigate language, acoustic, and multimodal methods for frame-level automatic disfluency detection and categorization.
arXiv Detail & Related papers (2023-11-01T21:36:39Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can use their generative capability to correct even tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction [50.51901599433536]
DisfluencyFixer is a tool that performs speech-to-speech disfluency correction in English and Hindi.
Our proposed system removes disfluencies from input speech and returns fluent speech as output.
arXiv Detail & Related papers (2023-05-26T14:13:38Z)
- Streaming Joint Speech Recognition and Disfluency Detection [30.018034246393725]
We propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection.
Compared to pipeline approaches, the joint models can leverage acoustic information that makes disfluency detection robust to recognition errors.
We show that the proposed joint models outperformed a BERT-based pipeline approach in both accuracy and latency.
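One plausible way such a joint model could expose its disfluency decisions is inline markup in the decoded transcript. The sketch below assumes a hypothetical <disfl>...</disfl> tagging scheme (not necessarily the scheme used in the paper) and simply strips the marked spans to recover a fluent transcript.
```python
# Hypothetical post-processing for a joint model whose decoder emits inline
# disfluency markup. The <disfl> ... </disfl> tags are an assumed scheme used
# only for illustration: strip the marked spans to obtain a fluent transcript
# while keeping the tagged transcript for downstream use.

import re

DISFL_SPAN = re.compile(r"<disfl>.*?</disfl>\s*")

def to_fluent(tagged_transcript: str) -> str:
    return re.sub(r"\s+", " ", DISFL_SPAN.sub("", tagged_transcript)).strip()

tagged = "i want a flight <disfl> to boston uh i mean </disfl> to denver"
print(to_fluent(tagged))  # -> "i want a flight to denver"
```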
arXiv Detail & Related papers (2022-11-16T07:34:20Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Improving Distinction between ASR Errors and Speech Disfluencies with Feature Space Interpolation [0.0]
Fine-tuning pretrained language models (LMs) is a popular approach to automatic speech recognition (ASR) error detection during post-processing.
This paper proposes a scheme to improve existing LM-based ASR error detection systems.
arXiv Detail & Related papers (2021-08-04T02:11:37Z)
- Auxiliary Sequence Labeling Tasks for Disfluency Detection [6.460424516393765]
We propose a method utilizing named entity recognition (NER) and part-of-speech (POS) as auxiliary sequence labeling (SL) tasks for disfluency detection.
We show that training a disfluency detection model with auxiliary SL tasks can improve its F-score in disfluency detection.
Experimental results on the widely used English Switchboard dataset show that our method outperforms the previous state-of-the-art in disfluency detection.
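A minimal sketch of the auxiliary-task idea: a shared encoder feeding separate tagging heads for disfluency labels and for the auxiliary POS and NER labels. The dimensions, label-set sizes, and the plain GRU encoder are assumptions for illustration, not the paper's architecture.
```python
# Hypothetical multitask tagger: one shared encoder, a main head for per-token
# disfluency labels, and auxiliary heads for POS and NER tags. All sizes and
# the GRU encoder are illustrative assumptions.

import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden=128,
                 n_disfluency=2, n_pos=17, n_ner=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.disfluency_head = nn.Linear(2 * hidden, n_disfluency)  # main task
        self.pos_head = nn.Linear(2 * hidden, n_pos)                # auxiliary task
        self.ner_head = nn.Linear(2 * hidden, n_ner)                # auxiliary task

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))
        return (self.disfluency_head(states),
                self.pos_head(states),
                self.ner_head(states))

# Toy batch: one sentence of 6 token ids. During training, the losses from all
# heads would be summed (possibly with task weights).
model = MultiTaskTagger()
disfl_logits, pos_logits, ner_logits = model(torch.randint(0, 1000, (1, 6)))
print(disfl_logits.shape, pos_logits.shape, ner_logits.shape)
```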
arXiv Detail & Related papers (2020-10-24T02:51:17Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.