Evaluating Automatic Speech Recognition in an Incremental Setting
- URL: http://arxiv.org/abs/2302.12049v1
- Date: Thu, 23 Feb 2023 14:22:40 GMT
- Title: Evaluating Automatic Speech Recognition in an Incremental Setting
- Authors: Ryan Whetten, Mir Tahsin Imtiaz, Casey Kennington
- Abstract summary: We systematically evaluate six speech recognizers using metrics including word error rate, latency, and the number of updates to already recognized words on English test data.
We find that, generally, local recognizers are faster and require fewer updates than cloud-based recognizers.
- Score: 0.7734726150561086
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The increasing reliability of automatic speech recognition has proliferated
its everyday use. However, for research purposes, it is often unclear which
model one should choose for a task, particularly if there is a requirement for
speed as well as accuracy. In this paper, we systematically evaluate six speech
recognizers using metrics including word error rate, latency, and the number of
updates to already recognized words on English test data, as well as propose
and compare two methods for streaming audio into recognizers for incremental
recognition. We further propose Revokes per Second as a new metric for
evaluating incremental recognition and demonstrate that it provides insights
into overall model performance. We find that, generally, local recognizers are
faster and require fewer updates than cloud-based recognizers. Finally, we find
Meta's Wav2Vec model to be the fastest, and find Mozilla's DeepSpeech model to
be the most stable in its predictions.
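The paper's exact formulation of Revokes per Second is not given here; as a minimal sketch, assuming a "revoke" means any word in a partial hypothesis that a later partial hypothesis changes or removes, the metric can be computed from a stream of partial transcripts and the audio duration:

```python
def count_revokes(prev_words, new_words):
    """Count words from the previous partial hypothesis that the new
    hypothesis retracts or changes (each such word is one revoke)."""
    revokes = 0
    for i, word in enumerate(prev_words):
        if i >= len(new_words) or new_words[i] != word:
            revokes += 1
    return revokes

def revokes_per_second(partial_hypotheses, audio_seconds):
    """Total revokes over a stream of partial transcripts,
    normalized by the duration of the input audio."""
    total = 0
    prev = []
    for hyp in partial_hypotheses:
        words = hyp.split()
        total += count_revokes(prev, words)
        prev = words
    return total / audio_seconds

# Hypothetical stream of incremental outputs for 2 seconds of audio:
# "cat" is first revoked to "cap", then "cap" back to "cat" -> 2 revokes.
partials = ["the", "the cat", "the cap sat", "the cat sat"]
print(revokes_per_second(partials, 2.0))  # -> 1.0
```

Under this reading, a stable recognizer that only appends words scores near zero, while one that frequently rewrites committed words scores high, which matches the abstract's use of the metric to compare prediction stability across recognizers.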
Related papers
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Self-consistent context aware conformer transducer for speech recognition [0.06008132390640294]
We introduce a novel neural network module that adeptly handles recursive data flow in neural network architectures.
Our method notably improves the accuracy of recognizing rare words without adversely affecting the word error rate for common vocabulary.
Our findings reveal that the combination of both approaches can improve the accuracy of detecting rare words by as much as 4.5 times.
arXiv Detail & Related papers (2024-02-09T18:12:11Z)
- Proactive Detection of Voice Cloning with Localized Watermarking [50.13539630769929]
We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech.
AudioSeal employs a generator/detector architecture trained jointly with a localization loss to enable localized watermark detection up to the sample level.
AudioSeal achieves state-of-the-art performance in terms of robustness to real life audio manipulations and imperceptibility based on automatic and human evaluation metrics.
arXiv Detail & Related papers (2024-01-30T18:56:22Z)
- Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach to recognize new words.
We use a memory-enhanced Automatic Speech Recognition model from previous work.
We show that with this approach, we obtain increasing performance on the new words when they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z)
- SememeASR: Boosting Performance of End-to-End Speech Recognition against Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge [58.979490858061745]
We introduce sememe-based semantic knowledge information to speech recognition.
Our experiments show that sememe information can improve the effectiveness of speech recognition.
In addition, our further experiments show that sememe knowledge can improve the model's recognition of long-tailed data.
arXiv Detail & Related papers (2023-09-04T08:35:05Z)
- Deep LSTM Spoken Term Detection using Wav2Vec 2.0 Recognizer [0.0]
The paper describes a bootstrapping approach that allows the transfer of the knowledge contained in traditional pronunciation vocabulary of DNN-HMM hybrid ASR into the context of grapheme-based Wav2Vec.
The proposed method outperforms the previously published system based on the combination of the DNN-HMM hybrid ASR and phoneme recognizer by a large margin on the MALACH data in both English and Czech languages.
arXiv Detail & Related papers (2022-10-21T11:26:59Z)
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels.
It achieves similar word-error-rates as previous work using self-supervised learning with non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
- A Lightweight Speaker Recognition System Using Timbre Properties [0.5708902722746041]
We propose a lightweight text-independent speaker recognition model based on a random forest classifier.
It also introduces new features that are used for both speaker verification and identification tasks.
The prototype uses the seven most actively searched properties: boominess, brightness, depth, hardness, timbre, sharpness, and warmth.
arXiv Detail & Related papers (2020-10-12T07:56:03Z)
- Towards Relevance and Sequence Modeling in Language Recognition [39.547398348702025]
We propose a neural network framework utilizing short-sequence information in language recognition.
A new model is proposed for incorporating relevance in language recognition, where parts of speech data are weighted more based on their relevance for the language recognition task.
Experiments are performed using the language recognition task in NIST LRE 2017 Challenge using clean, noisy and multi-speaker speech data.
arXiv Detail & Related papers (2020-04-02T18:31:18Z)