Fast Word Error Rate Estimation Using Self-Supervised Representations
For Speech And Text
- URL: http://arxiv.org/abs/2310.08225v1
- Date: Thu, 12 Oct 2023 11:17:40 GMT
- Title: Fast Word Error Rate Estimation Using Self-Supervised Representations
For Speech And Text
- Authors: Chanho Park, Chengsong Lu, Mingjie Chen, Thomas Hain
- Abstract summary: The quality of automatic speech recognition (ASR) is typically measured by word error rate (WER).
WER estimation is a task aiming to predict the WER of an ASR system, given a speech utterance and a transcription.
This paper introduces a Fast WER estimator (Fe-WER) using self-supervised learning representation (SSLR).
- Score: 23.25173244408922
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quality of automatic speech recognition (ASR) is typically measured by
word error rate (WER). WER estimation is a task aiming to predict the WER of an
ASR system, given a speech utterance and a transcription. This task has gained
increasing attention as advanced ASR systems are trained on large amounts of
data. In this case, WER estimation becomes necessary in many scenarios, for
example, selecting training data with unknown transcription quality or
estimating the testing performance of an ASR system without ground truth
transcriptions. With large amounts of data, the computational efficiency of a
WER estimator becomes essential in practical applications. However, previous
work has usually not treated it as a priority. In this paper, a Fast WER
estimator (Fe-WER) using self-supervised learning representation (SSLR) is
introduced. The estimator is built upon SSLR aggregated by average pooling. The
results show that Fe-WER outperformed the e-WER3 baseline on Ted-Lium3 by 19.69%
and 7.16% relative in root mean square error and Pearson correlation
coefficient, respectively. Moreover, the duration-weighted WER estimate was
10.43% when the target was 10.88%. Lastly, the inference speed was about 4x in
terms of real-time factor.
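
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: all function names are hypothetical, the pooled vector stands in for whatever SSLR features the estimator consumes, and the duration-weighted figure is assumed here to be a weighted mean of per-utterance WER values with utterance durations as weights.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def average_pool(frames):
    """Collapse frame-level vectors (num_frames x dim) into one utterance vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def rmse(targets, estimates):
    """Root mean square error between true and estimated WER values."""
    squared = [(t - e) ** 2 for t, e in zip(targets, estimates)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def duration_weighted(values, durations):
    """Assumed definition: mean of per-utterance values weighted by duration."""
    return sum(v * d for v, d in zip(values, durations)) / sum(durations)
```

In this sketch, an estimator would be scored by passing reference WER values and its predictions to `rmse` and `pearson`, and a corpus-level figure would be obtained by calling `duration_weighted` on the predictions and the utterance durations.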
Related papers
- Automatic Speech Recognition System-Independent Word Error Rate Estimation [23.25173244408922]
Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems.
In this paper, a hypothesis generation method for ASR System-Independent WER estimation is proposed.
(2024-04-25T16:57:05Z)
- UCorrect: An Unsupervised Framework for Automatic Speech Recognition Error Correction [18.97378605403447]
We propose UCorrect, an unsupervised Detector-Generator-Selector framework for ASR Error Correction.
Experiments on the public AISHELL-1 dataset and WenetSpeech dataset show the effectiveness of UCorrect.
(2024-01-11T06:30:07Z)
- TeLeS: Temporal Lexeme Similarity Score to Estimate Confidence in End-to-End ASR [1.8477401359673709]
Class-probability-based confidence scores do not accurately represent the quality of overconfident ASR predictions.
We propose a novel Temporal-Lexeme Similarity (TeLeS) confidence score to train a Confidence Estimation Model (CEM).
We conduct experiments with ASR models trained in three languages, namely Hindi, Tamil, and Kannada, with varying training data sizes.
(2024-01-06T16:29:13Z)
- A Meta-Learning Approach to Predicting Performance and Data Requirements [163.4412093478316]
We propose an approach to estimate the number of samples required for a model to reach a target performance.
We find that the power law, the de facto principle to estimate model performance, leads to large error when using a small dataset.
We introduce a novel piecewise power law (PPL) that handles the two data regimes differently.
(2023-03-02T21:48:22Z)
- H_eval: A new hybrid evaluation metric for automatic speech recognition tasks [0.3277163122167433]
We propose H_eval, a new hybrid evaluation metric for ASR systems.
It considers both semantic correctness and error rate, and performs well in scenarios where WER and SD perform poorly.
(2022-11-03T11:23:36Z)
- Multiple-hypothesis RNN-T Loss for Unsupervised Fine-tuning and Self-training of Neural Transducer [20.8850874806462]
This paper proposes a new approach to perform unsupervised fine-tuning and self-training using unlabeled speech data.
For the fine-tuning task, ASR models are trained using supervised data from Wall Street Journal (WSJ), Aurora-4 along with CHiME-4 real noisy data as unlabeled data.
For the self-training task, ASR models are trained using supervised data from Wall Street Journal (WSJ), Aurora-4 along with CHiME-4 real noisy data as unlabeled data.
(2022-07-29T15:14:03Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experimental results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
(2021-12-10T20:47:58Z)
- VSAC: Efficient and Accurate Estimator for H and F [68.65610177368617]
VSAC is a RANSAC-type robust estimator with a number of novelties.
It is significantly faster than all its predecessors and runs on average in 1-2 ms, on a CPU.
It is two orders of magnitude faster and yet as precise as MAGSAC++, the currently most accurate estimator of two-view geometry.
(2021-06-18T17:04:57Z)
- Fast Uncertainty Quantification for Deep Object Pose Estimation [91.09217713805337]
Deep learning-based object pose estimators are often unreliable and overconfident.
In this work, we propose a simple, efficient, and plug-and-play UQ method for 6-DoF object pose estimation.
(2020-11-16T06:51:55Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
(2020-05-14T17:24:57Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
(2020-04-09T09:26:42Z)