Residual Energy-Based Models for End-to-End Speech Recognition
- URL: http://arxiv.org/abs/2103.14152v1
- Date: Thu, 25 Mar 2021 22:08:00 GMT
- Title: Residual Energy-Based Models for End-to-End Speech Recognition
- Authors: Qiujia Li, Yu Zhang, Bo Li, Liangliang Cao, Philip C. Woodland
- Abstract summary: A residual energy-based model (R-EBM) is proposed to complement the auto-regressive ASR model.
Experiments on a 100-hour LibriSpeech dataset show that R-EBMs can reduce word error rates (WERs) by 8.2%/6.7% on the test-clean/test-other sets.
On a state-of-the-art model using self-supervised learning (wav2vec 2.0), R-EBMs still significantly improve both the WER and the confidence estimation performance.
- Score: 26.852537542649866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end models with auto-regressive decoders have shown impressive results
for automatic speech recognition (ASR). These models formulate the
sequence-level probability as a product of the conditional probabilities of all
individual tokens given their histories. However, the performance of locally
normalised models can be sub-optimal because of factors such as exposure bias.
Consequently, the model distribution differs from the underlying data
distribution. In this paper, the residual energy-based model (R-EBM) is
proposed to complement the auto-regressive ASR model to close the gap between
the two distributions. Meanwhile, R-EBMs can also be regarded as
utterance-level confidence estimators, which may benefit many downstream tasks.
Experiments on a 100hr LibriSpeech dataset show that R-EBMs can reduce the word
error rates (WERs) by 8.2%/6.7% while improving areas under precision-recall
curves of confidence scores by 12.6%/28.4% on test-clean/test-other sets.
Furthermore, on a state-of-the-art model using self-supervised learning
(wav2vec 2.0), R-EBMs still significantly improve both the WER and confidence
estimation performance.
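To make the mechanism concrete, here is a minimal sketch, assuming a PyTorch setup, of residual energy-based rescoring of an n-best list. The scorer architecture, the pooled joint encoding, and the combination weight `alpha` are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of residual energy-based rescoring. The scorer
# architecture, the pooled joint encoding, and the weight `alpha`
# are illustrative assumptions, not the authors' exact design.
import torch
import torch.nn as nn

class ResidualEnergy(nn.Module):
    """Assigns a scalar energy to an (utterance, hypothesis) pair."""
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (n_best, hidden_dim), a joint encoding of the audio
        # and each candidate transcription.
        return self.scorer(pooled).squeeze(-1)  # (n_best,)

def rescore_nbest(ar_log_probs: torch.Tensor,
                  energies: torch.Tensor,
                  alpha: float = 1.0) -> int:
    # Combined (unnormalised) residual score; lower energy is better
    # under the convention p(y|x) ~ p_AR(y|x) * exp(-E(x, y)).
    combined = ar_log_probs - alpha * energies
    return int(torch.argmax(combined).item())

# Example: rescoring a 4-best list with placeholder encodings.
ar_scores = torch.tensor([-12.3, -12.9, -13.1, -14.0])  # AR log-probs
energies = ResidualEnergy()(torch.randn(4, 256))
best_idx = rescore_nbest(ar_scores, energies)
```

Because the energy scores a whole hypothesis, the same scalar output can also be read (e.g. through a sigmoid) as an utterance-level confidence score, which is the dual use the abstract describes.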
Related papers
- Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference [0.0]
We present Entropy Adaptive Decoding (EAD), a novel approach for efficient language model inference.
EAD switches between different-sized models based on prediction uncertainty (a minimal sketch of this switching rule appears after this list).
We show remarkable efficiency gains across different model families.
arXiv Detail & Related papers (2025-02-05T22:15:21Z)
- Supervised Score-Based Modeling by Gradient Boosting [49.556736252628745]
We propose a Supervised Score-based Model (SSM), which can be viewed as a gradient boosting algorithm combined with score matching.
We provide a theoretical analysis of learning and sampling for SSM to balance inference time and prediction accuracy.
Our model outperforms existing models in both accuracy and inference time.
arXiv Detail & Related papers (2024-11-02T07:06:53Z)
- Energy-Based Diffusion Language Models for Text Generation [126.23425882687195]
The Energy-based Diffusion Language Model (EDLM) is an energy-based model operating at the full sequence level for each diffusion step.
Our framework offers a 1.3x sampling speedup over existing diffusion models.
arXiv Detail & Related papers (2024-10-28T17:25:56Z)
- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [67.9215891673174]
We propose score entropy as a novel loss that naturally extends score matching to discrete spaces.
We test our Score Entropy Discrete Diffusion models on standard language modeling tasks.
arXiv Detail & Related papers (2023-10-25T17:59:12Z)
- Wavelet-Based Hybrid Machine Learning Model for Out-of-distribution Internet Traffic Prediction [3.689539481706835]
This paper investigates the performance of machine learning models including eXtreme Gradient Boosting, Light Gradient Boosting Machine, Gradient Descent, Gradient Boosting Regressor, and CatBoost Regressor.
We propose a hybrid machine learning model integrating wavelet decomposition for improving out-of-distribution prediction.
arXiv Detail & Related papers (2022-05-09T14:34:42Z)
- Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition [25.595147432155642]
This paper proposes two approaches to improve the model-based confidence estimators on out-of-domain data.
Experiments show that the proposed methods can significantly improve the confidence metrics on TED-LIUM and Switchboard datasets.
arXiv Detail & Related papers (2021-10-07T10:44:27Z)
- Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition [55.362258027878966]
We present momentum pseudo-labeling (MPL) as a simple yet effective strategy for semi-supervised speech recognition.
MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method (a sketch of this momentum update appears after this list).
The experimental results demonstrate that MPL effectively improves over the base model and is scalable to different semi-supervised scenarios.
arXiv Detail & Related papers (2021-06-16T16:24:55Z)
- Anomaly Detection of Time Series with Smoothness-Inducing Sequential Variational Auto-Encoder [59.69303945834122]
We present a Smoothness-Inducing Sequential Variational Auto-Encoder (SISVAE) model for robust estimation and anomaly detection of time series.
Our model parameterizes mean and variance for each time-stamp with flexible neural networks.
We show the effectiveness of our model on both synthetic datasets and public real-world benchmarks.
arXiv Detail & Related papers (2021-02-02T06:15:15Z)
- On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer [40.63693071222628]
We study minimum word error rate (MWER) training of the Hybrid Autoregressive Transducer (HAT).
From experiments with around 30,000 hours of training data, we show that MWER training can improve the accuracy of HAT models.
arXiv Detail & Related papers (2020-10-23T21:16:30Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on the RNN-Transducer with improved beam search, is only 3.8% absolute WER worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
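The Entropy Adaptive Decoding entry above refers here: a minimal sketch, based only on the one-line summary, of switching decoders on prediction uncertainty. The entropy threshold and the small/large two-model setup are assumptions, not the EAD paper's actual algorithm.

```python
# Hypothetical sketch of entropy-based model switching; the threshold
# and the two-model setup are assumptions drawn from the summary only.
import torch

def token_entropy(logits: torch.Tensor) -> float:
    """Shannon entropy (in nats) of the next-token distribution."""
    log_p = torch.log_softmax(logits, dim=-1)
    return float(-(log_p.exp() * log_p).sum())

def choose_model(small_model_logits: torch.Tensor,
                 threshold: float = 2.0) -> str:
    # Keep decoding with the small model while it is confident; hand
    # the step to the large model when predictive entropy is high.
    return "large" if token_entropy(small_model_logits) > threshold else "small"

print(choose_model(torch.tensor([8.0, 0.1, 0.1, 0.1])))  # peaked -> "small"
```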
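Likewise for the Momentum Pseudo-Labeling entry: a minimal sketch of a mean-teacher-style momentum update between an online and an offline model. The decay constant is an assumption, not MPL's reported setting.

```python
# Hypothetical sketch of the momentum (EMA) update behind
# mean-teacher-style training; the decay value is an assumption.
import torch
import torch.nn as nn

@torch.no_grad()
def momentum_update(online: nn.Module, offline: nn.Module,
                    decay: float = 0.999) -> None:
    """In place: offline <- decay * offline + (1 - decay) * online."""
    for p_off, p_on in zip(offline.parameters(), online.parameters()):
        p_off.mul_(decay).add_(p_on, alpha=1.0 - decay)

# Example: the offline model drifts slowly toward the online model;
# its outputs then serve as pseudo-labels for unlabelled speech.
online, offline = nn.Linear(4, 2), nn.Linear(4, 2)
momentum_update(online, offline)
```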
This list is automatically generated from the titles and abstracts of the papers on this site.