Related papers: BLSTM-Based Confidence Estimation for End-to-End Speech Recognition

URL: http://arxiv.org/abs/2312.14609v1
Date: Fri, 22 Dec 2023 11:12:45 GMT
Title: BLSTM-Based Confidence Estimation for End-to-End Speech Recognition
Authors: Atsunori Ogawa, Naohiro Tawara, Takatomo Kano, Marc Delcroix
Abstract summary: Confidence estimation is an important function for developing automatic speech recognition (ASR) applications. Recent E2E ASR systems show high performance (e.g., around 5% token error rates) for various ASR tasks. We employ a bidirectional long short-term memory (BLSTM)-based model as a strong binary-class (correct/incorrect) sequence labeler.
Score: 41.423717224691046
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Confidence estimation, in which we estimate the reliability of each recognized token (e.g., word, sub-word, and character) in automatic speech recognition (ASR) hypotheses and detect incorrectly recognized tokens, is an important function for developing ASR applications. In this study, we perform confidence estimation for end-to-end (E2E) ASR hypotheses. Recent E2E ASR systems show high performance (e.g., around 5% token error rates) for various ASR tasks. In such situations, confidence estimation becomes difficult since we need to detect infrequent incorrect tokens from mostly correct token sequences. To tackle this imbalanced dataset problem, we employ a bidirectional long short-term memory (BLSTM)-based model as a strong binary-class (correct/incorrect) sequence labeler that is trained with a class balancing objective. We experimentally confirmed that, by utilizing several types of ASR decoding scores as its auxiliary features, the model steadily shows high confidence estimation performance under highly imbalanced settings. We also confirmed that the BLSTM-based model outperforms Transformer-based confidence estimation models, which greatly underestimate incorrect tokens.
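The abstract mentions a class balancing objective but does not give its form. As a rough illustration only, the sketch below uses inverse-frequency class weights in a binary cross-entropy loss, which is one common way to counter a correct/incorrect imbalance. The function name class_balanced_bce and the weighting scheme are assumptions for this example, not the authors' formulation.

```python
import math


def class_balanced_bce(probs, labels, eps=1e-12):
    """Binary cross-entropy with inverse-frequency class weights.

    Hypothetical helper (an assumption, not taken from the paper).
    probs  : predicted probability that each token is correct (label 1)
    labels : 1 for a correctly recognized token, 0 for an incorrect one
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    # Each class gets half of the total weight, however rare it is.
    w_pos = n / (2.0 * n_pos) if n_pos else 0.0
    w_neg = n / (2.0 * n_neg) if n_neg else 0.0
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            total += -w_pos * math.log(p)
        else:
            total += -w_neg * math.log(1.0 - p)
    return total / n


# 19 correct tokens and 1 incorrect token (about 5% token error rate),
# with the model predicting "correct" at probability 0.9 everywhere.
labels = [1] * 19 + [0]
probs = [0.9] * 20
loss = class_balanced_bce(probs, labels)
```

With these weights the single incorrect token contributes as much to the loss as the 19 correct tokens combined, so a labeler that always predicts "correct" is penalized heavily instead of scoring well by default.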

Related papers

Energy Score-based Pseudo-Label Filtering and Adaptive Loss for Imbalanced Semi-supervised SAR target recognition [1.2035771704626825]
Existing semi-supervised SAR ATR algorithms show low recognition accuracy in the case of class imbalance. This work offers a non-balanced semi-supervised SAR target recognition approach using dynamic energy scores and adaptive loss.
arXiv Detail & Related papers (2024-11-06T14:45:16Z)
TeLeS: Temporal Lexeme Similarity Score to Estimate Confidence in End-to-End ASR [1.8477401359673709]
Class-probability-based confidence scores do not accurately represent the quality of overconfident ASR predictions. We propose a novel Temporal-Lexeme Similarity (TeLeS) confidence score to train a Confidence Estimation Model (CEM). We conduct experiments with ASR models trained in three languages, namely Hindi, Tamil, and Kannada, with varying training data sizes.
arXiv Detail & Related papers (2024-01-06T16:29:13Z)
Binary Classification with Confidence Difference [100.08818204756093]
This paper delves into a novel weakly supervised binary classification problem called confidence-difference (ConfDiff) classification. We propose a risk-consistent approach to tackle this problem and show that the estimation error bound achieves the optimal convergence rate. We also introduce a risk correction approach to mitigate overfitting problems, whose consistency and convergence rate are also proven.
arXiv Detail & Related papers (2023-10-09T11:44:50Z)
A Confidence-based Partial Label Learning Model for Crowd-Annotated Named Entity Recognition [74.79785063365289]
Existing models for named entity recognition (NER) are mainly based on large-scale labeled datasets. We propose a Confidence-based Partial Label Learning (CPLL) method to integrate the prior confidence (given by annotators) and posterior confidences (learned by models) for crowd-annotated NER.
arXiv Detail & Related papers (2023-05-21T15:31:23Z)
Accurate and Reliable Confidence Estimation Based on Non-Autoregressive End-to-End Speech Recognition System [42.569506907182706]
Previous end-to-end (E2E) based confidence estimation models (CEM) predict score sequences of equal length with input transcriptions, leading to unreliable estimation when deletion and insertion errors occur. We propose a CIF-Aligned confidence estimation model (CA-CEM) to achieve accurate and reliable confidence estimation based on a novel non-autoregressive E2E ASR model, Paraformer.
arXiv Detail & Related papers (2023-05-18T03:34:50Z)
Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition [25.595147432155642]
This paper proposes two approaches to improve the model-based confidence estimators on out-of-domain data. Experiments show that the proposed methods can significantly improve the confidence metrics on TED-LIUM and Switchboard datasets.
arXiv Detail & Related papers (2021-10-07T10:44:27Z)
Learning Word-Level Confidence For Subword End-to-End ASR [48.09713798451474]
We study the problem of word-level confidence estimation in subword-based end-to-end (E2E) models for automatic speech recognition (ASR). The proposed confidence module also enables a model selection approach to combine an on-device E2E model with a hybrid model on the server to address the rare word recognition problem for the E2E model.
arXiv Detail & Related papers (2021-03-11T15:03:33Z)
Don't Just Blame Over-parametrization for Over-confidence: Theoretical Analysis of Calibration in Binary Classification [58.03725169462616]
We show theoretically that over-parametrization is not the only reason for over-confidence. We prove that logistic regression is inherently over-confident, in the realizable, under-parametrized setting. Perhaps surprisingly, we also show that over-confidence is not always the case.
arXiv Detail & Related papers (2021-02-15T21:38:09Z)
An evaluation of word-level confidence estimation for end-to-end automatic speech recognition [70.61280174637913]
We investigate confidence estimation for end-to-end automatic speech recognition (ASR). We provide an extensive benchmark of popular confidence methods on four well-known speech datasets. Our results suggest a strong baseline can be obtained by scaling the logits by a learnt temperature.
arXiv Detail & Related papers (2021-01-14T09:51:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.