WER-BERT: Automatic WER Estimation with BERT in a Balanced Ordinal
Classification Paradigm
- URL: http://arxiv.org/abs/2101.05478v2
- Date: Sat, 13 Feb 2021 15:18:19 GMT
- Title: WER-BERT: Automatic WER Estimation with BERT in a Balanced Ordinal
Classification Paradigm
- Authors: Akshay Krishna Sheshadri, Anvesh Rao Vijjini, Sukhdeep Kharbanda
- Abstract summary: We propose a new balanced paradigm for e-WER in a classification setting.
Within this paradigm, we also propose WER-BERT, a BERT based architecture with speech features for e-WER.
The results and experiments demonstrate that WER-BERT establishes a new state-of-the-art in automatic WER estimation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Speech Recognition (ASR) systems are evaluated using Word Error
Rate (WER), which is calculated by comparing the number of errors between the
ground truth and the transcription of the ASR system. This calculation,
however, requires manual transcription of the speech signal to obtain the
ground truth. Since transcribing audio signals is a costly process, Automatic
WER Evaluation (e-WER) methods have been developed to automatically predict the
WER of a speech system by only relying on the transcription and the speech
signal features. While WER is a continuous variable, previous works have shown
that positing e-WER as a classification problem is more effective than
regression. However, while converting to a classification setting, these
approaches suffer from heavy class imbalance. In this paper, we propose a new
balanced paradigm for e-WER in a classification setting. Within this paradigm,
we also propose WER-BERT, a BERT based architecture with speech features for
e-WER. Furthermore, we introduce a distance loss function to tackle the ordinal
nature of e-WER classification. The proposed approach and paradigm are
evaluated on the Librispeech dataset and a commercial (black box) ASR system,
Google Cloud's Speech-to-Text API. The results and experiments demonstrate that
WER-BERT establishes a new state-of-the-art in automatic WER estimation.
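A minimal sketch of the quantities the abstract describes: WER as an edit-distance ratio, WER buckets cut at quantiles of the training distribution so the classes stay balanced, and a cross-entropy loss augmented with a class-distance penalty for the ordinal setting. The quantile binning and the exact form of the distance loss are assumptions for illustration; the abstract names these ideas but not their formulas.

```python
import numpy as np


def wer(reference: list, hypothesis: list) -> float:
    """WER = (substitutions + deletions + insertions) / len(reference),
    computed with word-level Levenshtein distance."""
    d = np.zeros((len(reference) + 1, len(hypothesis) + 1), dtype=int)
    d[:, 0] = np.arange(len(reference) + 1)   # all deletions
    d[0, :] = np.arange(len(hypothesis) + 1)  # all insertions
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            sub = d[i - 1, j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / len(reference)


def balanced_bins(train_wers: np.ndarray, n_classes: int) -> np.ndarray:
    """Assumed balanced binning: cut class boundaries at quantiles of the
    training WER distribution, so each class holds roughly the same number
    of utterances."""
    edges = np.quantile(train_wers, np.linspace(0, 1, n_classes + 1)[1:-1])
    return np.digitize(train_wers, edges)


def distance_loss(probs: np.ndarray, target: int) -> float:
    """Assumed ordinal loss: cross-entropy plus the expected normalized
    distance between the predicted and true class, so predicting a bucket
    next to the true WER costs less than predicting a distant one."""
    n = len(probs)
    ce = -np.log(probs[target] + 1e-12)
    dist = np.abs(np.arange(n) - target) / (n - 1)
    return float(ce + probs @ dist)


print(wer("the cat sat on the mat".split(),
          "the cat sit on mat".split()))  # 1 sub + 1 del over 6 words -> 0.333
```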
Related papers
- Automatic Speech Recognition System-Independent Word Error Rate Estimation [23.25173244408922]
Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems.
In this paper, a hypothesis generation method for ASR System-Independent WER estimation is proposed.
arXiv Detail & Related papers (2024-04-25T16:57:05Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with a reasonable prompt and their generative capability can even correct tokens that are missing from the N-best list.
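This setup lends itself to a short illustration. The sketch below assembles a correction prompt over an ASR N-best list and hands it to a generic completion function; `call_llm`, the prompt wording, and the output handling are hypothetical placeholders, not the benchmark's actual interface.

```python
def build_correction_prompt(nbest: list) -> str:
    """Hypothetical prompt format: list the N-best hypotheses and ask for
    the most likely true transcription."""
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        "Below are the N-best hypotheses from a speech recognizer for one "
        "utterance. Report the most likely true transcription, fixing errors "
        "even if the correct words appear in none of the hypotheses.\n"
        f"{hyps}\nTranscription:"
    )


def correct_transcript(nbest: list, call_llm) -> str:
    # `call_llm` is a placeholder for any text-completion API.
    return call_llm(build_correction_prompt(nbest)).strip()
```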
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Zero-Shot Automatic Pronunciation Assessment [19.971348810774046]
We propose a novel zero-shot APA method based on the pre-trained acoustic model, HuBERT.
Experimental results on speechocean762 demonstrate that the proposed method achieves comparable performance to supervised regression baselines.
arXiv Detail & Related papers (2023-05-31T05:17:17Z)
- Explanations for Automatic Speech Recognition [9.810810252231812]
We provide an explanation for an ASR transcription as a subset of audio frames.
We adapt existing explainable AI techniques from image classification: Statistical Fault Localisation (SFL) and Causal analysis.
We evaluate the quality of the explanations generated by the proposed techniques over three different ASR systems (the Google API, the baseline model of Sphinx, and Deepspeech) and 100 audio samples from the Commonvoice dataset.
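As a rough illustration of the SFL side, the sketch below masks random subsets of audio frames, re-transcribes each masked signal, and scores every frame with the standard Ochiai suspiciousness measure. The masking scheme and the `transcribe_keeps_word` callback are assumptions standing in for the paper's actual setup.

```python
import numpy as np


def sfl_frame_scores(n_frames: int, n_trials: int, keep_prob: float,
                     transcribe_keeps_word, rng=None) -> np.ndarray:
    """Statistical Fault Localisation over audio frames (sketch).

    `transcribe_keeps_word(mask)` is a hypothetical callback: it runs the ASR
    on the audio with only the masked-in frames and reports whether the word
    of interest still appears in the transcription.
    """
    rng = rng or np.random.default_rng(0)
    kept = np.zeros((n_trials, n_frames), dtype=bool)
    produced = np.zeros(n_trials, dtype=bool)
    for t in range(n_trials):
        kept[t] = rng.random(n_frames) < keep_prob   # random frame subset
        produced[t] = transcribe_keeps_word(kept[t])
    # Ochiai suspiciousness per frame: ef / sqrt((ef + nf) * (ef + ep))
    ef = (kept & produced[:, None]).sum(0)    # frame kept, word produced
    ep = (kept & ~produced[:, None]).sum(0)   # frame kept, word lost
    nf = (~kept & produced[:, None]).sum(0)   # frame masked, word produced
    denom = np.sqrt((ef + nf) * (ef + ep))
    return np.where(denom > 0, ef / np.maximum(denom, 1), 0.0)
```

Frames whose presence correlates most strongly with the word being produced receive the highest scores and form the explanation subset.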
arXiv Detail & Related papers (2023-02-27T11:09:19Z)
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
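Power-set encoding is the ingredient that turns overlapped diarization into single-label classification: every subset of active speakers becomes one class. A minimal sketch, assuming a fixed maximum speaker count; the mapping below is illustrative, not SEND's exact implementation.

```python
def encode_powerset(active: list, n_speakers: int) -> int:
    """Map a set of active speakers to one class index: bit i <-> speaker i.
    With n_speakers=3 there are 2**3 classes: silence -> 0, {0} -> 1, {0,2} -> 5."""
    assert all(0 <= s < n_speakers for s in active)
    label = 0
    for s in active:
        label |= 1 << s
    return label


def decode_powerset(label: int, n_speakers: int) -> list:
    """Inverse mapping: class index back to the set of active speakers."""
    return [s for s in range(n_speakers) if (label >> s) & 1]


# Frames where speakers 0 and 2 overlap become the single label 5, so a
# softmax over 2**n_speakers classes handles overlap without multi-label outputs.
assert encode_powerset([0, 2], 3) == 5 and decode_powerset(5, 3) == [0, 2]
```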
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Hallucination of speech recognition errors with sequence to sequence learning [16.39332236910586]
When plain text data is to be used to train systems for spoken language understanding or ASR, a proven strategy is to hallucinate what the ASR outputs would be given a gold transcription.
We present novel end-to-end models to directly predict hallucinated ASR word sequence outputs, conditioning on an input word sequence as well as a corresponding phoneme sequence.
This improves prior published results for recall of errors from an in-domain ASR system's transcription of unseen data, as well as an out-of-domain ASR system's transcriptions of audio from an unrelated task.
arXiv Detail & Related papers (2021-03-23T02:09:39Z)
- End to End ASR System with Automatic Punctuation Insertion [0.0]
We propose a method to generate punctuated transcript for the TEDLIUM dataset using transcripts available from ted.com.
We also propose an end-to-end ASR system that outputs words and punctuations concurrently from speech signals.
arXiv Detail & Related papers (2020-12-03T15:46:43Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)