A Reference-less Quality Metric for Automatic Speech Recognition via
Contrastive-Learning of a Multi-Language Model with Self-Supervision
- URL: http://arxiv.org/abs/2306.13114v1
- Date: Wed, 21 Jun 2023 21:33:39 GMT
- Title: A Reference-less Quality Metric for Automatic Speech Recognition via
Contrastive-Learning of a Multi-Language Model with Self-Supervision
- Authors: Kamer Ali Yuksel, Thiago Ferreira, Ahmet Gunduz, Mohamed
Al-Badrashiny, Golara Javadi
- Abstract summary: This work proposes a referenceless quality metric, which allows comparing the performance of different ASR models on a speech dataset without ground truth transcriptions.
To estimate the quality of ASR hypotheses, a pre-trained language model (LM) is fine-tuned with contrastive learning in a self-supervised manner.
The proposed referenceless metric obtains a much higher correlation with WER scores and their ranks than the perplexity metric from a state-of-the-art multilingual LM in all experiments.
- Score: 0.20999222360659603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The common standard for quality evaluation of automatic speech recognition
(ASR) systems is reference-based metrics such as the Word Error Rate (WER),
computed using manual ground-truth transcriptions that are time-consuming and
expensive to obtain. This work proposes a multi-language referenceless quality
metric, which allows comparing the performance of different ASR models on a
speech dataset without ground truth transcriptions. To estimate the quality of
ASR hypotheses, a pre-trained language model (LM) is fine-tuned with
contrastive learning in a self-supervised manner. In experiments
conducted on several unseen test datasets consisting of outputs from top
commercial ASR engines in various languages, the proposed referenceless metric
obtains a much higher correlation with WER scores and their ranks than the
perplexity metric from a state-of-the-art multilingual LM in all experiments,
and also reduces WER by more than $7\%$ when used for ensembling hypotheses.
The fine-tuned model and experiments are made available for reproducibility:
https://github.com/aixplain/NoRefER
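As a rough illustration of the approach described above (not the authors' exact recipe), the sketch below fine-tunes a stand-in multilingual encoder with a pairwise margin-ranking loss on (presumed-better, presumed-worse) hypothesis pairs; the model name, margin, and toy pair are assumptions:

```python
# Sketch: fine-tune a multilingual LM to score ASR hypotheses with a
# pairwise (contrastive) ranking objective. Model name, margin, and the
# toy pair below are illustrative assumptions, not the paper's exact setup.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

MODEL = "xlm-roberta-base"  # stand-in multilingual LM
tok = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)
head = nn.Linear(encoder.config.hidden_size, 1)  # scalar quality score

def score(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state[:, 0]  # [CLS] pooling
    return head(hidden).squeeze(-1)

# Self-supervised pairs: (hypothesis presumed better, hypothesis presumed
# worse), e.g. outputs of a larger vs. a more compressed ASR model.
pairs = [
    ("the quick brown fox jumps over the lazy dog",
     "the quick brown fox jump over lazy dog"),
]

loss_fn = nn.MarginRankingLoss(margin=0.5)
optim = torch.optim.AdamW(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-5)

for better, worse in pairs:
    s = score([better, worse])
    # target=1 asks the model to rank the first score above the second
    loss = loss_fn(s[0:1], s[1:2], torch.ones(1))
    optim.zero_grad()
    loss.backward()
    optim.step()
```

At inference time, the learned score supports the hypothesis-ensembling result mentioned in the abstract: score each engine's output for an utterance and keep the highest-scoring one.

```python
# Referenceless ensembling (illustrative): keep the best-scored hypothesis.
hyps = ["engine A transcript", "engine B transcript", "engine C transcript"]
best = hyps[int(torch.argmax(score(hyps)))]
```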
Related papers
- Automatic Speech Recognition System-Independent Word Error Rate Estimation [23.25173244408922]
Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems.
In this paper, a hypothesis generation method for ASR System-Independent WER estimation is proposed.
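For context, WER is the word-level Levenshtein distance between hypothesis and reference, normalized by reference length; estimators like the one above try to predict this number without access to the reference. A self-contained sketch of the reference-based computation:

```python
# Word Error Rate: (substitutions + insertions + deletions) / reference
# length, computed via word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.33
```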
arXiv Detail & Related papers (2024-04-25T16:57:05Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
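One common way to instantiate self-evaluation as token-level prediction, sketched here with an illustrative model and prompt rather than the paper's setup, is to read off the next-token probability of "yes" versus "no" after asking the model to judge its own answer:

```python
# Sketch of LLM self-evaluation as token-level prediction: ask the model
# whether its own answer is correct and read off P("yes") vs P("no").
# The model and prompt wording are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def self_eval(question: str, answer: str) -> float:
    prompt = (f"Question: {question}\nProposed answer: {answer}\n"
              "Is the proposed answer correct? Answer yes or no:")
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, -1]  # next-token distribution
    yes = logits[tok(" yes").input_ids[0]]
    no = logits[tok(" no").input_ids[0]]
    # probability of "yes" given that the answer is "yes" or "no"
    return torch.softmax(torch.stack([yes, no]), dim=0)[0].item()
```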
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can use their generative capability to correct even tokens that are missing from the N-best list.
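A minimal sketch of N-best correction with an instruction-following LM, in the spirit of this benchmark; the model choice and prompt wording are assumptions:

```python
# Sketch: ask an instruction-following LM to produce a corrected transcript
# from an N-best list. Model and prompt are illustrative assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
lm = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def correct(nbest: list[str]) -> str:
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    prompt = ("These are N-best hypotheses from a speech recognizer. "
              f"Output the most likely true transcription:\n{hyps}\n"
              "Transcription:")
    ids = tok(prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, max_new_tokens=64)
    return tok.decode(out[0], skip_special_tokens=True)

print(correct(["i scream for ice cream", "eye scream for ice cream"]))
```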
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- NoRefER: a Referenceless Quality Metric for Automatic Speech Recognition via Semi-Supervised Language Model Fine-Tuning with Contrastive Learning [0.20999222360659603]
NoRefER is a novel referenceless quality metric for automatic speech recognition (ASR) systems.
NoRefER exploits the known quality relationships between hypotheses produced at multiple compression levels of an ASR model to learn to rank intra-sample hypotheses by quality.
The results indicate that NoRefER correlates highly with reference-based metrics and their intra-sample ranks, indicating a high potential for referenceless ASR evaluation or A/B testing.
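The intra-sample ranking signal can be built without references by ordering hypotheses according to the compression level of the model that produced them; in this sketch the level names (Whisper-style sizes) and hypotheses are illustrative:

```python
# Sketch: derive (better, worse) training pairs from the known quality
# ordering across ASR compression levels, with no reference transcript.
# Level names and hypotheses are illustrative assumptions.
from itertools import combinations

# Hypotheses for ONE utterance, keyed by model size (larger = presumed better).
levels = ["large", "medium", "small", "base", "tiny"]
hyps = {
    "large": "she sells seashells by the seashore",
    "small": "she sells sea shells by the sea shore",
    "tiny": "she sell sea shell by the shore",
}

pairs = [
    (hyps[a], hyps[b])                   # (presumed better, presumed worse)
    for a, b in combinations(levels, 2)  # a precedes b => a is less compressed
    if a in hyps and b in hyps
]
```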
arXiv Detail & Related papers (2023-06-21T21:26:19Z)
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
- Evaluation of HTR models without Ground Truth Material [2.4792948967354236]
The evaluation of Handwritten Text Recognition (HTR) models during their development is straightforward.
But the evaluation process becomes tricky as soon as we switch from development to application.
We show that lexicon-based evaluation can compete with evaluation based on ground truth.
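A minimal sketch of a lexicon-based, reference-free quality proxy, assuming a plain word-membership score (the paper's method may differ):

```python
# Sketch: reference-free, lexicon-based quality proxy for recognized text,
# scoring the share of output tokens found in a known-word lexicon.
def lexicon_score(text: str, lexicon: set[str]) -> float:
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in lexicon for t in tokens) / len(tokens)

lexicon = {"the", "cat", "sat", "on", "mat"}  # illustrative lexicon
print(lexicon_score("the cqt sat on the mat", lexicon))  # 5/6 in-lexicon
```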
arXiv Detail & Related papers (2022-01-17T01:26:09Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
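A sketch of the unsupervised ad-hoc CLIR setup: mean-pooled embeddings from a multilingual encoder, ranked by cosine similarity; the encoder name is an assumption:

```python
# Sketch of unsupervised CLIR: embed query and documents with a multilingual
# encoder (mean pooling) and rank by cosine similarity. The model name is
# an illustrative assumption.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
enc = AutoModel.from_pretrained("xlm-roberta-base")

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)  # mean over real tokens

docs = ["Die Katze sitzt auf der Matte.", "El mercado cerró al alza."]
q = embed(["Where is the cat sitting?"])
d = embed(docs)
sims = torch.nn.functional.cosine_similarity(q, d)
print(docs[int(sims.argmax())])  # highest-similarity document
```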
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
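One way such self-learning can pick a training target from multiple hypotheses, shown here as an illustrative consensus rule rather than the paper's exact MTL recipe, is minimum-Bayes-risk-style selection of the hypothesis closest to all others:

```python
# Sketch: select a pseudo-label from multiple ASR hypotheses by picking the
# hypothesis closest (on average) to all others. The selection rule is an
# illustrative assumption, not the paper's exact method.
def word_dist(a: str, b: str) -> int:
    x, y = a.split(), b.split()
    dp = list(range(len(y) + 1))  # rolling-array Levenshtein over words
    for i, xa in enumerate(x, 1):
        prev, dp[0] = dp[0], i
        for j, yb in enumerate(y, 1):
            prev, dp[j] = dp[j], min(prev + (xa != yb), dp[j] + 1, dp[j - 1] + 1)
    return dp[len(y)]

def pseudo_label(hyps: list[str]) -> str:
    return min(hyps, key=lambda h: sum(word_dist(h, o) for o in hyps))

hyps = ["turn the lights off", "turn the light off", "turn delight off"]
print(pseudo_label(hyps))  # consensus hypothesis becomes the training target
```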
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- WER we are and WER we think we are [11.819335591315316]
We express skepticism towards the recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets.
We compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and on the HUB'05 public benchmark.
We formulate a set of guidelines which may aid in the creation of real-life, multi-domain datasets with high quality annotations for training and testing of robust ASR systems.
arXiv Detail & Related papers (2020-10-07T14:20:31Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and subsequent LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
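A toy sketch of the joint architecture: a shared encoder with one head for token-level ASR correction and one for utterance-level understanding; all sizes are illustrative:

```python
# Sketch of joint contextual modeling: one shared encoder feeds both an
# ASR-correction (token) head and a language-understanding (intent) head,
# so the two tasks regularize each other. Sizes are illustrative.
import torch
from torch import nn

class JointCorrectionLU(nn.Module):
    def __init__(self, vocab=10000, dim=256, intents=20):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.correct = nn.Linear(2 * dim, vocab)   # per-token corrected word
        self.intent = nn.Linear(2 * dim, intents)  # utterance-level intent

    def forward(self, tokens):                  # tokens: (batch, seq)
        h, _ = self.encoder(self.embed(tokens))  # (batch, seq, 2*dim)
        return self.correct(h), self.intent(h.mean(dim=1))

model = JointCorrectionLU()
tokens = torch.randint(0, 10000, (2, 12))  # a batch of tokenized ASR outputs
corr_logits, intent_logits = model(tokens)
# Train with summed losses: cross-entropy on corrected tokens + on intents.
```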
arXiv Detail & Related papers (2020-01-28T22:09:25Z)