NoRefER: a Referenceless Quality Metric for Automatic Speech Recognition
via Semi-Supervised Language Model Fine-Tuning with Contrastive Learning
- URL: http://arxiv.org/abs/2306.12577v1
- Date: Wed, 21 Jun 2023 21:26:19 GMT
- Title: NoRefER: a Referenceless Quality Metric for Automatic Speech Recognition
via Semi-Supervised Language Model Fine-Tuning with Contrastive Learning
- Authors: Kamer Ali Yuksel, Thiago Ferreira, Golara Javadi, Mohamed
El-Badrashiny, Ahmet Gunduz
- Abstract summary: NoRefER is a novel referenceless quality metric for automatic speech recognition (ASR) systems.
NoRefER exploits the known quality relationships between hypotheses from multiple compression levels of an ASR model to learn to rank intra-sample hypotheses by quality.
The results show that NoRefER correlates highly with reference-based metrics and their intra-sample ranks, indicating a high potential for referenceless ASR evaluation or A/B testing.
- Score: 0.20999222360659603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces NoRefER, a novel referenceless quality metric for
automatic speech recognition (ASR) systems. Traditional reference-based metrics
for evaluating ASR systems require costly ground-truth transcripts. NoRefER
overcomes this limitation by fine-tuning a multilingual language model for
pairwise ranking of ASR hypotheses using contrastive learning with a Siamese
network architecture. The self-supervised NoRefER exploits the known quality
relationships between hypotheses from multiple compression levels of an ASR model for
learning to rank intra-sample hypotheses by quality, which is essential for
model comparisons. The semi-supervised version also uses a referenced dataset
to improve its inter-sample quality ranking, which is crucial for selecting
potentially erroneous samples. The results show that NoRefER correlates
highly with reference-based metrics and their intra-sample ranks, indicating a
high potential for referenceless ASR evaluation or A/B testing.
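The self-supervised pairing and pairwise ranking described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy bag-of-words scorer stands in for the fine-tuned multilingual language model, the logistic pairwise loss is an assumed choice (the abstract does not name the loss), and the hypotheses, function names, and hyperparameters are invented for the example. The one idea taken from the abstract is that hypotheses from a less compressed ASR model are treated as better than those from a more compressed one, and that the same scorer is applied to both sides of each pair (the Siamese setup).

```python
import math
from itertools import combinations


def make_pairs(hyps_by_level):
    """Build (better, worse) pairs for ONE utterance.

    hyps_by_level is ordered from the least compressed ASR model
    (assumed best) to the most compressed (assumed worst).
    Identical hypotheses carry no ranking signal and are skipped.
    """
    pairs = []
    for i, j in combinations(range(len(hyps_by_level)), 2):
        better, worse = hyps_by_level[i], hyps_by_level[j]
        if better != worse:
            pairs.append((better, worse))
    return pairs


def features(text):
    # Toy bag-of-words features; a stand-in for a language model encoder.
    counts = {}
    for tok in text.lower().split():
        counts[tok] = counts.get(tok, 0.0) + 1.0
    return counts


def score(text, weights):
    # Shared scorer: the same weights score both members of a pair.
    return sum(weights.get(tok, 0.0) * c for tok, c in features(text).items())


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pair_loss(better, worse, weights):
    # Logistic pairwise loss log(1 + exp(-margin)), computed stably.
    margin = score(better, weights) - score(worse, weights)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def train(pairs, epochs=50, lr=0.1):
    # Plain SGD on the pairwise loss (hypothetical hyperparameters).
    weights = {}
    for _ in range(epochs):
        for better, worse in pairs:
            margin = score(better, weights) - score(worse, weights)
            step = lr * sigmoid(-margin)  # equals -lr * dLoss/dMargin
            fb, fw = features(better), features(worse)
            for tok in set(fb) | set(fw):
                diff = fb.get(tok, 0.0) - fw.get(tok, 0.0)
                weights[tok] = weights.get(tok, 0.0) + step * diff
    return weights


# Invented hypotheses for one utterance, least to most compressed model.
hyps = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cat sad on mat",
]
pairs = make_pairs(hyps)
weights = train(pairs)
for h in hyps:
    print(round(score(h, weights), 3), h)
```

At inference time only the scorer is needed: each hypothesis gets a scalar score without any reference transcript, and hypotheses of the same utterance can be ranked by that score.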
Related papers
- Word-Level ASR Quality Estimation for Efficient Corpus Sampling and
Post-Editing through Analyzing Attentions of a Reference-Free Metric [5.592917884093537]
The potential of quality estimation (QE) metrics is introduced and evaluated as a novel tool to enhance explainable artificial intelligence (XAI) in ASR systems.
The capabilities of the NoRefER metric are explored in identifying word-level errors to aid post-editors in refining ASR hypotheses.
arXiv Detail & Related papers (2024-01-20T16:48:55Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can use their generative capability to correct even tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive
and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- HypR: A comprehensive study for ASR hypothesis revising with a reference corpus [10.173199736362486]
This study provides an ASR hypothesis revising (HypR) dataset.
HypR contains several commonly used corpora and provides 50 recognition hypotheses for each speech utterance.
In addition, we implement and compare several classic and representative methods, showing the recent research progress in revising speech recognition results.
arXiv Detail & Related papers (2023-09-18T14:55:21Z)
- A Reference-less Quality Metric for Automatic Speech Recognition via
Contrastive-Learning of a Multi-Language Model with Self-Supervision [0.20999222360659603]
This work proposes a referenceless quality metric, which allows comparing the performance of different ASR models on a speech dataset without ground truth transcriptions.
To estimate the quality of ASR hypotheses, a pre-trained language model (LM) is fine-tuned with contrastive learning in a self-supervised learning manner.
The proposed referenceless metric obtains a much higher correlation with WER scores and their ranks than the perplexity metric from the state-of-the-art multilingual LM in all experiments.
arXiv Detail & Related papers (2023-06-21T21:33:39Z)
- Factual Consistency Oriented Speech Recognition [23.754107608608106]
The proposed framework optimizes the ASR model to maximize an expected factual consistency score between ASR hypotheses and ground-truth transcriptions.
It is shown that training the ASR models with the proposed framework improves the speech summarization quality as measured by the factual consistency of meeting conversation summaries.
arXiv Detail & Related papers (2023-02-24T00:01:41Z)
- Learning Transformer Features for Image Quality Assessment [53.51379676690971]
We propose a unified IQA framework that utilizes a CNN backbone and a transformer encoder to extract features.
The proposed framework is compatible with both FR and NR modes and allows for a joint training scheme.
arXiv Detail & Related papers (2021-12-01T13:23:00Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- A Correspondence Variational Autoencoder for Unsupervised Acoustic Word
Embeddings [50.524054820564395]
We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation.
The resulting acoustic word embeddings can form the basis of search, discovery, and indexing systems for low- and zero-resource languages.
arXiv Detail & Related papers (2020-12-03T19:24:42Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and subsequent LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.