Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric
- URL: http://arxiv.org/abs/2110.05376v1
- Date: Mon, 11 Oct 2021 16:09:01 GMT
- Title: Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric
- Authors: Suyoun Kim, Duc Le, Weiyi Zheng, Tarun Singh, Abhinav Arora, Xiaoyu Zhai, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer
- Abstract summary: Word Error Rate (WER) has been traditionally used to evaluate ASR system quality.
We propose evaluating the quality of ASR output hypotheses with SemDist, a metric that measures semantic correctness.
- Score: 22.884709676587377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Measuring automatic speech recognition (ASR) system quality is critical for creating user-satisfying voice-driven applications. Word Error Rate (WER) has traditionally been used to evaluate ASR system quality; however, it sometimes correlates poorly with user perception of transcription quality. This is because WER weighs every word equally and does not consider semantic correctness, which has a higher impact on user perception. In this work, we propose evaluating the quality of ASR output hypotheses with SemDist, which measures semantic correctness as the distance between the semantic vectors of the reference and the hypothesis extracted from a pre-trained language model. Our experimental results on 71K and 36K user-annotated ASR outputs show that SemDist achieves higher correlation with user perception than WER. We also show that SemDist correlates more strongly with downstream NLU tasks than WER.
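The core idea is straightforward to illustrate. Below is a minimal sketch, not the authors' implementation: it assumes the sentence-transformers package is available, uses the all-MiniLM-L6-v2 model as an illustrative stand-in for the paper's pre-trained encoder, and computes WER with a plain word-level edit distance.

```python
# Sketch of a SemDist-style comparison against WER.
# Assumptions (not from the paper): sentence-transformers is installed and
# "all-MiniLM-L6-v2" stands in for the paper's pre-trained language model.
from sentence_transformers import SentenceTransformer, util


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def sem_dist(reference: str, hypothesis: str, model: SentenceTransformer) -> float:
    """SemDist-style score: cosine distance between sentence embeddings."""
    ref_emb, hyp_emb = model.encode([reference, hypothesis])
    return 1.0 - util.cos_sim(ref_emb, hyp_emb).item()


model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice
reference = "set an alarm for nine in the morning"
hyp_benign = "set an alarm for 9 in the morning"      # one substitution, same meaning
hyp_harmful = "set an alarm for wine in the morning"  # one substitution, meaning lost

for hyp in (hyp_benign, hyp_harmful):
    print(f"{hyp!r}: WER={wer(reference, hyp):.2f}, "
          f"SemDist={sem_dist(reference, hyp, model):.3f}")
```

Both hypotheses carry the same WER (one substitution out of eight reference words), but only the second destroys the intent of the utterance; the paper's argument is that an embedding-based distance separates such cases where WER cannot.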
Related papers
- Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition [52.624909026294105]
We propose a non-autoregressive speech error correction method.
A Confidence Module measures the uncertainty of each word in the N-best ASR hypotheses.
The proposed system reduces the error rate by 21% compared with the ASR model.
arXiv Detail & Related papers (2024-06-29T17:56:28Z)
- Automatic Speech Recognition System-Independent Word Error Rate Estimation [23.25173244408922]
Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems.
In this paper, a hypothesis generation method for ASR System-Independent WER estimation is proposed.
arXiv Detail & Related papers (2024-04-25T16:57:05Z)
- Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech [50.95292368372455]
We propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized variational autoencoder (VQ-VAE).
The training of VQ-VAE relies on clean speech; hence, large quantization errors can be expected when the speech is distorted.
We found that the vector quantization mechanism could also be used for self-supervised speech enhancement (SE) model training.
arXiv Detail & Related papers (2024-02-26T06:01:38Z)
- Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline [12.197453599489963]
We propose the development of an Error Explainable Benchmark (EEB) dataset.
This dataset, which considers both the speech and text levels, enables a granular understanding of the model's shortcomings.
Our proposal provides a structured pathway for a more real-world-centric evaluation, allowing for the detection and rectification of nuanced system weaknesses.
arXiv Detail & Related papers (2024-01-26T03:42:45Z)
- NoRefER: a Referenceless Quality Metric for Automatic Speech Recognition via Semi-Supervised Language Model Fine-Tuning with Contrastive Learning [0.20999222360659603]
NoRefER is a novel referenceless quality metric for automatic speech recognition (ASR) systems.
NoRefER exploits the known quality relationships between hypotheses from multiple compression levels of an ASR model to learn to rank intra-sample hypotheses by quality.
The results indicate that NoRefER correlates highly with reference-based metrics and their intra-sample ranks, indicating a high potential for referenceless ASR evaluation or A/B testing.
arXiv Detail & Related papers (2023-06-21T21:26:19Z)
- BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric [66.73705349465207]
End-to-end speech-to-speech translation (S2ST) is generally evaluated with text-based metrics.
We propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems.
arXiv Detail & Related papers (2022-12-16T14:00:26Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptual evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- Semantic-WER: A Unified Metric for the Evaluation of ASR Transcript for End Usability [1.599072005190786]
State-of-the-art systems have achieved a word error rate (WER) of less than 5%.
Semantic-WER (SWER) is a metric for evaluating ASR transcripts for downstream applications in general.
arXiv Detail & Related papers (2021-06-03T17:35:14Z)
- Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding [26.958001571944678]
We propose a novel Semantic Distance (SemDist) measure as an alternative evaluation metric for ASR systems.
We demonstrate the effectiveness of our proposed metric on various downstream tasks, including intent recognition, semantic parsing, and named entity recognition.
arXiv Detail & Related papers (2021-04-05T20:25:07Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained on small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.