BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric
- URL: http://arxiv.org/abs/2212.08486v1
- Date: Fri, 16 Dec 2022 14:00:26 GMT
- Title: BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric
- Authors: Mingda Chen and Paul-Ambroise Duquenne and Pierre Andrews and Justine
Kao and Alexandre Mourachko and Holger Schwenk and Marta R. Costa-jussà
- Abstract summary: End-to-end speech-to-speech translation (S2ST) is generally evaluated with text-based metrics.
We propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems.
- Score: 66.73705349465207
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: End-to-End speech-to-speech translation (S2ST) is generally evaluated with
text-based metrics. This means that generated speech has to be automatically
transcribed, making the evaluation dependent on the availability and quality of
automatic speech recognition (ASR) systems. In this paper, we propose a
text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the
dependency on ASR systems. BLASER leverages a multilingual multimodal encoder
to directly encode the speech segments for source input, translation output and
reference into a shared embedding space and computes a score of the translation
quality that can be used as a proxy to human evaluation. To evaluate our
approach, we construct training and evaluation sets from more than 40k human
annotations covering seven language directions. The best results of BLASER are
achieved by training with supervision from human rating scores. We show that
when evaluated at the sentence level, BLASER correlates significantly better
with human judgment compared to ASR-dependent metrics including ASR-SENTBLEU in
all translation directions and ASR-COMET in five of them. Our analysis shows
combining speech and text as inputs to BLASER does not increase the correlation
with human scores, but best correlations are achieved when using speech, which
motivates the goal of our research. Moreover, we show that using ASR for
references is detrimental for text-based metrics.
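The scoring idea lends itself to a short illustration. The sketch below is a minimal, hypothetical rendering of the text-free scoring described in the abstract: the source, translation, and reference speech segments are mapped by the same multilingual multimodal encoder into one shared embedding space, and the score averages the translation's cosine similarity to the source and to the reference. The `encode_speech` helper is a placeholder assumption, and the supervised variant trained on human rating scores (which the paper reports as best) is not shown.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def blaser_style_unsupervised_score(src_emb: np.ndarray,
                                    mt_emb: np.ndarray,
                                    ref_emb: np.ndarray) -> float:
    """Text-free quality proxy: average the translation's similarity to the
    source and to the reference in the shared speech embedding space.
    This mirrors the abstract's description, not the official implementation."""
    return 0.5 * (cosine(src_emb, mt_emb) + cosine(ref_emb, mt_emb))

# Hypothetical usage; `encode_speech` stands in for the shared multilingual
# multimodal speech encoder and is not defined here.
# src_emb = encode_speech("source.wav")
# mt_emb  = encode_speech("translation.wav")
# ref_emb = encode_speech("reference.wav")
# print(blaser_style_unsupervised_score(src_emb, mt_emb, ref_emb))
```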
Related papers
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z) - Quantification of stylistic differences in human- and ASR-produced transcripts of African American English [1.8021379035665333]
Stylistic differences, such as verbatim vs non-verbatim, can play a significant role in ASR performance evaluation.
We categorize the kinds of stylistic differences between 6 transcription versions, 4 human- and 2 ASR-produced, of 10 hours of African American English speech.
We investigate the interactions of these categories with how well transcripts can be compared via word error rate.
arXiv Detail & Related papers (2024-09-04T20:18:59Z) - DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming speech translation (ST) and speaker diarization (SD) solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z) - Improving Textless Spoken Language Understanding with Discrete Units as
Intermediate Target [58.59044226658916]
Spoken Language Understanding (SLU) is a task that aims to extract semantic information from spoken utterances.
We propose to use discrete units as intermediate guidance to improve textless SLU performance.
arXiv Detail & Related papers (2023-05-29T14:00:24Z) - WACO: Word-Aligned Contrastive Learning for Speech Translation [11.67083845641806]
End-to-end Speech Translation (E2E ST) aims to directly translate source speech into target text.
Existing ST methods perform poorly when only extremely small amounts of parallel speech-text data are available for training.
We propose Word-Aligned COntrastive learning (WACO), a simple and effective method for extremely low-resource speech-to-text translation.
arXiv Detail & Related papers (2022-12-19T10:49:35Z) - SpeechLMScore: Evaluating speech generation using speech language model [43.20067175503602]
We propose SpeechLMScore, an unsupervised metric to evaluate generated speech using a speech-language model.
It does not require human annotation and is a highly scalable framework.
Evaluation results demonstrate that the proposed metric shows a promising correlation with human evaluation scores on different speech generation tasks.
arXiv Detail & Related papers (2022-12-08T21:00:15Z) - Benchmarking Evaluation Metrics for Code-Switching Automatic Speech
Recognition [19.763431520942028]
We develop a benchmark data set of code-switching speech recognition hypotheses with human judgments.
We define clear guidelines for minimal editing of automatic hypotheses.
We release the first corpus for human acceptance of code-switching speech recognition results in dialectal Arabic/English conversational speech.
arXiv Detail & Related papers (2022-11-22T08:14:07Z) - A Textless Metric for Speech-to-Speech Comparison [20.658229254191266]
We introduce a new and simple method for comparing speech utterances without relying on text transcripts.
Our speech-to-speech comparison metric utilizes state-of-the-art speech2unit encoders like HuBERT to convert speech utterances into discrete acoustic units.
arXiv Detail & Related papers (2022-10-21T09:28:54Z) - The Conversational Short-phrase Speaker Diarization (CSSD) Task:
Dataset, Evaluation Metric and Baselines [63.86406909879314]
This paper describes the Conversational Short-phrase Speaker Diarization (CSSD) task.
It consists of training and testing datasets, evaluation metric and baselines.
In the metric aspect, we design the new conversational DER (CDER) evaluation metric, which calculates speaker diarization (SD) accuracy at the utterance level.
arXiv Detail & Related papers (2022-08-17T03:26:23Z) - LeBenchmark: A Reproducible Framework for Assessing Self-Supervised
Representation Learning from Speech [63.84741259993937]
Self-Supervised Learning (SSL) on large amounts of unlabeled data has been successfully explored for image and natural language processing.
Recent works also investigated SSL from speech.
We propose LeBenchmark: a reproducible framework for assessing SSL from speech.
arXiv Detail & Related papers (2021-04-23T08:27:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.