Assessing Evaluation Metrics for Speech-to-Speech Translation
- URL: http://arxiv.org/abs/2110.13877v1
- Date: Tue, 26 Oct 2021 17:35:20 GMT
- Title: Assessing Evaluation Metrics for Speech-to-Speech Translation
- Authors: Elizabeth Salesky, Julian Mäder, Severin Klinger
- Abstract summary: Speech-to-speech translation combines machine translation with speech synthesis.
How to automatically evaluate speech-to-speech translation is an open question which has not previously been explored.
- Score: 9.670709690031885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech-to-speech translation combines machine translation with speech
synthesis, introducing evaluation challenges not present in either task alone.
How to automatically evaluate speech-to-speech translation is an open question
which has not previously been explored. Translating to speech rather than to
text is often motivated by unwritten languages or languages without
standardized orthographies. However, we show that the previously used automatic
metric for this task is best equipped for standardized high-resource languages
only. In this work, we first evaluate current metrics for speech-to-speech
translation, and second assess how translation to dialectal variants rather
than to standardized languages impacts various evaluation methods.
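For concreteness, automatic evaluation of speech output commonly proceeds by transcribing the synthesized target speech with ASR and scoring the transcript against text references with metrics such as BLEU or character-level chrF (the latter being less tied to a standardized orthography). Below is a minimal sketch of that pipeline, assuming a hypothetical `transcribe` placeholder for the ASR step; sacrebleu is a real library, but this is an illustration rather than the paper's evaluation code.

```python
# Minimal sketch of ASR-based evaluation for speech-to-speech translation.
# `transcribe` is a hypothetical placeholder for any ASR system; sacrebleu
# provides the corpus_bleu and corpus_chrf scorers used below.
import sacrebleu

def transcribe(audio_path: str) -> str:
    """Hypothetical placeholder: run an ASR model on the synthesized audio."""
    raise NotImplementedError("plug in an ASR system here")

def asr_based_scores(audio_paths, references):
    """Transcribe system outputs, then score transcripts against references."""
    hypotheses = [transcribe(p) for p in audio_paths]
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    # chrF operates on character n-grams, so it depends less on a
    # standardized orthography than word-level BLEU does.
    chrf = sacrebleu.corpus_chrf(hypotheses, [references])
    return bleu.score, chrf.score
```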
Related papers
- STAB: Speech Tokenizer Assessment Benchmark [57.45234921100835]
Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text.
We present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively.
We evaluate the STAB metrics and correlate this with downstream task performance across a range of speech tasks and tokenizer choices.
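As an illustration of the tokenization step such a benchmark assesses (a generic sketch, not STAB's own tooling): frame-level self-supervised features are clustered into a finite codebook, and each frame is replaced by its nearest cluster ID. Here scikit-learn's KMeans stands in for the corpus-trained codebook used in real systems.

```python
# Generic sketch of turning speech features into discrete tokens.
# `corpus_features` / `features` are (frames, dims) arrays of self-supervised
# speech features (e.g., from a HuBERT-style encoder); KMeans stands in for a
# codebook trained offline over a large corpus in practice.
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(corpus_features: np.ndarray, n_units: int = 100) -> KMeans:
    """Fit a k-means codebook over pooled corpus features."""
    return KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(corpus_features)

def speech_to_tokens(features: np.ndarray, codebook: KMeans) -> list[int]:
    """Map each frame to its nearest codebook entry, then collapse repeats."""
    ids = codebook.predict(features)
    # Collapsing consecutive duplicates yields a shorter, text-like sequence.
    return [int(u) for i, u in enumerate(ids) if i == 0 or u != ids[i - 1]]
```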
arXiv Detail & Related papers (2024-09-04T02:20:59Z)
- Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation [23.757896930482342]
This work explores the unit selection process through a study of downstream tasks.
Units that perform well in resynthesis do not necessarily correlate with those that enhance translation quality.
arXiv Detail & Related papers (2024-07-08T08:53:26Z)
- EmphAssess: a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models [25.683827726880594]
We introduce EmphAssess, a benchmark designed to evaluate the capability of speech-to-speech models to encode and reproduce prosodic emphasis.
We apply this to two tasks: speech resynthesis and speech-to-speech translation.
In both cases, the benchmark evaluates the ability of the model to encode emphasis in the speech input and accurately reproduce it in the output.
As part of the evaluation pipeline, we introduce EmphaClass, a new model that classifies emphasis at the frame or word level.
arXiv Detail & Related papers (2023-12-21T17:47:33Z)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We develop the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features [13.44542301438426]
We propose a direct speech-to-speech translation model which can be trained without any textual annotation or content information.
Experiments on Mandarin-Cantonese speech translation demonstrate the feasibility of the proposed approach.
arXiv Detail & Related papers (2022-12-12T10:03:10Z)
- Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition [19.763431520942028]
We develop a benchmark data set of code-switching speech recognition hypotheses with human judgments.
We define clear guidelines for minimal editing of automatic hypotheses.
We release the first corpus for human acceptance of code-switching speech recognition results in dialectal Arabic/English conversational speech.
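The baseline such human-judgment benchmarks are compared against is word error rate, which is easy to compute but contentious for mixed-language text. A generic sketch using the jiwer library (not the paper's evaluation code; the example sentence is hypothetical):

```python
# Generic WER computation: score an ASR hypothesis against a reference.
# For code-switching, both strings may mix languages and romanizations,
# which is exactly where plain word-level WER becomes contentious.
import jiwer

reference = "ana rayeH el gym 3ashan atmarran"   # hypothetical mixed-language line
hypothesis = "ana rayeH gym 3ashan atmarran"     # one word dropped by the ASR

wer = jiwer.wer(reference, hypothesis)  # word edits divided by reference length
print(f"WER: {wer:.2f}")
```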
arXiv Detail & Related papers (2022-11-22T08:14:07Z)
- A Textless Metric for Speech-to-Speech Comparison [20.658229254191266]
We introduce a new and simple method for comparing speech utterances without relying on text transcripts.
Our speech-to-speech comparison metric utilizes state-of-the-art speech2unit encoders like HuBERT to convert speech utterances into discrete acoustic units.
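A minimal sketch of how such a comparison can work once both utterances have been mapped to discrete unit sequences (e.g., via a HuBERT encoder plus a k-means codebook, as sketched earlier). The normalized edit distance below is an illustrative choice of sequence comparison, not necessarily the paper's exact scoring function.

```python
# Sketch of a textless comparison between two utterances, assuming each has
# already been converted to a sequence of discrete acoustic unit IDs.
def edit_distance(a: list[int], b: list[int]) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def unit_similarity(units_a: list[int], units_b: list[int]) -> float:
    """1.0 for identical unit sequences, approaching 0.0 as they diverge."""
    denom = max(len(units_a), len(units_b)) or 1
    return 1.0 - edit_distance(units_a, units_b) / denom
```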
arXiv Detail & Related papers (2022-10-21T09:28:54Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain the de facto standard for evaluating tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community to consider more carefully how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
- Self-Supervised Representations Improve End-to-End Speech Translation [57.641761472372814]
We show that self-supervised pre-trained features can consistently improve the translation performance.
Cross-lingual transfer allows the approach to be extended to a variety of languages with little or no tuning.
arXiv Detail & Related papers (2020-06-22T10:28:38Z)
- UWSpeech: Speech to Speech Translation for Unwritten Languages [145.37116196042282]
We develop a translation system for unwritten languages, named UWSpeech, which converts target unwritten speech into discrete tokens with a converter.
We propose a method called XL-VAE, which enhances vector quantized variational autoencoder (VQ-VAE) with cross-lingual (XL) speech recognition.
Experiments on the Fisher Spanish-English conversation translation dataset show that UWSpeech outperforms direct translation and the VQ-VAE baseline by about 16 and 10 BLEU points, respectively.
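For context, the quantization step at the heart of VQ-VAE (which XL-VAE extends) can be sketched as follows; this is a generic PyTorch nearest-codebook lookup with a straight-through gradient estimator, not UWSpeech's implementation.

```python
# Generic VQ-VAE quantization step (not UWSpeech's code): snap each encoder
# output vector to its nearest codebook entry, with a straight-through
# estimator so gradients flow back to the encoder.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, time, dim) encoder outputs.
        flat = z.reshape(-1, z.size(-1))                 # (batch*time, dim)
        dists = torch.cdist(flat, self.codebook.weight)  # (batch*time, num_codes)
        ids = dists.argmin(dim=-1).view(z.shape[:-1])    # (batch, time) discrete tokens
        quantized = self.codebook(ids)                   # nearest codebook vectors
        # Straight-through: forward pass uses `quantized`,
        # backward pass copies gradients to z unchanged.
        quantized = z + (quantized - z).detach()
        return quantized, ids
```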
arXiv Detail & Related papers (2020-06-14T15:22:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.