SpeechAlign: a Framework for Speech Translation Alignment Evaluation
- URL: http://arxiv.org/abs/2309.11585v2
- Date: Thu, 25 Apr 2024 09:21:30 GMT
- Title: SpeechAlign: a Framework for Speech Translation Alignment Evaluation
- Authors: Belen Alastruey, Aleix Sant, Gerard I. Gállego, David Dale, Marta R. Costa-jussà
- Abstract summary: SpeechAlign is a framework designed to evaluate the underexplored field of source-target alignment in speech models.
To tackle the absence of suitable evaluation datasets, we introduce the Speech Gold Alignment dataset.
We also introduce two novel metrics, Speech Alignment Error Rate (SAER) and Time-weighted Speech Alignment Error Rate (TW-SAER).
- Score: 15.069228503777124
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Speech-to-Speech and Speech-to-Text translation are currently dynamic areas of research. In our commitment to advancing these fields, we present SpeechAlign, a framework designed to evaluate the underexplored field of source-target alignment in speech models. The SpeechAlign framework has two core components. First, to tackle the absence of suitable evaluation datasets, we introduce the Speech Gold Alignment dataset, built upon an English-German text translation gold alignment dataset. Second, we introduce two novel metrics, Speech Alignment Error Rate (SAER) and Time-weighted Speech Alignment Error Rate (TW-SAER), which enable the evaluation of alignment quality within speech models. While the former gives equal importance to each word, the latter assigns weights based on the length of the words in the speech signal. By publishing SpeechAlign we provide an accessible evaluation framework for model assessment, and we employ it to benchmark open-source Speech Translation models. In doing so, we contribute to the ongoing research progress within the fields of Speech-to-Speech and Speech-to-Text translation.
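The contrast the abstract draws between SAER (every word weighted equally) and TW-SAER (words weighted by their spoken duration) can be sketched with a generic alignment-error-rate computation. The exact metric definitions are given in the paper; the function name, the link representation, and the duration values below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of uniform vs. time-weighted alignment scoring.
# Alignments are sets of (source_word_index, target_word_index) links;
# each source word carries a weight: 1.0 for a uniform (SAER-like) score,
# or its spoken duration for a time-weighted (TW-SAER-like) score.

def alignment_error_rate(hyp, gold, weights=None):
    """hyp, gold: sets of (src_idx, tgt_idx) alignment links.
    weights: optional dict src_idx -> weight (e.g. word duration in
    seconds); defaults to 1.0 per word (uniform weighting)."""
    if weights is None:
        weights = {}
    w = lambda link: weights.get(link[0], 1.0)
    matched = sum(w(link) for link in hyp & gold)   # weight of agreed links
    total = sum(w(link) for link in hyp) + sum(w(link) for link in gold)
    if total == 0:
        return 0.0
    return 1.0 - 2.0 * matched / total

# Uniform weighting: every word counts equally.
hyp = {(0, 0), (1, 2), (2, 1)}
gold = {(0, 0), (1, 1), (2, 1)}
uniform_score = alignment_error_rate(hyp, gold)            # 1/3

# Duration weighting: longer spoken words count more.
durations = {0: 0.2, 1: 0.9, 2: 0.4}  # hypothetical lengths in seconds
weighted_score = alignment_error_rate(hyp, gold, durations)  # 0.6
```

With these hypothetical durations, the one misaligned word (index 1) is also the longest, so the time-weighted score penalizes the error more heavily than the uniform score does.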
Related papers
- EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models [25.683827726880594]
We introduce EmphAssess, a benchmark designed to evaluate the capability of speech-to-speech models to encode and reproduce prosodic emphasis.
We apply this to two tasks: speech resynthesis and speech-to-speech translation.
In both cases, the benchmark evaluates the ability of the model to encode emphasis in the speech input and accurately reproduce it in the output.
As part of the evaluation pipeline, we introduce EmphaClass, a new model that classifies emphasis at the frame or word level.
arXiv Detail & Related papers (2023-12-21T17:47:33Z) - Understanding Shared Speech-Text Representations [34.45772613231558]
Maestro has developed approaches to train speech models by incorporating text into end-to-end models.
We find that a corpus-specific duration model for speech-text alignment is the most important component for learning a shared speech-text representation.
We find that the shared encoder learns a more compact and overlapping speech-text representation than the uni-modal encoders.
arXiv Detail & Related papers (2023-04-27T20:05:36Z) - Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from data scarcity, because parallel corpora pairing source-language speech with target-language speech are very rare.
We propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks.
arXiv Detail & Related papers (2022-10-31T02:55:51Z) - SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z) - LibriS2S: A German-English Speech-to-Speech Translation Corpus [12.376309678270275]
We create the first publicly available speech-to-speech training corpus between German and English.
This allows the creation of a new text-to-speech and speech-to-speech translation model.
We propose Text-to-Speech models based on the recently proposed FastSpeech 2 model.
arXiv Detail & Related papers (2022-04-22T09:33:31Z) - Unified Speech-Text Pre-training for Speech Translation and Recognition [113.31415771943162]
We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition.
The proposed method incorporates four self-supervised and supervised subtasks for cross modality learning.
It achieves between 1.7 and 2.3 BLEU improvement above the state of the art on the MuST-C speech translation dataset.
arXiv Detail & Related papers (2022-04-11T20:59:51Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z) - Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.