Related papers: EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models

EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models

URL: http://arxiv.org/abs/2312.14069v1
Date: Thu, 21 Dec 2023 17:47:33 GMT
Title: EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models
Authors: Maureen de Seyssel, Antony D'Avirro, Adina Williams, Emmanuel Dupoux
Abstract summary: We introduce EmphAssess, a benchmark designed to evaluate the capability of speech-to-speech models to encode and reproduce prosodic emphasis. We apply this to two tasks: speech resynthesis and speech-to-speech translation. In both cases, the benchmark evaluates the ability of the model to encode emphasis in the speech input and accurately reproduce it in the output. As part of the evaluation pipeline, we introduce EmphaClass, a new model that classifies emphasis at the frame or word level.
Score: 28.05773667801356
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce EmphAssess, a prosodic benchmark designed to evaluate the capability of speech-to-speech models to encode and reproduce prosodic emphasis. We apply this to two tasks: speech resynthesis and speech-to-speech translation. In both cases, the benchmark evaluates the ability of the model to encode emphasis in the speech input and accurately reproduce it in the output, potentially across a change of speaker and language. As part of the evaluation pipeline, we introduce EmphaClass, a new model that classifies emphasis at the frame or word level.

Related papers

SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models [60.72029578488467]
SpeechR is a unified benchmark for evaluating reasoning over speech in large audio-language models.<n>It evaluates models along three key dimensions: factual retrieval, procedural inference, and normative judgment.<n> Evaluations on eleven state-of-the-art LALMs reveal that high transcription accuracy does not translate into strong reasoning capabilities.
arXiv Detail & Related papers (2025-08-04T03:28:04Z)
STAB: Speech Tokenizer Assessment Benchmark [57.45234921100835]
Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text. We present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively. We evaluate the STAB metrics and correlate this with downstream task performance across a range of speech tasks and tokenizer choices.
arXiv Detail & Related papers (2024-09-04T02:20:59Z)
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion. We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts? [4.148732457277201]
Authorship verification is the task of determining if two distinct writing samples share the same author. In this paper, we explore the attribution of transcribed speech, which poses novel challenges. We propose a new benchmark for speaker attribution focused on human-transcribed conversational speech transcripts.
arXiv Detail & Related papers (2023-11-13T18:54:17Z)
SpeechAlign: a Framework for Speech Translation Alignment Evaluation [15.069228503777124]
SpeechAlign is a framework designed to evaluate the underexplored field of source-target alignment in speech models. To tackle the absence of suitable evaluation datasets, we introduce the Speech Gold Alignment dataset. We also introduce two novel metrics, Speech Alignment Error Rate (SAER) and Time-weighted Speech Alignment Error Rate (TW-SAER)
arXiv Detail & Related papers (2023-09-20T18:46:37Z)
Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech. We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST)
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features [13.44542301438426]
We propose a direct speech-to-speech translation model which can be trained without any textual annotation or content information. Experiments on Mandarin-Cantonese speech translation demonstrate the feasibility of the proposed approach.
arXiv Detail & Related papers (2022-12-12T10:03:10Z)
SpeechPainter: Text-conditioned Speech Inpainting [12.027499164122492]
We propose SpeechPainter, a model for filling in gaps of up to one second in speech samples by leveraging an auxiliary textual input. We demonstrate that the model performs speech inpainting with the appropriate content, while maintaining speaker identity, prosody and recording environment conditions.
arXiv Detail & Related papers (2022-02-15T09:33:30Z)
On Prosody Modeling for ASR+TTS based Voice Conversion [82.65378387724641]
In voice conversion, an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents. Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity. We propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP)
arXiv Detail & Related papers (2021-07-20T13:30:23Z)
VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement. We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training. Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way. Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously. We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.