EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in
Speech-to-Speech Models
- URL: http://arxiv.org/abs/2312.14069v1
- Date: Thu, 21 Dec 2023 17:47:33 GMT
- Title: EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in
Speech-to-Speech Models
- Authors: Maureen de Seyssel, Antony D'Avirro, Adina Williams, Emmanuel Dupoux
- Abstract summary: We introduce EmphAssess, a benchmark designed to evaluate the capability of speech-to-speech models to encode and reproduce prosodic emphasis.
We apply this to two tasks: speech resynthesis and speech-to-speech translation.
In both cases, the benchmark evaluates the ability of the model to encode emphasis in the speech input and accurately reproduce it in the output.
As part of the evaluation pipeline, we introduce EmphaClass, a new model that classifies emphasis at the frame or word level.
- Score: 28.05773667801356
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce EmphAssess, a prosodic benchmark designed to evaluate the
capability of speech-to-speech models to encode and reproduce prosodic
emphasis. We apply this to two tasks: speech resynthesis and speech-to-speech
translation. In both cases, the benchmark evaluates the ability of the model to
encode emphasis in the speech input and accurately reproduce it in the output,
potentially across a change of speaker and language. As part of the evaluation
pipeline, we introduce EmphaClass, a new model that classifies emphasis at the
frame or word level.
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework, TransVIP, that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts? [4.148732457277201]
Authorship verification is the task of determining if two distinct writing samples share the same author.
In this paper, we explore the attribution of transcribed speech, which poses novel challenges.
We propose a new benchmark for speaker attribution focused on human-transcribed conversational speech transcripts.
arXiv Detail & Related papers (2023-11-13T18:54:17Z)
- SpeechAlign: a Framework for Speech Translation Alignment Evaluation [15.069228503777124]
SpeechAlign is a framework designed to evaluate the underexplored field of source-target alignment in speech models.
To tackle the absence of suitable evaluation datasets, we introduce the Speech Gold Alignment dataset.
We also introduce two novel metrics, Speech Alignment Error Rate (SAER) and Time-weighted Speech Alignment Error Rate (TW-SAER).
arXiv Detail & Related papers (2023-09-20T18:46:37Z)
- BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models [56.93604813379634]
Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels.
We propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels.
We highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.
arXiv Detail & Related papers (2023-06-02T12:54:38Z)
- Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features [13.44542301438426]
We propose a direct speech-to-speech translation model which can be trained without any textual annotation or content information.
Experiments on Mandarin-Cantonese speech translation demonstrate the feasibility of the proposed approach.
arXiv Detail & Related papers (2022-12-12T10:03:10Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- SpeechPainter: Text-conditioned Speech Inpainting [12.027499164122492]
We propose SpeechPainter, a model for filling in gaps of up to one second in speech samples by leveraging an auxiliary textual input.
We demonstrate that the model performs speech inpainting with the appropriate content, while maintaining speaker identity, prosody and recording environment conditions.
arXiv Detail & Related papers (2022-02-15T09:33:30Z)
- On Prosody Modeling for ASR+TTS based Voice Conversion [82.65378387724641]
In voice conversion, an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents.
Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity.
We propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP).
arXiv Detail & Related papers (2021-07-20T13:30:23Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)