EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech
Resynthesis
- URL: http://arxiv.org/abs/2308.05725v1
- Date: Thu, 10 Aug 2023 17:41:19 GMT
- Title: EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech
Resynthesis
- Authors: Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat,
Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid,
Felix Kreuk, Yossi Adi, Emmanuel Dupoux
- Abstract summary: We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
- Score: 49.04496602282718
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent work has shown that it is possible to resynthesize high-quality speech
based, not on text, but on low bitrate discrete units that have been learned in
a self-supervised fashion and can therefore capture expressive aspects of
speech that are hard to transcribe (prosody, voice styles, non-verbal
vocalization). The adoption of these methods is still limited by the fact that
most speech synthesis datasets are read, severely limiting spontaneity and
expressivity. Here, we introduce Expresso, a high-quality expressive speech
dataset for textless speech synthesis that includes both read speech and
improvised dialogues rendered in 26 spontaneous expressive styles. We
illustrate the challenges and potentials of this dataset with an expressive
resynthesis benchmark where the task is to encode the input in low-bitrate
units and resynthesize it in a target voice while preserving content and style.
We evaluate resynthesis quality with automatic metrics for different
self-supervised discrete encoders, and explore tradeoffs between quality,
bitrate, and invariance to speaker and style. The dataset, evaluation metrics,
and baseline models are open source.
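To make the benchmark task concrete, the sketch below shows the generic encode-then-resynthesize loop such systems follow: self-supervised features are quantized into a small discrete codebook, and a unit-to-waveform vocoder regenerates speech in a target voice. This is a minimal illustration assuming a HuBERT-style encoder, a k-means codebook, and a unit vocoder; the function names and signatures are placeholders, not the released Expresso code.

```python
# Minimal sketch of discrete expressive resynthesis (illustrative placeholders,
# not the Expresso release). `ssl_encoder` maps a waveform to frame-level features
# (e.g. a HuBERT-style model); `unit_vocoder` maps discrete units plus a target
# speaker id back to a waveform.
import torch


def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Assign each frame in (T, D) features to its nearest centroid in a (K, D) codebook."""
    dists = torch.cdist(features, codebook)          # (T, K) pairwise distances
    return dists.argmin(dim=-1)                      # one discrete unit per frame


def deduplicate(units: torch.Tensor) -> torch.Tensor:
    """Collapse consecutive repeated units, a common way to lower the unit bitrate."""
    keep = torch.ones_like(units, dtype=torch.bool)
    keep[1:] = units[1:] != units[:-1]
    return units[keep]


def resynthesize(wav, ssl_encoder, codebook, unit_vocoder, target_speaker_id):
    """Encode `wav` into low-bitrate units, then decode them in the target voice.

    The benchmark asks the output to preserve the linguistic content and the
    expressive style of `wav` while taking on the identity of `target_speaker_id`.
    """
    feats = ssl_encoder(wav)                          # (T, D) self-supervised features
    units = deduplicate(quantize(feats, codebook))    # low-bitrate discrete code
    return unit_vocoder(units, target_speaker_id)     # expressive speech in the target voice
```

The tradeoffs studied in the paper surface directly in this sketch: the codebook size and the deduplication step set the bitrate, while the choice of self-supervised encoder largely determines how invariant the units are to speaker and style, and therefore how much expressive detail is left for the vocoder to reconstruct.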
Related papers
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models [15.068637971987224]
We explore the latent space of frozen TTS models, which is composed of the latent bottleneck activations of the DDM's denoiser.
We identify that this space contains rich semantic information, and outline several novel methods for finding semantic directions within it, both supervised and unsupervised.
We demonstrate how these enable off-the-shelf audio editing, without any further training, architectural changes or data requirements.
arXiv Detail & Related papers (2024-02-19T16:22:21Z)
- Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts? [4.148732457277201]
Authorship verification is the task of determining if two distinct writing samples share the same author.
In this paper, we explore the attribution of transcribed speech, which poses novel challenges.
We propose a new benchmark for speaker attribution focused on human-transcribed conversational speech transcripts.
arXiv Detail & Related papers (2023-11-13T18:54:17Z)
- Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection [15.884911752869437]
We present a novel approach for synthetic speech detection, exploiting the combination of two high-level semantic properties of the human voice.
On one side, we focus on speaker identity cues and represent them as speaker embeddings extracted using a state-of-the-art method for the automatic speaker verification task.
On the other side, voice prosody (variations in rhythm, pitch, or accent in speech) is extracted through a specialized encoder.
arXiv Detail & Related papers (2022-10-31T11:03:03Z)
- Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders [14.723225542605105]
Text-based voice editing (TBVE) uses synthetic output from text-to-speech (TTS) systems to replace words in an original recording.
Recent work has used neural models to produce edited speech similar to the original speech in terms of clarity, speaker identity, and prosody.
This work focuses on the zero-shot approach which avoids finetuning altogether.
arXiv Detail & Related papers (2022-10-28T10:31:44Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike those models which need additional reference audio as input, our model could predict emotion labels just from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiments, we first validate the effectiveness of our dataset with an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Speech Resynthesis from Discrete Disentangled Self-Supervised Representations [49.48053138928408]
We propose using self-supervised discrete representations for the task of speech resynthesis.
We extract low-bitrate representations for speech content, prosodic information, and speaker identity.
Using the obtained representations, we reach a rate of 365 bits per second while providing better speech quality than the baseline methods (a rough bitrate accounting for such multi-stream discrete codes is sketched after this list).
arXiv Detail & Related papers (2021-04-01T09:20:33Z)
- Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise [18.135965605011105]
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance.
A studio-quality corpus with manual transcription is necessary to train such seq2seq systems.
We propose an approach to build high-quality and stable seq2seq based speech synthesis system using challenging found data.
arXiv Detail & Related papers (2020-04-28T15:32:45Z)
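As a rough sense of scale for the bitrates quoted above (for instance the 365 bits per second reached by the discrete disentangled representations), the snippet below does a back-of-the-envelope accounting for a multi-stream discrete code with separate content and prosody streams. The frame rates and codebook sizes are assumed round numbers for illustration, not the values used in that paper.

```python
import math

# Hypothetical multi-stream discrete code: a content-unit stream plus a coarser
# prosody stream. All rates and codebook sizes are illustrative assumptions.
streams = {
    "content units": {"rate_hz": 50.0, "codebook": 100},  # e.g. 50 units/s, 100-entry codebook
    "prosody units": {"rate_hz": 12.5, "codebook": 32},   # coarser pitch/energy stream
}

total_bps = 0.0
for name, stream in streams.items():
    bps = stream["rate_hz"] * math.log2(stream["codebook"])  # bits/s = frame rate x bits per symbol
    total_bps += bps
    print(f"{name}: {bps:.1f} bits/s")

print(f"total: {total_bps:.1f} bits/s")  # roughly 395 bits/s with these assumed numbers
```

Codes in this regime sit several hundred times below raw 16 kHz, 16-bit PCM (256 kbit/s), which is why the choice of units and the vocoder carry so much of the burden of preserving expressive detail.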