Speech Resynthesis from Discrete Disentangled Self-Supervised
Representations
- URL: http://arxiv.org/abs/2104.00355v2
- Date: Fri, 2 Apr 2021 09:48:31 GMT
- Title: Speech Resynthesis from Discrete Disentangled Self-Supervised
Representations
- Authors: Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal
Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux
- Abstract summary: We propose using self-supervised discrete representations for the task of speech resynthesis.
We extract low-bitrate representations for speech content, prosodic information, and speaker identity.
Using the obtained representations, we reach a rate of 365 bits per second while providing better speech quality than the baseline methods.
- Score: 49.48053138928408
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose using self-supervised discrete representations for the task of
speech resynthesis. To generate disentangled representation, we separately
extract low-bitrate representations for speech content, prosodic information,
and speaker identity. This allows us to synthesize speech in a controllable
manner. We analyze various state-of-the-art, self-supervised representation
learning methods and shed light on the advantages of each method while
considering reconstruction quality and disentanglement properties.
Specifically, we evaluate the F0 reconstruction, speaker identification
performance (for both resynthesis and voice conversion), recordings'
intelligibility, and overall quality using subjective human evaluation. Lastly,
we demonstrate how these representations can be used for an ultra-lightweight
speech codec. Using the obtained representations, we reach a rate of 365
bits per second while providing better speech quality than the baseline
methods. Audio samples can be found under https://resynthesis-ssl.github.io/.
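A minimal, hypothetical sketch (not the authors' released code) of the idea above: three low-bitrate streams, content units, quantized F0, and a speaker code, are kept separate so synthesis stays controllable, and their combined bitrate can be estimated from frame rates and codebook sizes. All numbers and names below are placeholder assumptions rather than the paper's configuration.

```python
# Illustrative sketch only (not the paper's code): packaging the three
# disentangled, low-bitrate streams and estimating their combined bitrate.
# All frame rates, codebook sizes, and names below are placeholder assumptions.
import math
from dataclasses import dataclass

@dataclass
class DisentangledCode:
    content_units: list[int]  # discrete indices from a self-supervised encoder
    f0_units: list[int]       # quantized pitch track
    speaker_id: int           # index into a learned speaker table

def convert_voice(code: DisentangledCode, new_speaker: int) -> DisentangledCode:
    """Controllable synthesis: keep content and prosody, swap only the speaker code."""
    return DisentangledCode(code.content_units, code.f0_units, new_speaker)

def stream_bps(codebook_size: int, frames_per_second: float) -> float:
    """Bits per second of one stream under fixed-rate coding of its indices."""
    return frames_per_second * math.log2(codebook_size)

# Hypothetical configuration: 100 content units at 50 Hz, 32 pitch bins at 12.5 Hz.
content_bps = stream_bps(100, 50.0)
f0_bps = stream_bps(32, 12.5)
print(f"content ~ {content_bps:.0f} bps, F0 ~ {f0_bps:.0f} bps, "
      f"total ~ {content_bps + f0_bps:.0f} bps (speaker code sent once per utterance)")
```

In the paper these streams condition a neural vocoder that generates the waveform; swapping only the speaker code while keeping content and F0 fixed is what enables voice conversion.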
Related papers
- Exploring Speech Recognition, Translation, and Understanding with
Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Methods such as de-duplication and subword modeling can further compress the speech sequence length (a minimal de-duplication sketch appears after this list).
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
- TokenSplit: Using Discrete Speech Representations for Direct, Refined, and
Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech
Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- DSVAE: Interpretable Disentangled Representation for Synthetic Speech
Detection [25.451749986565375]
We propose the Disentangled Spectrogram Variational Autoencoder (DSVAE) to generate interpretable representations of a speech signal for detecting synthetic speech.
Our experimental results show high accuracy (>98%) on detecting synthetic speech from 6 known and 10 out of 11 unknown speech synthesizers.
arXiv Detail & Related papers (2023-04-06T18:37:26Z)
- ReVISE: Self-Supervised Speech Resynthesis with Visual Input for
Universal and Generalized Speech Enhancement [40.29155338515071]
ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis.
It achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model.
arXiv Detail & Related papers (2022-12-21T21:36:52Z)
- Self-Supervised Learning for Speech Enhancement through Synthesis [5.924928860260821]
We propose a denoising vocoder (DeVo) approach, where a vocoder accepts noisy representations and learns to directly synthesize clean speech.
We demonstrate a causal version capable of running on streaming audio with 10ms latency and minimal performance degradation.
arXiv Detail & Related papers (2022-11-04T16:06:56Z)
- Combining Automatic Speaker Verification and Prosody Analysis for
Synthetic Speech Detection [15.884911752869437]
We present a novel approach for synthetic speech detection, exploiting the combination of two high-level semantic properties of the human voice.
On the one hand, we focus on speaker identity cues and represent them as speaker embeddings extracted using a state-of-the-art method for the automatic speaker verification task.
On the other hand, voice prosody, understood as variations in rhythm, pitch, or accent, is extracted through a specialized encoder (a toy fusion sketch appears after this list).
arXiv Detail & Related papers (2022-10-31T11:03:03Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- From Speaker Verification to Multispeaker Speech Synthesis, Deep
Transfer with Feedback Constraint [11.982748481062542]
This paper presents a system with a feedback constraint for multispeaker speech synthesis.
We enhance knowledge transfer from speaker verification to multispeaker speech synthesis by engaging the speaker verification network.
The model is trained and evaluated on publicly available datasets.
arXiv Detail & Related papers (2020-05-10T06:11:37Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)
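As noted in the discrete-speech-units entry above, here is a minimal, hypothetical sketch of the de-duplication step used to shorten unit sequences; the unit values are made up for illustration.

```python
# Toy illustration of de-duplicating discrete speech units: collapse runs of
# identical consecutive units so downstream models see a shorter sequence.
from itertools import groupby

def deduplicate(units: list[int]) -> list[int]:
    """Keep one symbol per run of identical consecutive units."""
    return [unit for unit, _ in groupby(units)]

frames = [7, 7, 7, 12, 12, 3, 3, 3, 3, 7]  # made-up unit indices
print(deduplicate(frames))                  # [7, 12, 3, 7] -- 10 frames -> 4 symbols
```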
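For the speaker-verification-and-prosody entry above, a toy sketch of the kind of feature fusion it describes: a speaker embedding and a prosody embedding are concatenated and scored by a classifier. The dimensions, random features, and untrained linear scorer are illustrative assumptions, not the paper's setup.

```python
# Toy fusion sketch: concatenate a speaker embedding with a prosody embedding
# and score the result. A real system would train the classifier; weights here
# are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
speaker_emb = rng.standard_normal(192)  # stand-in for an ASV-style speaker embedding
prosody_emb = rng.standard_normal(128)  # stand-in for a prosody-encoder output

fused = np.concatenate([speaker_emb, prosody_emb])
weights = rng.standard_normal(fused.shape[0])
score = 1.0 / (1.0 + np.exp(-(weights @ fused)))  # sigmoid "synthetic speech" score
print(f"synthetic-speech score: {score:.3f}")
```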
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.