SpeechPainter: Text-conditioned Speech Inpainting
- URL: http://arxiv.org/abs/2202.07273v1
- Date: Tue, 15 Feb 2022 09:33:30 GMT
- Title: SpeechPainter: Text-conditioned Speech Inpainting
- Authors: Zalán Borsos, Matt Sharifi, Marco Tagliasacchi
- Abstract summary: We propose SpeechPainter, a model for filling in gaps of up to one second in speech samples by leveraging an auxiliary textual input.
We demonstrate that the model performs speech inpainting with the appropriate content, while maintaining speaker identity, prosody and recording environment conditions.
- Score: 12.027499164122492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose SpeechPainter, a model for filling in gaps of up to one second in
speech samples by leveraging an auxiliary textual input. We demonstrate that
the model performs speech inpainting with the appropriate content, while
maintaining speaker identity, prosody and recording environment conditions, and
generalizing to unseen speakers. Our approach significantly outperforms
baselines constructed using adaptive TTS, as judged by human raters in
side-by-side preference and MOS tests.
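The setup described in the abstract, masking a gap of up to one second in a speech sample and asking a model to fill it conditioned on a transcript, can be illustrated with a minimal input-construction sketch. This is a hypothetical, numpy-only illustration of the masking step; the SpeechPainter model itself is not reproduced here, and the function name and parameters are assumptions.

```python
import numpy as np

def mask_gap(waveform: np.ndarray, sample_rate: int,
             gap_start_s: float, gap_len_s: float):
    """Zero out a gap of up to one second; return (masked_waveform, mask)."""
    assert gap_len_s <= 1.0, "SpeechPainter targets gaps of up to one second"
    start = int(gap_start_s * sample_rate)
    end = start + int(gap_len_s * sample_rate)
    mask = np.zeros(len(waveform), dtype=bool)
    mask[start:end] = True
    masked = waveform.copy()
    masked[mask] = 0.0          # the region the model must inpaint
    return masked, mask

# A text-conditioned inpainting model would take (masked, mask, transcript)
# and predict the missing samples; here we only construct those inputs.
sr = 16_000
audio = np.random.randn(3 * sr).astype(np.float32)  # 3 s of placeholder audio
masked, mask = mask_gap(audio, sr, gap_start_s=1.0, gap_len_s=0.8)
```

The mask is kept alongside the waveform so that, as the abstract implies, the model can restrict generation to the gap while leaving the surrounding context (speaker identity, prosody, recording conditions) intact.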
Related papers
- Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments [0.2916558661202724]
We develop a transformer-based model that jointly performs speech recognition and speaker identification.
We show that the joint model performs comparably to Whisper under clean conditions.
Our results suggest that integrating voice representations with speech recognition can lead to more robust models under adversarial conditions.
arXiv Detail & Related papers (2024-10-07T18:39:59Z)
- DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546]
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities.
Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks.
These findings highlight the potential of rich, descriptive speech captions to reshape instruction-following SLMs.
arXiv Detail & Related papers (2024-06-27T03:52:35Z)
- EmphAssess: a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models [25.683827726880594]
We introduce EmphAssess, a benchmark designed to evaluate the capability of speech-to-speech models to encode and reproduce prosodic emphasis.
We apply this to two tasks: speech resynthesis and speech-to-speech translation.
In both cases, the benchmark evaluates the ability of the model to encode emphasis in the speech input and accurately reproduce it in the output.
As part of the evaluation pipeline, we introduce EmphaClass, a new model that classifies emphasis at the frame or word level.
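The idea of frame-level emphasis labels can be illustrated with a crude energy-based heuristic: mark a frame as emphasized when its log-energy stands well above the utterance mean. This is an illustrative stand-in only, not EmphaClass, whose architecture and features are not described in the summary above.

```python
import numpy as np

def frame_energy_emphasis(waveform: np.ndarray, frame_len: int = 400,
                          hop: int = 160, z_thresh: float = 1.0) -> np.ndarray:
    """Label frames whose log-energy z-score exceeds z_thresh as 'emphasized'."""
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([waveform[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    log_e = np.log(np.sum(frames ** 2, axis=1) + 1e-8)
    z = (log_e - log_e.mean()) / (log_e.std() + 1e-8)
    return z > z_thresh  # boolean emphasis label per frame

sr = 16_000
quiet = 0.01 * np.random.randn(sr)
loud = 0.5 * np.random.randn(sr // 2)   # a louder burst, standing in for an emphasized word
labels = frame_energy_emphasis(np.concatenate([quiet, loud, quiet]).astype(np.float32))
```

A learned classifier such as EmphaClass would replace this threshold with a model trained on annotated emphasis, but the output shape (one label per frame, aggregatable to words) is the same.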
arXiv Detail & Related papers (2023-12-21T17:47:33Z)
- uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models [57.71199494492223]
We propose a Unified Speech Enhancement and Editing (uSee) model with conditional diffusion models to handle various tasks at the same time in a generative manner.
Our experiments show that our proposed uSee model can achieve superior performance in both speech denoising and dereverberation compared to other related generative speech enhancement models.
arXiv Detail & Related papers (2023-10-02T04:36:39Z)
- Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
The key novelty of the proposed method is the direct use of a self-supervised learning (SSL) model, trained on a large amount of data, to obtain embedding vectors from speech representations.
The disentangled embeddings enable better reproduction of unseen speakers and rhythm transfer conditioned on different reference speech.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- Fine-grained Noise Control for Multispeaker Speech Synthesis [3.449700218265025]
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations.
Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors.
arXiv Detail & Related papers (2022-04-11T13:13:55Z)
- Facetron: Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations [22.14238843571225]
We propose an effective method to synthesize speaker-specific speech waveforms by conditioning on videos of an individual's face.
The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images.
We show the superiority of our proposed model over conventional methods in terms of both objective and subjective evaluation results.
arXiv Detail & Related papers (2021-07-26T07:36:02Z)
- Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single speech sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.