More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech
- URL: http://arxiv.org/abs/2111.10139v1
- Date: Fri, 19 Nov 2021 10:23:38 GMT
- Title: More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech
- Authors: Michael Hassid, Michelle Tadmor Ramanovich, Brendan Shillingford,
Miaosen Wang, Ye Jia, Tal Remez
- Abstract summary: Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text.
We demonstrate how this allows VDTTS to generate speech that not only has prosodic variations like natural pauses and pitch, but is also synchronized to the input video.
- Score: 9.035846000646481
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper we present VDTTS, a Visually-Driven Text-to-Speech model.
Motivated by dubbing, VDTTS takes advantage of video frames as an additional
input alongside text, and generates speech that matches the video signal. We
demonstrate how this allows VDTTS, unlike plain TTS models, to generate speech
that not only has prosodic variations such as natural pauses and pitch but is
also synchronized to the input video. Experimentally, we show our model
produces well synchronized outputs, approaching the video-speech
synchronization quality of the ground-truth, on several challenging benchmarks
including "in-the-wild" content from VoxCeleb2. We encourage the reader to view
the demo videos demonstrating video-speech synchronization, robustness to
speaker ID swapping, and prosody.
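To make the idea concrete, here is a minimal, hypothetical sketch (in PyTorch) of a visually-driven TTS forward pass. It is not the authors' VDTTS implementation; the module choices, layer sizes, and the assumption of precomputed 512-dimensional face-crop features per video frame are illustrative. It shows the key property described above: the decoder emits one acoustic frame per video frame, so the generated speech is time-aligned with the input video by construction.

# A minimal sketch (not the authors' VDTTS implementation) of a visually-driven
# TTS forward pass: text tokens and per-frame video features are encoded
# separately, the video timeline attends over the text encoding, and the decoder
# predicts one mel-spectrogram frame per video frame, so the output is
# time-aligned with the video. All sizes and names are illustrative assumptions.
import torch
import torch.nn as nn


class VisuallyDrivenTTS(nn.Module):
    def __init__(self, vocab_size=80, video_dim=512, d_model=256, n_mels=80):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Video frames are assumed to arrive as precomputed face-crop features.
        self.video_proj = nn.Linear(video_dim, d_model)
        # Queries come from the video timeline, so the decoder runs at the
        # video frame rate and synchronization falls out of the output length.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, text_ids, video_feats):
        # text_ids: (B, T_text) int64; video_feats: (B, T_video, video_dim) float32
        text_h = self.text_encoder(self.text_embed(text_ids))
        video_h = self.video_proj(video_feats)
        fused, _ = self.cross_attn(query=video_h, key=text_h, value=text_h)
        out, _ = self.decoder(fused + video_h)
        return self.mel_head(out)  # (B, T_video, n_mels)


if __name__ == "__main__":
    model = VisuallyDrivenTTS()
    mel = model(torch.randint(0, 80, (2, 32)), torch.randn(2, 75, 512))
    print(mel.shape)  # torch.Size([2, 75, 80])

A full system would additionally condition on speaker identity and drive a neural vocoder from the predicted mel-spectrogram; those components are omitted here for brevity.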
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Towards Accurate Lip-to-Speech Synthesis in-the-Wild [31.289366690147556]
We introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements.
The traditional approach of generating speech directly from lip videos struggles to learn a robust language model from speech alone.
We propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model.
arXiv Detail & Related papers (2024-03-02T04:07:24Z)
- VideoCon: Robust Video-Language Alignment via Contrast Captions [80.08882631838914]
Video-language alignment models are not robust to semantically-plausible contrastive changes in the video captions.
Our work identifies a broad spectrum of contrast misalignments, such as replaced entities, replaced actions, and flipped event order.
Our model sets a new state of the art in zero-shot performance on temporally-extensive video-language tasks.
arXiv Detail & Related papers (2023-11-15T19:51:57Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- TVLT: Textless Vision-Language Transformer [89.31422264408002]
We present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs.
TVLT attains performance comparable to its text-based counterpart on various multimodal tasks.
Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals.
arXiv Detail & Related papers (2022-09-28T15:08:03Z)
- Neural Dubber: Dubbing for Silent Videos According to Scripts [22.814626504851752]
We propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task.
Neural Dubber is a multi-modal text-to-speech model that utilizes the lip movement in the video to control the prosody of the generated speech.
Experiments show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.
arXiv Detail & Related papers (2021-10-15T17:56:07Z)
- VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over [68.22776506861872]
We formulate a novel task to synthesize speech in sync with a silent pre-recorded video, denoted as automatic voice over (AVO).
A natural solution to AVO is to condition the speech rendering on the temporal progression of lip sequence in the video.
We propose a novel text-to-speech model that is conditioned on visual input, named VisualTTS, for accurate lip-speech synchronization.
arXiv Detail & Related papers (2021-10-07T11:25:25Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- STYLER: Style Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech [2.622482339911829]
STYLER is a novel expressive text-to-speech model with a parallelized architecture.
Our novel approach of modeling noise directly from audio, using domain adversarial training and Residual Decoding, enables style transfer without transferring noise.
arXiv Detail & Related papers (2021-03-17T07:11:09Z)