DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided
Speaker Embedding
- URL: http://arxiv.org/abs/2308.07787v1
- Date: Tue, 15 Aug 2023 14:07:41 GMT
- Title: DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided
Speaker Embedding
- Authors: Jeongsoo Choi, Joanna Hong, Yong Man Ro
- Abstract summary: We present a vision-guided speaker embedding extractor using a self-supervised pre-trained model and prompt tuning technique.
We further develop a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on those speaker embeddings and the visual representation extracted from the input video.
Our experimental results show that DiffV2S achieves state-of-the-art performance compared to previous video-to-speech synthesis techniques.
- Score: 52.84475402151201
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has demonstrated impressive results in video-to-speech
synthesis, which reconstructs speech solely from visual input. However, previous
works have struggled to accurately synthesize speech because the model lacks
sufficient guidance to infer the correct content with the appropriate sound. To
resolve this issue, they adopt an extra speaker embedding as speaking-style
guidance derived from reference audio. Nevertheless, the corresponding audio is
not always available for a given video input, especially at inference time. In
this paper, we present a novel vision-guided speaker embedding extractor using a
self-supervised pre-trained model and a prompt tuning technique. In doing so,
rich speaker embedding information can be produced solely from the input visual
information, and no extra audio information is required at inference time. Using
the extracted vision-guided speaker embedding representations, we further develop
a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on
those speaker embeddings and the visual representation extracted from the input
video. The proposed DiffV2S not only maintains the phoneme details contained in
the input video frames, but also creates highly intelligible mel-spectrograms in
which the identities of multiple speakers are preserved. Our experimental results
show that DiffV2S achieves state-of-the-art performance compared to previous
video-to-speech synthesis techniques.
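No code accompanies this listing, so the following is a minimal, hypothetical PyTorch sketch of the pipeline the abstract describes: a vision-guided extractor produces a speaker embedding from visual features alone, and a diffusion model denoises a mel-spectrogram conditioned on that embedding plus the per-frame visual representation. Every module, dimension, and the simplified noise schedule below are assumptions made for illustration, not the authors' implementation (which uses a self-supervised pre-trained model with prompt tuning for the extractor).

```python
# Hypothetical sketch of the DiffV2S conditioning path (not the authors' code).
# A video encoder is assumed to produce per-frame features; a vision-guided
# extractor maps them to a speaker embedding, and a toy diffusion denoiser
# predicts noise on a mel-spectrogram conditioned on both signals.
import torch
import torch.nn as nn


class VisionGuidedSpeakerExtractor(nn.Module):
    """Maps visual features to a speaker embedding (stand-in for the
    prompt-tuned self-supervised extractor described in the abstract)."""
    def __init__(self, vis_dim=512, spk_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vis_dim, spk_dim), nn.ReLU(),
                                  nn.Linear(spk_dim, spk_dim))

    def forward(self, vis_feats):          # (B, T, vis_dim)
        pooled = vis_feats.mean(dim=1)     # temporal average pooling (assumed)
        return self.proj(pooled)           # (B, spk_dim)


class MelDenoiser(nn.Module):
    """Toy denoiser: predicts the noise added to a mel-spectrogram,
    conditioned on visual features and the speaker embedding."""
    def __init__(self, n_mels=80, vis_dim=512, spk_dim=256, hidden=512):
        super().__init__()
        self.cond = nn.Linear(vis_dim + spk_dim, hidden)
        self.net = nn.GRU(n_mels + hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, noisy_mel, vis_feats, spk_emb, t):
        # Broadcast the speaker embedding over time, fuse with visual features.
        # (Timestep embedding for t is omitted in this toy sketch.)
        spk = spk_emb.unsqueeze(1).expand(-1, vis_feats.size(1), -1)
        cond = self.cond(torch.cat([vis_feats, spk], dim=-1))
        h, _ = self.net(torch.cat([noisy_mel, cond], dim=-1))
        return self.out(h)                 # predicted noise, same shape as mel


# One DDPM-style training step (heavily simplified):
B, T, n_mels, vis_dim = 2, 100, 80, 512
vis_feats = torch.randn(B, T, vis_dim)     # from a video/lip encoder (assumed)
mel = torch.randn(B, T, n_mels)            # ground-truth mel-spectrogram

extractor, denoiser = VisionGuidedSpeakerExtractor(), MelDenoiser()
spk_emb = extractor(vis_feats)             # no reference audio needed
t = torch.randint(0, 1000, (B,))           # diffusion timestep
noise = torch.randn_like(mel)
alpha_bar = 0.5                            # stand-in for the noise schedule
noisy_mel = alpha_bar ** 0.5 * mel + (1 - alpha_bar) ** 0.5 * noise
loss = nn.functional.mse_loss(denoiser(noisy_mel, vis_feats, spk_emb, t), noise)
loss.backward()
```

At inference, the same extractor runs on the silent video, so no reference audio is needed to set the speaking style, which matches the motivation stated in the abstract.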
Related papers
- Audio-visual video-to-speech synthesis with synthesized input audio [64.86087257004883]
We investigate the effect of using video and audio inputs for video-to-speech synthesis during both training and inference.
In particular, we use pre-trained video-to-speech models to synthesize the missing speech signals and then train an audio-visual-to-speech synthesis model, using both the silent video and the synthesized speech as inputs, to predict the final reconstructed speech.
arXiv Detail & Related papers (2023-07-31T11:39:05Z) - Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - Speech inpainting: Context-based speech synthesis guided by video [29.233167442719676]
This paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing the speech in a corrupted audio segment.
We present an audio-visual transformer-based deep learning model that leverages visual cues that provide information about the content of the corrupted audio.
We also show how visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.
arXiv Detail & Related papers (2023-06-01T09:40:47Z) - SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder (a minimal sketch of this two-stage layout is given after this list).
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z) - Vocoder-Based Speech Synthesis from Silent Videos [28.94460283719776]
We present a way to synthesise speech from the silent video of a talker using deep learning.
The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm.
arXiv Detail & Related papers (2020-04-06T10:22:04Z) - Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially-infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)