VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices
- URL: http://arxiv.org/abs/2204.02090v1
- Date: Tue, 5 Apr 2022 10:02:39 GMT
- Title: VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices
- Authors: Venkatesh S. Kadandale, Juan F. Montesinos, Gloria Haro
- Abstract summary: We address the problem of lip-voice synchronisation in videos containing a human face and voice.
Our approach is based on determining whether the lip motion and the voice in a video are synchronised.
We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models.
- Score: 4.167459103689587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we address the problem of lip-voice synchronisation in videos containing a human face and voice. Our approach determines whether the lip motion and the voice in a video are synchronised based on their audio-visual correspondence score. We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models in the audio-visual synchronisation task on the standard lip-reading speech benchmark dataset LRS2. While existing methods focus mainly on lip synchronisation in speech videos, we also consider the special case of the singing voice, which is a more challenging use case for synchronisation due to sustained vowel sounds. We also investigate the relevance of lip synchronisation models trained on speech datasets in the context of the singing voice. Finally, we use the frozen visual features learned by our lip synchronisation model in the singing voice separation task to outperform a baseline audio-visual model that was trained end-to-end. The demos, source code, and the pre-trained model will be made available at https://ipcv.github.io/VocaLiST/.
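The abstract describes the synchronisation model only at a high level. As a rough illustration of the general idea rather than the authors' implementation, the PyTorch sketch below cross-attends audio frame embeddings to lip-region embeddings and pools the fused sequence into a single correspondence score; the module name, dimensions, layer count, and pooling head are hypothetical choices, not the VocaLiST architecture.

```python
# Illustrative sketch only: a minimal cross-modal transformer block that scores
# audio-visual correspondence. All hyperparameters are placeholder assumptions.
import torch
import torch.nn as nn

class CrossModalSyncScorer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Cross-attention: audio tokens query the visual (lip) tokens.
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.score_head = nn.Linear(d_model, 1)  # single correspondence logit

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (B, Ta, d_model), e.g. per-frame audio embeddings
        # visual_feats: (B, Tv, d_model), e.g. lip-region embeddings per video frame
        x = audio_feats
        for attn, norm in zip(self.layers, self.norms):
            fused, _ = attn(query=x, key=visual_feats, value=visual_feats)
            x = norm(x + fused)  # residual connection around cross-attention
        # Pool over time and map to a single synchronisation score.
        return self.score_head(x.mean(dim=1)).squeeze(-1)

# A synchronised (positive) clip should score higher than a temporally
# shifted (negative) pairing of the same audio and video.
scorer = CrossModalSyncScorer()
audio = torch.randn(2, 80, 512)   # 2 clips, 80 audio frames
video = torch.randn(2, 20, 512)   # 2 clips, 20 video frames
logits = scorer(audio, video)     # shape (2,)
```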
Related papers
- Style-Preserving Lip Sync via Audio-Aware Style Reference [88.02195932723744]
Individuals exhibit distinct lip shapes when speaking the same utterance, owing to their unique speaking styles.
We develop an advanced Transformer-based model adept at predicting lip motion corresponding to the input audio, augmented by style information aggregated through cross-attention layers from a style reference video.
Experiments validate the efficacy of the proposed approach in achieving precise lip sync, preserving speaking styles, and generating high-fidelity, realistic talking face videos.
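As a hypothetical sketch of the style-aggregation idea described above (not the paper's model), audio-derived queries can attend to style tokens extracted from a reference video before decoding lip motion; all names, shapes, and the GRU decoder below are placeholder assumptions.

```python
# Hypothetical sketch: audio-aware style aggregation via cross-attention.
# Lip-motion queries derived from audio attend to style tokens from a
# reference video. Not the paper's architecture; shapes are placeholders.
import torch
import torch.nn as nn

class StyleAwareLipPredictor(nn.Module):
    def __init__(self, d_model=256, n_heads=4, lip_dim=40):
        super().__init__()
        self.style_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.to_lip = nn.Linear(d_model, lip_dim)  # e.g. lip landmark offsets

    def forward(self, audio_feats, style_tokens):
        # audio_feats:  (B, T, d_model)  per-frame audio embeddings
        # style_tokens: (B, S, d_model)  embeddings from a style reference video
        styled, _ = self.style_attn(audio_feats, style_tokens, style_tokens)
        hidden, _ = self.decoder(audio_feats + styled)
        return self.to_lip(hidden)  # (B, T, lip_dim) predicted lip motion
```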
arXiv Detail & Related papers (2024-08-10T02:46:11Z)
- ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer [87.32518573172631]
ReSyncer fuses motion and appearance with unified training.
It supports fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping.
arXiv Detail & Related papers (2024-08-06T16:31:45Z)
- GestSync: Determining who is speaking without a talking head [67.75387744442727]
We introduce Gesture-Sync: determining whether a person's gestures are correlated with their speech.
In comparison to Lip-Sync, Gesture-Sync is far more challenging as there is a far looser relationship between the voice and body movement.
We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset.
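One common way to realise such self-supervised training, shown below purely as an assumed illustration rather than GestSync's actual recipe, is to score in-sync audio-visual pairs against temporally shifted audio from the same clip with a contrastive loss.

```python
# Illustrative self-supervised objective: in-sync (positive) pairs come from the
# same clip, out-of-sync (negative) pairs from a temporal shift of the audio.
# The embeddings and loss form are assumptions, not the paper's actual setup.
import torch
import torch.nn.functional as F

def sync_contrastive_loss(video_emb, audio_emb, shifted_audio_emb, temperature=0.07):
    # video_emb, audio_emb, shifted_audio_emb: (B, D) clip-level embeddings
    v = F.normalize(video_emb, dim=-1)
    a_pos = F.normalize(audio_emb, dim=-1)
    a_neg = F.normalize(shifted_audio_emb, dim=-1)
    pos = (v * a_pos).sum(-1) / temperature   # similarity to in-sync audio
    neg = (v * a_neg).sum(-1) / temperature   # similarity to shifted audio
    logits = torch.stack([pos, neg], dim=1)   # (B, 2)
    labels = torch.zeros(v.size(0), dtype=torch.long, device=v.device)  # 0 = in-sync
    return F.cross_entropy(logits, labels)
```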
arXiv Detail & Related papers (2023-10-08T22:48:30Z)
- Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos [54.08224321456871]
The system combines multiple component models to produce a video of the original speaker speaking in the target language.
The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model.
The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model.
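The stages described above can be pictured as a chain of independently trained components. The skeleton below only illustrates that staging; every callable and signature is a placeholder, not the paper's code or any real library API.

```python
# Placeholder pipeline skeleton for the described dubbing workflow
# (ASR with emphasis detection -> translation -> TTS -> voice conversion -> lip sync).
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class DubbingPipeline:
    # Each stage is an injected callable; the signatures are illustrative only.
    recognise: Callable[[bytes], Tuple[str, List[int]]]  # audio -> (text, emphasis marks)
    translate: Callable[[str, str], str]                 # (text, target_lang) -> text
    synthesise: Callable[[str, List[int]], bytes]        # (text, emphasis) -> audio
    convert_voice: Callable[[bytes, bytes], bytes]       # (synthetic, reference) -> audio
    lip_sync: Callable[[bytes, bytes], bytes]            # (video, audio) -> video

    def run(self, video: bytes, source_audio: bytes, target_lang: str) -> bytes:
        text, emphasis = self.recognise(source_audio)        # ASR + emphasis detection
        translated = self.translate(text, target_lang)       # machine translation
        synthetic = self.synthesise(translated, emphasis)    # TTS in target language
        dubbed = self.convert_voice(synthetic, source_audio) # match original speaker's voice
        return self.lip_sync(video, dubbed)                  # re-render lips to new audio
```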
arXiv Detail & Related papers (2022-06-09T14:15:37Z)
- Karaoker: Alignment-free singing voice synthesis with speech training data [3.9795908407245055]
Karaoker is a multispeaker Tacotron-based model conditioned on voice characteristic features.
The model is jointly conditioned with a single deep convolutional encoder on continuous data.
We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks.
arXiv Detail & Related papers (2022-04-08T15:33:59Z)
- VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer [4.167459103689587]
This paper presents an audio-visual approach for voice separation.
It outperforms state-of-the-art methods at a low latency in two scenarios: speech and singing voice.
arXiv Detail & Related papers (2022-03-08T14:08:47Z)
- A Melody-Unsupervision Model for Singing Voice Synthesis [9.137554315375919]
We propose a melody-unsupervision model that requires only audio-and-lyrics pairs, without temporal alignment, at training time.
We show that the proposed model can be trained with speech audio and text labels yet generate singing voice at inference time.
arXiv Detail & Related papers (2021-10-13T07:42:35Z)
- VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over [68.22776506861872]
We formulate a novel task to synthesize speech in sync with a silent pre-recorded video, denoted as automatic voice over (AVO).
A natural solution to AVO is to condition the speech rendering on the temporal progression of lip sequence in the video.
We propose a novel text-to-speech model that is conditioned on visual input, named VisualTTS, for accurate lip-speech synchronization.
arXiv Detail & Related papers (2021-10-07T11:25:25Z)
- A cappella: Audio-visual Singing Voice Separation [4.6453787256723365]
We explore the single-channel singing voice separation problem from a multimodal perspective.
We present Acappella, a dataset spanning around 46 hours of a cappella solo singing videos sourced from YouTube.
We propose Y-Net, an audio-visual convolutional neural network which achieves state-of-the-art singing voice separation results.
arXiv Detail & Related papers (2021-04-20T13:17:06Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)