GestSync: Determining who is speaking without a talking head
- URL: http://arxiv.org/abs/2310.05304v1
- Date: Sun, 8 Oct 2023 22:48:30 GMT
- Title: GestSync: Determining who is speaking without a talking head
- Authors: Sindhu B Hegde and Andrew Zisserman
- Abstract summary: We introduce Gesture-Sync: determining if a person's gestures are correlated with their speech or not.
In comparison to Lip-Sync, Gesture-Sync is far more challenging as there is a far looser relationship between the voice and body movement.
We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset.
- Score: 67.75387744442727
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper we introduce a new synchronisation task, Gesture-Sync:
determining if a person's gestures are correlated with their speech or not. In
comparison to Lip-Sync, Gesture-Sync is far more challenging as there is a far
looser relationship between the voice and body movement than there is between
voice and lip motion. We introduce a dual-encoder model for this task, and
compare a number of input representations including RGB frames, keypoint
images, and keypoint vectors, assessing their performance and advantages. We
show that the model can be trained using self-supervised learning alone, and
evaluate its performance on the LRS3 dataset. Finally, we demonstrate
applications of Gesture-Sync for audio-visual synchronisation, and in
determining who is the speaker in a crowd, without seeing their faces. The
code, datasets and pre-trained models can be found at:
https://www.robots.ox.ac.uk/~vgg/research/gestsync.
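The abstract does not spell out the architecture, so the following is only a minimal sketch of a dual-encoder set-up of the kind described: one encoder for a window of body keypoints (the "keypoint vector" representation), one for the matching audio, producing embeddings whose similarity indicates whether gesture and speech are in sync. All module names, layer sizes, keypoint counts and window lengths are illustrative assumptions, not the released GestSync model.

# Minimal sketch of a dual-encoder gesture/speech synchronisation model.
# NOT the released GestSync architecture; layer sizes, window lengths and the
# keypoint-vector input format are illustrative assumptions only.
import torch
import torch.nn as nn


class GestureEncoder(nn.Module):
    """Encodes a window of 2D body keypoints (T frames x K points x 2 coords)."""

    def __init__(self, num_keypoints: int = 17, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_keypoints * 2, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.temporal = nn.GRU(dim, dim, batch_first=True)

    def forward(self, kp):                  # kp: (B, T, K, 2)
        b, t, k, c = kp.shape
        x = self.mlp(kp.view(b, t, k * c))  # per-frame embedding, (B, T, dim)
        _, h = self.temporal(x)             # summarise the window over time
        return nn.functional.normalize(h[-1], dim=-1)      # (B, dim)


class AudioEncoder(nn.Module):
    """Encodes a log-mel spectrogram window (n_mels bins x T_mel frames)."""

    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, mel):                 # mel: (B, n_mels, T_mel)
        return nn.functional.normalize(self.conv(mel).squeeze(-1), dim=-1)


# In-sync gesture/speech windows should score higher than shifted ones.
video_enc, audio_enc = GestureEncoder(), AudioEncoder()
kp_window = torch.randn(4, 25, 17, 2)       # 4 clips, 25 frames, 17 keypoints
mel_window = torch.randn(4, 80, 100)        # matching ~1 s audio windows
score = (video_enc(kp_window) * audio_enc(mel_window)).sum(dim=-1)
print(score.shape)                          # (4,) cosine similarity per clip

In this kind of set-up, swapping the keypoint-vector encoder for a convolutional encoder over RGB frames or keypoint images would allow the comparison of input representations that the abstract mentions, without changing the audio branch.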
Related papers
- ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer [87.32518573172631]
ReSyncer fuses motion and appearance with unified training.
It supports fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping.
arXiv Detail & Related papers (2024-08-06T16:31:45Z) - SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization [29.53063463863921]
We present SyncVSR, an end-to-end learning framework that leverages quantized audio for frame-level crossmodal supervision.
By integrating a projection layer that synchronizes visual representation with acoustic data, our encoder learns to generate discrete audio tokens from a video sequence in a non-autoregressive manner.
Our empirical evaluations show that it not only achieves state-of-the-art results but also reduces data usage by up to ninefold.
arXiv Detail & Related papers (2024-06-18T03:14:22Z) - Synchformer: Efficient Synchronization from Sparse Cues [100.89656994681934]
Our contributions include a novel audio-visual synchronization model, and a training scheme that decouples feature extraction from synchronization modelling.
This approach achieves state-of-the-art performance in both dense and sparse settings.
We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability of synchronization models: audio-visual synchronizability.
arXiv Detail & Related papers (2024-01-29T18:59:55Z) - SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis [24.565073576385913]
A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses.
Traditional Generative Adversarial Networks (GANs) struggle to maintain consistent facial identity.
The proposed NeRF-based method effectively maintains subject identity, enhancing synchronization and realism in talking head synthesis.
arXiv Detail & Related papers (2023-11-29T12:35:34Z) - On the Audio-visual Synchronization for Lip-to-Speech Synthesis [22.407313748927393]
We show that the commonly used audio-visual datasets, such as GRID, TCD-TIMIT, and Lip2Wav, can have data asynchrony issues.
Training lip-to-speech models on such datasets can in turn make the model itself asynchronous -- that is, the generated speech and the input video are out of sync.
arXiv Detail & Related papers (2023-03-01T13:35:35Z) - Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors [103.21152156339484]
The objective of this paper is audio-visual synchronisation of general videos 'in the wild'.
We make four contributions: (i) in order to handle the longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors'.
We identify artefacts that can arise from the compression codecs used for audio and video and can be used by audio-visual models in training to artificially solve the synchronisation task.
arXiv Detail & Related papers (2022-10-13T14:25:37Z) - Rethinking Audio-visual Synchronization for Active Speaker Detection [62.95962896690992]
Existing research on active speaker detection (ASD) does not agree on the definition of active speakers.
We propose a cross-modal contrastive learning strategy and apply positional encoding in attention modules for supervised ASD models to leverage the synchronization cue (a minimal sketch of such a contrastive synchronisation objective appears after this list).
Experimental results suggest that our model can successfully detect unsynchronized speaking as not speaking, addressing the limitation of current models.
arXiv Detail & Related papers (2022-06-21T14:19:06Z) - VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices [4.167459103689587]
We address the problem of lip-voice synchronisation in videos containing a human face and voice.
Our approach is based on determining whether the lip motion and the voice in a video are synchronised or not.
We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models.
arXiv Detail & Related papers (2022-04-05T10:02:39Z) - Audio-Visual Synchronisation in the wild [149.84890978170174]
We identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync.
We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length.
We set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
arXiv Detail & Related papers (2021-12-08T17:50:26Z)
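Several of the entries above, like the main paper's self-supervised training, cast synchronisation as discriminating the temporally aligned audio-visual pair from shifted ones. The snippet below is a minimal, purely illustrative InfoNCE-style sketch of such an objective; it is not the exact loss used by any paper listed here, and the tensor shapes and offset convention are assumptions.

# Minimal sketch of a self-supervised synchronisation objective: the temporally
# aligned audio/visual embedding pair is the positive, and shifted audio windows
# are the negatives (an InfoNCE-style loss). Purely illustrative; not the exact
# loss of any paper listed above.
import torch
import torch.nn.functional as F


def sync_contrastive_loss(vis_emb, aud_embs, temperature=0.07):
    """vis_emb: (B, D) visual embeddings of a reference window.
    aud_embs: (B, N, D) audio embeddings at N temporal offsets;
    offset index 0 is assumed to be the in-sync (positive) window."""
    vis_emb = F.normalize(vis_emb, dim=-1)
    aud_embs = F.normalize(aud_embs, dim=-1)
    # Cosine similarity of each visual window to every candidate audio offset.
    logits = torch.einsum("bd,bnd->bn", vis_emb, aud_embs) / temperature
    targets = torch.zeros(vis_emb.size(0), dtype=torch.long)  # positive at 0
    return F.cross_entropy(logits, targets)


# Example: 8 clips, 5 candidate temporal offsets, 256-dim embeddings.
loss = sync_contrastive_loss(torch.randn(8, 256), torch.randn(8, 5, 256))
print(loss.item())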