Synchformer: Efficient Synchronization from Sparse Cues
- URL: http://arxiv.org/abs/2401.16423v1
- Date: Mon, 29 Jan 2024 18:59:55 GMT
- Title: Synchformer: Efficient Synchronization from Sparse Cues
- Authors: Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman
- Abstract summary: Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling.
This approach achieves state-of-the-art performance in both dense and sparse settings.
We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
- Score: 100.89656994681934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Our objective is audio-visual synchronization with a focus on 'in-the-wild'
videos, such as those on YouTube, where synchronization cues can be sparse. Our
contributions include a novel audio-visual synchronization model, and training
that decouples feature extraction from synchronization modelling through
multi-modal segment-level contrastive pre-training. This approach achieves
state-of-the-art performance in both dense and sparse settings. We also extend
synchronization model training to AudioSet, a million-scale 'in-the-wild'
dataset, investigate evidence attribution techniques for interpretability, and
explore a new capability for synchronization models: audio-visual
synchronizability.
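The abstract describes decoupling feature extraction from synchronization modelling via multi-modal segment-level contrastive pre-training. Below is a minimal, hypothetical PyTorch sketch of that idea only: audio and visual embeddings of the same temporal segment are pulled together with a symmetric InfoNCE loss, while mismatched segments act as negatives, so that a separate synchronization head can later be trained on top of the extractors. The module names, dimensions, and temperature are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code) of segment-level audio-visual
# contrastive pre-training. All names and hyper-parameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegmentContrastive(nn.Module):
    def __init__(self, audio_dim=512, visual_dim=512, proj_dim=256, temperature=0.07):
        super().__init__()
        # Hypothetical projection heads; in practice per-segment embeddings
        # would come from pretrained audio/visual backbones.
        self.audio_proj = nn.Linear(audio_dim, proj_dim)
        self.visual_proj = nn.Linear(visual_dim, proj_dim)
        self.temperature = temperature

    def forward(self, audio_feats, visual_feats):
        # audio_feats, visual_feats: (batch * num_segments, feat_dim),
        # where row i of both tensors comes from the same temporal segment.
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        v = F.normalize(self.visual_proj(visual_feats), dim=-1)
        logits = a @ v.t() / self.temperature  # segment-to-segment similarity
        targets = torch.arange(a.size(0), device=a.device)
        # Symmetric InfoNCE: matching segments are positives, all others negatives.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Stand-in features for 4 clips x 8 segments each.
    feats_a = torch.randn(32, 512)
    feats_v = torch.randn(32, 512)
    print(SegmentContrastive()(feats_a, feats_v).item())
```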
Related papers
- ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer [87.32518573172631]
ReSyncer fuses motion and appearance with unified training.
It supports fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping.
arXiv Detail & Related papers (2024-08-06T16:31:45Z) - GestSync: Determining who is speaking without a talking head [67.75387744442727]
We introduce Gesture-Sync: determining if a person's gestures are correlated with their speech or not.
In comparison to Lip-Sync, Gesture-Sync is far more challenging as there is a much looser relationship between the voice and body movement.
We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset.
arXiv Detail & Related papers (2023-10-08T22:48:30Z) - On the Audio-visual Synchronization for Lip-to-Speech Synthesis [22.407313748927393]
We show that the commonly used audio-visual datasets, such as GRID, TCD-TIMIT, and Lip2Wav, can have data asynchrony issues.
Training lip-to-speech with such datasets may further cause the model asynchrony issue -- that is, the generated speech and the input video are out of sync.
arXiv Detail & Related papers (2023-03-01T13:35:35Z) - Sparse in Space and Time: Audio-visual Synchronisation with Trainable
Selectors [103.21152156339484]
The objective of this paper is audio-visual synchronisation of general videos 'in the wild'.
We make four contributions: (i) in order to handle longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors'.
We identify artefacts that can arise from the audio and video compression codecs and that audio-visual models can exploit during training to artificially solve the synchronisation task.
arXiv Detail & Related papers (2022-10-13T14:25:37Z) - Audio-Visual Synchronisation in the wild [149.84890978170174]
We identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync.
We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length.
We set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
arXiv Detail & Related papers (2021-12-08T17:50:26Z) - Self-Supervised Audio-Visual Representation Learning with Relaxed
Cross-Modal Temporal Synchronicity [12.995632804090198]
CrissCross is a self-supervised framework for learning audio-visual representations.
We show that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong time-invariant representations.
arXiv Detail & Related papers (2021-11-09T20:24:19Z) - FoleyGAN: Visually Guided Generative Adversarial Network-Based
Synchronous Sound Generation in Silent Videos [0.0]
We introduce a novel task of guiding a class-conditioned generative adversarial network with the temporal visual information of a video input for visual-to-sound generation.
Our proposed FoleyGAN model conditions on action sequences of visual events to generate visually aligned, realistic soundtracks.
arXiv Detail & Related papers (2021-07-20T04:59:26Z) - Single-Frame based Deep View Synchronization for Unsynchronized
Multi-Camera Surveillance [56.964614522968226]
Multi-camera surveillance has been an active research topic for understanding and modeling scenes.
It is usually assumed that the cameras are all temporally synchronized when designing models for these multi-camera based tasks.
Our view synchronization models are applied to different DNN-based multi-camera vision tasks under the unsynchronized setting.
arXiv Detail & Related papers (2020-07-08T04:39:38Z)