Beyond Audio and Pose: A General-Purpose Framework for Video Synchronization
- URL: http://arxiv.org/abs/2506.15937v1
- Date: Thu, 19 Jun 2025 00:41:21 GMT
- Title: Beyond Audio and Pose: A General-Purpose Framework for Video Synchronization
- Authors: Yosub Shin, Igor Molybog
- Abstract summary: Video synchronization is crucial for applications such as reality TV show production, sports analysis, surveillance, and autonomous systems. Prior work has heavily relied on audio cues or specific visual events, limiting applicability in diverse settings. We introduce VideoSync, a video synchronization framework that operates independently of specific feature extraction methods.
- Score: 1.7820202405704466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video synchronization, aligning multiple video streams capturing the same event from different angles, is crucial for applications such as reality TV show production, sports analysis, surveillance, and autonomous systems. Prior work has heavily relied on audio cues or specific visual events, limiting applicability in diverse settings where such signals may be unreliable or absent. Additionally, existing benchmarks for video synchronization lack generality and reproducibility, restricting progress in the field. In this work, we introduce VideoSync, a video synchronization framework that operates independently of specific feature extraction methods, such as human pose estimation, enabling broader applicability across different content types. We evaluate our system on newly composed datasets covering single-human, multi-human, and non-human scenarios, providing both the methodology and code for dataset creation to establish reproducible benchmarks. Our analysis reveals biases in prior SOTA work, particularly in SeSyn-Net's preprocessing pipeline, leading to inflated performance claims. We correct these biases and propose a more rigorous evaluation framework, demonstrating that VideoSync outperforms existing approaches, including SeSyn-Net, under fair experimental conditions. Additionally, we explore various synchronization offset prediction methods, identifying a convolutional neural network (CNN)-based model as the most effective. Our findings advance video synchronization beyond domain-specific constraints, making it more generalizable and robust for real-world applications.
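To make the CNN-based offset prediction concrete, below is a minimal sketch (not the authors' released code): per-frame embeddings from two unsynchronized videos, obtained from any feature extractor, are compared in a frame-to-frame similarity matrix, and a small CNN classifies the temporal offset from that matrix. The extractor, tensor sizes, offset range, and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_OFFSET = 30                      # hypothetical search range in frames
NUM_CLASSES = 2 * MAX_OFFSET + 1     # offsets in [-30, +30]


def similarity_matrix(feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every frame pair of two feature sequences (T, D)."""
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    return a @ b.T                   # (T_a, T_b)


class OffsetCNN(nn.Module):
    """Small CNN that classifies the temporal offset from a (1, 1, T, T) similarity map."""

    def __init__(self, num_classes: int = NUM_CLASSES):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.head = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, sim: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(sim).flatten(1))


if __name__ == "__main__":
    # Placeholder per-frame embeddings; in the framework these could come from
    # any extractor (pose keypoints, generic CNN features, CLIP, ...).
    feats_a, feats_b = torch.randn(64, 512), torch.randn(64, 512)
    sim = similarity_matrix(feats_a, feats_b)[None, None]   # (1, 1, T, T)
    logits = OffsetCNN()(sim)
    offset = logits.argmax(dim=-1).item() - MAX_OFFSET      # class index -> signed offset
    print(f"predicted offset: {offset} frames")
```

A model of this kind would be trained with a standard cross-entropy loss on video pairs whose ground-truth offsets are known, e.g. drawn from composed benchmark datasets such as those described above.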
Related papers
- SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction [17.3162098606835]
We propose a multi-modal framework for Synchronous Video Prediction (SyncVP). SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module. Notably, SyncVP achieves state-of-the-art performance, even in scenarios where only one modality is present.
arXiv Detail & Related papers (2025-03-24T17:53:44Z) - UniSync: A Unified Framework for Audio-Visual Synchronization [7.120340851879775]
We present UniSync, a novel approach for evaluating audio-visual synchronization using embedding similarities. We enhance the contrastive learning framework with a margin-based loss component and cross-speaker unsynchronized pairs. UniSync outperforms existing methods on standard datasets.
arXiv Detail & Related papers (2025-03-20T17:16:03Z) - SyncVIS: Synchronized Video Instance Segmentation [48.75470418596875]
We propose to conduct synchronized modeling via a new framework named SyncVIS. SyncVIS explicitly introduces video-level query embeddings and designs two key modules to synchronize video-level query with frame-level query embeddings. The proposed approach achieves state-of-the-art results, demonstrating its effectiveness and generality.
arXiv Detail & Related papers (2024-12-01T16:43:20Z) - Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets [62.280729345770936]
We introduce the task of Alignable Video Retrieval (AVR).
Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query.
Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-09-02T20:00:49Z) - SyncTweedies: A General Generative Framework Based on Synchronized Diffusions [11.292617528150291]
We present an exhaustive investigation into all possible scenarios for synchronizing multiple diffusion processes through a canonical space.
We reveal a previously unexplored case: averaging the outputs of Tweedie's formula while conducting denoising in multiple instance spaces.
In our experiments generating the aforementioned visual content, we demonstrate the superior generation quality of SyncTweedies compared to other synchronization methods.
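As a rough illustration of the averaging step described above (notation is ours, not the paper's): each instance space is denoised independently, Tweedie's formula yields a clean-sample estimate, and the estimates are mapped into the canonical space and averaged there.

```latex
% Hypothetical notation: x_t^{(i)} is the noisy sample in instance space i,
% \epsilon_\theta the denoiser, \bar{\alpha}_t the cumulative noise schedule,
% \pi_i^{-1} the mapping from instance space i back to the canonical space.
\hat{x}_0^{(i)} = \frac{x_t^{(i)} - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta\bigl(x_t^{(i)}, t\bigr)}{\sqrt{\bar{\alpha}_t}},
\qquad
\hat{z}_0 = \frac{1}{N}\sum_{i=1}^{N} \pi_i^{-1}\bigl(\hat{x}_0^{(i)}\bigr)
```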
arXiv Detail & Related papers (2024-03-21T12:57:30Z) - Synchformer: Efficient Synchronization from Sparse Cues [100.89656994681934]
Our contributions include a novel audio-visual synchronization model, and a training scheme that decouples feature extraction from synchronization modelling.
This approach achieves state-of-the-art performance in both dense and sparse settings.
We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
arXiv Detail & Related papers (2024-01-29T18:59:55Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors [103.21152156339484]
The objective of this paper is audio-visual synchronisation of general videos 'in the wild'.
We make four contributions: (i) in order to handle longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors'.
We identify artefacts that can arise from the compression codecs used for audio and video, which audio-visual models can exploit during training to artificially solve the synchronisation task.
arXiv Detail & Related papers (2022-10-13T14:25:37Z) - Single-Frame based Deep View Synchronization for Unsynchronized Multi-Camera Surveillance [56.964614522968226]
Multi-camera surveillance has been an active research topic for understanding and modeling scenes.
It is usually assumed that the cameras are all temporally synchronized when designing models for these multi-camera based tasks.
Our view synchronization models are applied to different DNN-based multi-camera vision tasks under the unsynchronized setting.
arXiv Detail & Related papers (2020-07-08T04:39:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.