Visual Sync: Multi-Camera Synchronization via Cross-View Object Motion
- URL: http://arxiv.org/abs/2512.02017v1
- Date: Mon, 01 Dec 2025 18:59:57 GMT
- Title: Visual Sync: Multi-Camera Synchronization via Cross-View Object Motion
- Authors: Shaowei Liu, David Yifan Yao, Saurabh Gupta, Shenlong Wang
- Abstract summary: We present VisualSync, an optimization framework that aligns unposed, unsynchronized videos at millisecond accuracy. Our key insight is that any moving 3D point, when co-visible in two cameras, obeys epipolar constraints once properly synchronized. VisualSync exploits off-the-shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross-view correspondences.
- Score: 30.873271334433024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Today, people can easily record memorable moments, from concerts, sports events, and lectures to family gatherings and birthday parties, with multiple consumer cameras. However, synchronizing these cross-camera streams remains challenging. Existing methods assume controlled settings, specific targets, manual correction, or costly hardware. We present VisualSync, an optimization framework based on multi-view dynamics that aligns unposed, unsynchronized videos at millisecond accuracy. Our key insight is that any moving 3D point, when co-visible in two cameras, obeys epipolar constraints once properly synchronized. To exploit this, VisualSync leverages off-the-shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross-view correspondences. It then jointly minimizes the epipolar error to estimate each camera's time offset. Experiments on four diverse, challenging datasets show that VisualSync outperforms baseline methods, achieving a median synchronization error below 50 ms.
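The core constraint is compact enough to sketch: if x1(t) is a tracked point in camera 1 and x2(t) the corresponding point in camera 2, related by the fundamental matrix F, then at the correct offset delta the pair satisfies x2(t + delta)^T F x1(t) = 0 at every instant t. Below is a minimal single-tracklet, two-camera sketch of that search; the tracklet format, linear interpolation, and grid search are simplifying assumptions, not the paper's implementation.

```python
import numpy as np

def sampson_error(F, x1, x2):
    """First-order (Sampson) approximation of the epipolar error
    for homogeneous point arrays of shape (3, N)."""
    Fx1 = F @ x1             # epipolar lines in view 2
    Ftx2 = F.T @ x2          # epipolar lines in view 1
    num = np.sum(x2 * Fx1, axis=0) ** 2
    den = Fx1[0]**2 + Fx1[1]**2 + Ftx2[0]**2 + Ftx2[1]**2
    return num / den

def track_at(times, xy, query_t):
    """Linearly interpolate a 2D tracklet at (fractional) query times.
    Note: np.interp clamps outside the track's time span; a real
    system would mask those samples instead."""
    x = np.interp(query_t, times, xy[:, 0])
    y = np.interp(query_t, times, xy[:, 1])
    return np.stack([x, y, np.ones_like(x)])   # homogeneous, (3, N)

def estimate_offset(F, t1, xy1, t2, xy2, search_ms=2000.0, step_ms=5.0):
    """Grid-search the clock offset of camera 2 (in ms) that minimizes
    the median epipolar error of one co-visible tracklet."""
    x1 = np.concatenate([xy1.T, np.ones((1, len(t1)))])  # view-1 points
    best_err, best_delta = np.inf, 0.0
    for delta in np.arange(-search_ms, search_ms + step_ms, step_ms):
        x2 = track_at(t2, xy2, t1 + delta)    # view 2 at shifted instants
        err = np.median(sampson_error(F, x1, x2))
        if err < best_err:
            best_err, best_delta = err, float(delta)
    return best_delta
```

VisualSync itself jointly minimizes this error over all cameras and many tracklets at once; the sketch above is only the pairwise, single-tracklet skeleton of that objective.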
Related papers
- SyncTrack4D: Cross-Video Motion Alignment and Video Synchronization for Multi-Video 4D Gaussian Splatting [50.69165364520998]
We present a novel multi-video 4D Gaussian Splatting (4DGS) approach designed to handle real-world, unsynchronized video sets. Our approach, SyncTrack4D, directly leverages a dense 4D track representation of dynamic scene parts as cues for simultaneous cross-video synchronization and 4DGS reconstruction. We evaluate our approach on the Panoptic Studio and SyncNeRF Blender datasets, demonstrating sub-frame synchronization accuracy with an average temporal error below 0.26 frames and high-fidelity 4D reconstruction reaching a PSNR of 26.3.
arXiv Detail & Related papers (2025-12-03T23:05:01Z)
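SyncTrack4D's motion-alignment idea can be approximated crudely: reduce each video's dense tracks to a per-frame motion signal and find the lag that best correlates the two signals. The fingerprint and correlation search below are illustrative assumptions, not the paper's actual 4D-track matching.

```python
import numpy as np

def motion_signal(tracks):
    """tracks: (T, N, 2) array of N 2D track positions over T frames.
    Returns per-frame median motion magnitude, a crude 1D fingerprint
    of the scene's dynamics."""
    speeds = np.linalg.norm(np.diff(tracks, axis=0), axis=-1)  # (T-1, N)
    return np.median(speeds, axis=1)

def align_by_correlation(sig_a, sig_b, max_lag):
    """Return the frame lag of sig_b relative to sig_a that maximizes
    normalized correlation over the overlapping portion."""
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        a = sig_a[lag:] if lag >= 0 else sig_a
        b = sig_b if lag >= 0 else sig_b[-lag:]
        n = min(len(a), len(b))
        if n < 2:
            continue
        score = np.corrcoef(a[:n], b[:n])[0, 1]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Correlating 1D motion signals only yields frame-level alignment; the paper's dense 4D tracks are what make its sub-frame accuracy (below 0.26 frames) possible.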
- RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems [38.099313678683224]
We present a low-cost, general-purpose synchronization method that achieves millisecond-level temporal alignment across diverse camera systems. The proposed solution employs a custom-built LED Clock that encodes time through red and infrared LEDs, allowing visual decoding of the exposure window. We validate the system in large-scale surgical recordings involving over 25 heterogeneous cameras spanning both IR and RGB modalities.
arXiv Detail & Related papers (2025-11-18T22:13:06Z)
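The decoding step RocSync's summary alludes to, reading the clock's LED states out of a frame to timestamp its exposure window, can be illustrated with a toy binary layout. The bit order, threshold, and tick period below are invented for the example; the real red/infrared encoding is more involved.

```python
import numpy as np

def decode_led_timestamp(led_intensities, tick_ms=1.0, threshold=0.5):
    """Toy decoder: threshold a row of LED brightnesses into bits
    (MSB first) and read them as a tick counter in milliseconds."""
    bits = (np.asarray(led_intensities) > threshold).astype(int)
    ticks = int("".join(str(b) for b in bits.tolist()), 2)
    return ticks * tick_ms

# Example: 8 LEDs reading 0b00101101 -> 45 ticks -> 45.0 ms
print(decode_led_timestamp([0.1, 0.2, 0.9, 0.1, 0.8, 0.9, 0.2, 0.7]))
```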
- Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers [19.226787997122987]
We present Syncphony, which generates 380x640 resolution, 24fps videos synchronized with diverse audio inputs. Our approach builds upon a pre-trained video backbone and incorporates two key components to improve synchronization. Experiments on AVSync15 and The Greatest Hits datasets demonstrate that Syncphony outperforms existing methods in both synchronization accuracy and visual quality.
arXiv Detail & Related papers (2025-09-26T05:30:06Z)
- SyncTalk++: High-Fidelity and Efficient Synchronized Talking Heads Synthesis Using Gaussian Splatting [25.523486023087916]
A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. We introduce SyncTalk++ to address the critical issue of synchronization, identified as the 'devil' in creating realistic talking heads. Our approach maintains consistency and continuity in visual details across frames and significantly improves rendering speed and quality, achieving up to 101 frames per second.
arXiv Detail & Related papers (2025-06-17T17:22:12Z)
- CoMotion: Concurrent Multi-person 3D Motion [88.27833466761234]
We introduce an approach for detecting and tracking detailed 3D poses of multiple people from a single monocular camera stream. Our model performs both strong per-frame detection and a learned pose update to track people from frame to frame. We train on numerous image and video datasets leveraging pseudo-labeled annotations to produce a model that matches state-of-the-art systems in 3D pose estimation accuracy.
arXiv Detail & Related papers (2025-04-16T15:40:15Z)
- Synchformer: Efficient Synchronization from Sparse Cues [100.89656994681934]
Our contributions include a novel audio-visual synchronization model and a training scheme that decouples feature extraction from synchronization modelling.
This approach achieves state-of-the-art performance in both dense and sparse settings.
We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
arXiv Detail & Related papers (2024-01-29T18:59:55Z)
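The decoupling Synchformer's summary mentions can be pictured as a two-stage recipe: pre-train audio and visual feature extractors, then freeze them and train a compact head that classifies the temporal offset from their token sequences. A PyTorch sketch under assumed dimensions and layer counts; the discrete-offset classification mirrors this line of work, but the module below is illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SyncHead(nn.Module):
    """Lightweight offset classifier on top of frozen audio/visual
    feature extractors, illustrating feature extraction decoupled
    from synchronization modelling."""
    def __init__(self, dim=512, num_offsets=21):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # summary token
        self.head = nn.Linear(dim, num_offsets)          # one class per offset

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, Ta, dim), visual_feats: (B, Tv, dim),
        # both produced by frozen, separately pre-trained extractors
        cls = self.cls.expand(audio_feats.size(0), -1, -1)
        x = torch.cat([cls, audio_feats, visual_feats], dim=1)
        return self.head(self.encoder(x)[:, 0])         # offset logits
```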
- Sync-NeRF: Generalizing Dynamic NeRFs to Unsynchronized Videos [9.90835990611019]
We introduce time offsets for individual unsynchronized videos and jointly optimize the offsets with NeRF.
Finding the offsets naturally synchronizes the videos without manual effort.
arXiv Detail & Related papers (2023-10-20T08:45:30Z)
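Sync-NeRF's mechanism is small enough to state in code: give every camera a learnable scalar offset and add it to the timestamps fed to the dynamic NeRF, so the offsets receive gradients from the ordinary reconstruction loss. A minimal PyTorch sketch, with the radiance field stubbed as a generic module:

```python
import torch
import torch.nn as nn

class OffsetWrappedField(nn.Module):
    """Wraps a dynamic radiance field with one learnable time offset
    per camera, optimized jointly with the field."""
    def __init__(self, field: nn.Module, num_cameras: int):
        super().__init__()
        self.field = field
        self.offsets = nn.Parameter(torch.zeros(num_cameras))  # seconds

    def forward(self, xyz, t, cam_id):
        t_sync = t + self.offsets[cam_id]  # per-camera corrected time
        return self.field(xyz, t_sync)
```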
- GestSync: Determining who is speaking without a talking head [67.75387744442727]
We introduce Gesture-Sync: determining if a person's gestures are correlated with their speech or not.
In comparison to Lip-Sync, Gesture-Sync is far more challenging as there is a far looser relationship between the voice and body movement.
We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset.
arXiv Detail & Related papers (2023-10-08T22:48:30Z)
- Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors [103.21152156339484]
The objective of this paper is audio-visual synchronisation of general videos 'in the wild'.
We make four contributions: (i) in order to handle longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors'.
We identify artefacts that can arise from the compression codecs used for audio and video, and that audio-visual models can exploit in training to artificially solve the synchronisation task.
arXiv Detail & Related papers (2022-10-13T14:25:37Z)
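One way to read the 'selectors' above is as a small set of learnable query vectors that cross-attend to the long audio or visual token sequence and compress it into a few summary tokens, keeping the transformer tractable on long clips. A PyTorch sketch under that interpretation, not the paper's exact module:

```python
import torch
import torch.nn as nn

class TokenSelector(nn.Module):
    """Learnable 'selector' queries that compress a long token
    sequence into num_selectors summary tokens via cross-attention."""
    def __init__(self, dim=512, num_selectors=16, nhead=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_selectors, dim))
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)

    def forward(self, tokens):                  # tokens: (B, T, dim), T large
        q = self.queries.expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)   # (B, num_selectors, dim)
        return out
```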
- MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision [72.5863451123577]
We show how to train a neural model that can perform accurate 3D pose and camera estimation.
Our method outperforms both classical bundle adjustment and weakly-supervised monocular 3D baselines.
arXiv Detail & Related papers (2021-08-10T18:39:56Z)
- Single-Frame based Deep View Synchronization for Unsynchronized Multi-Camera Surveillance [56.964614522968226]
Multi-camera surveillance has been an active research topic for understanding and modeling scenes.
When designing models for these multi-camera tasks, it is usually assumed that all cameras are temporally synchronized.
Our view synchronization models are applied to different DNNs-based multi-camera vision tasks under the unsynchronized setting.
arXiv Detail & Related papers (2020-07-08T04:39:38Z)