Deep learning-based stereo camera multi-video synchronization
- URL: http://arxiv.org/abs/2303.12916v1
- Date: Wed, 22 Mar 2023 21:14:36 GMT
- Title: Deep learning-based stereo camera multi-video synchronization
- Authors: Nicolas Boizard, Kevin El Haddad, Thierry Ravet, François Cresson, and Thierry Dutoit
- Abstract summary: A software-based synchronization method would reduce the cost, weight and size of the entire system.
This study paves the way to a production-ready software-based video synchronization system.
- Score: 5.305803516459996
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stereo vision is essential for many applications. Currently, the
synchronization of the streams coming from two cameras is done using mostly
hardware. A software-based synchronization method would reduce the cost, weight
and size of the entire system and allow for more flexibility when building such
systems. With this goal in mind, we present here a comparison of different deep
learning-based systems and prove that some are efficient and generalizable
enough for such a task. This study paves the way to a production-ready
software-based video synchronization system.
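As a rough illustration of the task a software-based method must solve, the following is a minimal sketch (not the deep learning approach studied in the paper) that estimates the frame offset between two streams by cross-correlating per-frame feature signals. The helper name `estimate_offset` and the use of random traces as stand-ins for per-frame features (e.g. mean brightness) are assumptions for illustration.

```python
import numpy as np

def estimate_offset(sig_a, sig_b, max_lag):
    """Estimate the frame offset between two per-frame feature signals
    by maximising the normalised cross-correlation over a lag window.
    A negative result means stream A leads stream B."""
    a = (sig_a - sig_a.mean()) / (sig_a.std() + 1e-8)
    b = (sig_b - sig_b.mean()) / (sig_b.std() + 1e-8)
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[lag:], b[:len(b) - lag]
        else:
            x, y = a[:len(a) + lag], b[-lag:]
        n = min(len(x), len(y))
        score = float(np.dot(x[:n], y[:n])) / n  # mean product of z-scores
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Two synthetic feature traces cut from the same signal,
# with stream A leading stream B by 3 frames.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
sig_a = base[3:]   # stream A: sees event k at index k - 3
sig_b = base[:-3]  # stream B: sees event k at index k
print(estimate_offset(sig_a, sig_b, max_lag=10))  # prints -3
```

A learned system would replace the hand-picked feature signal with embeddings predicted by a network, but the alignment objective is the same.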
Related papers
- JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation [108.21827580870979]
This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT has an encoder-LLM-decoder architecture with a SyncFusion module for spatio-temporal audio-video fusion. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning.
arXiv Detail & Related papers (2025-12-28T12:25:43Z)
- RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems [38.099313678683224]
We present a low-cost, general-purpose synchronization method that achieves millisecond-level temporal alignment across diverse camera systems. The proposed solution employs a custom-built LED Clock that encodes time through red and infrared light, allowing visual decoding of the exposure window. We validate the system in large-scale surgical recordings involving over 25 heterogeneous cameras spanning both IR and RGB modalities.
arXiv Detail & Related papers (2025-11-18T22:13:06Z)
- StereoSync: Spatially-Aware Stereo Audio Generation from Video [36.230236159381995]
StereoSync is a novel model designed to generate audio that is both temporally synchronized with a reference video and spatially aligned with its visual context. We evaluate StereoSync on Walking The Maps, a dataset comprising videos from video games that feature animated characters walking through diverse environments.
arXiv Detail & Related papers (2025-10-07T11:51:58Z)
- Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers [19.226787997122987]
We present Syncphony, which generates 380x640-resolution, 24 fps videos synchronized with diverse audio inputs. Our approach builds upon a pre-trained video backbone and incorporates two key components to improve synchronization. Experiments on the AVSync15 and The Greatest Hits datasets demonstrate that Syncphony outperforms existing methods in both synchronization accuracy and visual quality.
arXiv Detail & Related papers (2025-09-26T05:30:06Z)
- Beyond Audio and Pose: A General-Purpose Framework for Video Synchronization [1.7820202405704466]
Video synchronization is crucial for applications such as reality TV show production, sports analysis, surveillance, and autonomous systems. Prior work has heavily relied on audio cues or specific visual events, limiting applicability in diverse settings. We introduce VideoSync, a video synchronization framework that operates independently of specific feature extraction methods.
arXiv Detail & Related papers (2025-06-19T00:41:21Z)
- JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization [94.82127738291749]
JavisDiT is able to generate high-quality audio and video content simultaneously from open-ended user prompts.
A new benchmark, JavisBench, consists of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios.
arXiv Detail & Related papers (2025-03-30T09:40:42Z)
- Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge [57.01131456894516]
Current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios.
We propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction.
Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications.
arXiv Detail & Related papers (2025-01-23T08:33:10Z)
- SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints [43.14498014617223]
We propose a plug-and-play module that enhances a pre-trained text-to-video model for multi-camera video generation.
We introduce a multi-view synchronization module to maintain appearance and geometry consistency across different viewpoints.
Our method enables intriguing extensions, such as re-rendering a video from novel viewpoints.
arXiv Detail & Related papers (2024-12-10T18:55:17Z)
- MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos [104.1338295060383]
We present a system that allows for accurate, fast, and robust estimation of camera parameters and depth maps from casual monocular videos of dynamic scenes.
Our system is significantly more accurate and robust at camera pose and depth estimation when compared with prior and concurrent work.
arXiv Detail & Related papers (2024-12-05T18:59:42Z)
- Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control [70.17137528953953]
Collaborative video diffusion (CVD) is trained on top of a state-of-the-art camera-control module for video generation.
CVD generates multiple videos rendered from different camera trajectories with significantly better consistency than baselines.
arXiv Detail & Related papers (2024-05-27T17:58:01Z)
- Synchformer: Efficient Synchronization from Sparse Cues [100.89656994681934]
Our contributions include a novel audio-visual synchronization model, and a training scheme that decouples feature extraction from synchronization modelling.
This approach achieves state-of-the-art performance in both dense and sparse settings.
We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset; investigate evidence attribution techniques for interpretability; and explore a new capability for synchronization models: audio-visual synchronizability.
arXiv Detail & Related papers (2024-01-29T18:59:55Z)
- Enabling Cross-Camera Collaboration for Video Analytics on Distributed Smart Cameras [7.609628915907225]
We present Argus, a distributed video analytics system with cross-camera collaboration on smart cameras.
We identify multi-camera, multi-target tracking as the primary task of multi-camera video analytics and develop a novel technique that avoids redundant, processing-heavy tasks.
Argus reduces the number of object identifications and end-to-end latency by up to 7.13x and 2.19x compared to the state-of-the-art.
arXiv Detail & Related papers (2024-01-25T12:27:03Z)
- Learning from One Continuous Video Stream [70.30084026960819]
We introduce a framework for online learning from a single continuous video stream.
This poses great challenges given the high correlation between consecutive video frames.
We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation.
arXiv Detail & Related papers (2023-12-01T14:03:30Z)
- Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors [103.21152156339484]
The objective of this paper is audio-visual synchronisation of general videos 'in the wild'.
We make four contributions: (i) to handle the longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors'.
We identify artefacts that can arise from the compression codecs used for audio and video, which audio-visual models can exploit during training to artificially solve the synchronisation task.
arXiv Detail & Related papers (2022-10-13T14:25:37Z)
- Synchronized Smartphone Video Recording System of Depth and RGB Image Frames with Sub-millisecond Precision [2.1286051580524523]
We propose a recording system with high time synchronization (sync) precision.
It consists of heterogeneous sensors, such as smartphones, depth cameras, and IMUs.
arXiv Detail & Related papers (2021-11-05T15:16:54Z)
- MFuseNet: Robust Depth Estimation with Learned Multiscopic Fusion [47.2251122861135]
We design a multiscopic vision system that utilizes a low-cost monocular RGB camera to acquire accurate depth estimation.
Unlike multi-view stereo with images captured at unconstrained camera poses, the proposed system controls the motion of a camera to capture a sequence of images.
arXiv Detail & Related papers (2021-08-05T08:31:01Z)
- Single-Frame based Deep View Synchronization for Unsynchronized Multi-Camera Surveillance [56.964614522968226]
Multi-camera surveillance has been an active research topic for understanding and modeling scenes.
It is usually assumed that the cameras are all temporally synchronized when designing models for these multi-camera based tasks.
Our view synchronization models are applied to different DNN-based multi-camera vision tasks under the unsynchronized setting.
arXiv Detail & Related papers (2020-07-08T04:39:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.