Synchronization of Multiple Videos
- URL: http://arxiv.org/abs/2510.14051v1
- Date: Wed, 15 Oct 2025 19:43:57 GMT
- Title: Synchronization of Multiple Videos
- Authors: Avihai Naaman, Ron Shapira Weber, Oren Freifeld
- Abstract summary: Synchronizing videos from different scenes or generative AI videos poses a far more complex challenge due to diverse subjects, backgrounds, and nonlinear temporal misalignment. We propose Temporal Prototype Learning (TPL), a prototype-based framework that constructs a shared, compact 1D representation from high-dimensional embeddings extracted by various pretrained models. TPL robustly aligns videos by learning a unified prototype sequence that anchors key action phases, thereby avoiding exhaustive pairwise matching.
- Score: 10.539720730126263
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synchronizing videos captured simultaneously from multiple cameras in the same scene is often easy and typically requires only simple time shifts. However, synchronizing videos from different scenes or, more recently, generative AI videos, poses a far more complex challenge due to diverse subjects, backgrounds, and nonlinear temporal misalignment. We propose Temporal Prototype Learning (TPL), a prototype-based framework that constructs a shared, compact 1D representation from high-dimensional embeddings extracted by any of various pretrained models. TPL robustly aligns videos by learning a unified prototype sequence that anchors key action phases, thereby avoiding exhaustive pairwise matching. Our experiments show that TPL improves synchronization accuracy, efficiency, and robustness across diverse datasets, including fine-grained frame retrieval and phase classification tasks. Importantly, TPL is the first approach to mitigate synchronization issues in multiple generative AI videos depicting the same action. Our code and a new multiple video synchronization dataset are available at https://bgu-cs-vil.github.io/TPL/
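The idea of aligning every video to one shared prototype sequence, rather than matching all pairs, can be illustrated with a minimal sketch. This is not the authors' TPL implementation: the prototype here is a fixed, hand-written 1D sequence instead of a learned one, the per-video 1D signals are assumed to be given already, and classic dynamic time warping (DTW) stands in for whatever alignment TPL actually uses. The function names `dtw_path` and `align_to_prototype` are hypothetical.

```python
from typing import List, Tuple


def dtw_path(a: List[float], b: List[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Classic DTW between two 1D sequences; returns (total cost, warping path)."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end to recover the warping path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(moves)
    path.reverse()
    return cost[n][m], path


def align_to_prototype(signal: List[float], prototype: List[float]) -> List[int]:
    """For each prototype step, return the first frame of `signal` warped onto it."""
    _, path = dtw_path(signal, prototype)
    frames = [-1] * len(prototype)
    for frame_idx, proto_idx in path:
        if frames[proto_idx] == -1:
            frames[proto_idx] = frame_idx
    return frames


# Assumed toy data: a hand-written prototype of four action phases and two
# videos that pass through the same phases at different, nonlinear speeds.
prototype = [0.0, 1.0, 2.0, 3.0]
videos = {
    "video_a": [0.0, 0.0, 1.0, 2.0, 3.0],
    "video_b": [0.0, 1.0, 1.0, 1.0, 2.0, 3.0, 3.0],
}

# One alignment per video (N in total) instead of N * (N - 1) / 2 pairwise ones.
sync = {name: align_to_prototype(sig, prototype) for name, sig in videos.items()}
print(sync)
```

Frames that map to the same prototype step are treated as synchronized: in this toy example, frame 3 of `video_a` and frame 4 of `video_b` both correspond to the third phase. With N videos the sketch runs N alignments against the prototype, which is the sense in which a shared prototype avoids exhaustive pairwise matching.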
Related papers
- Reangle-A-Video: 4D Video Generation as Video-to-Video Translation [55.08100087149101]
We introduce Reangle-A-Video, a unified framework for generating synchronized multi-view videos from a single input video. Our method reframes the multi-view video generation task as video-to-videos translation, leveraging publicly available image and video diffusion priors.
arXiv Detail & Related papers (2025-03-12T08:26:15Z)
- Multi-Pair Temporal Sentence Grounding via Multi-Thread Knowledge Transfer Network [57.72095897427665]
Temporal sentence grounding (TSG) aims to locate query-relevant segments in videos. Previous methods follow a single-thread framework that cannot co-train different pairs. We propose Multi-Pair TSG, which aims to co-train these pairs.
arXiv Detail & Related papers (2024-12-20T08:50:11Z)
- Mind the Time: Temporally-Controlled Multi-Event Video Generation [65.05423863685866]
We present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. For the first time in the literature, our model offers control over the timing of events in generated videos.
arXiv Detail & Related papers (2024-12-06T18:52:20Z)
- SyncVIS: Synchronized Video Instance Segmentation [48.75470418596875]
We propose to conduct synchronized modeling via a new framework named SyncVIS. SyncVIS explicitly introduces video-level query embeddings and designs two key modules to synchronize video-level query embeddings with frame-level query embeddings. The proposed approach achieves state-of-the-art results, demonstrating its effectiveness and generality.
arXiv Detail & Related papers (2024-12-01T16:43:20Z)
- Sync-NeRF: Generalizing Dynamic NeRFs to Unsynchronized Videos [9.90835990611019]
We introduce time offsets for individual unsynchronized videos and jointly optimize the offsets with NeRF.
Finding the offsets naturally works as synchronizing the videos without manual effort.
arXiv Detail & Related papers (2023-10-20T08:45:30Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors [103.21152156339484]
The objective of this paper is audio-visual synchronisation of general videos 'in the wild'.
We make four contributions: (i) in order to handle longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors'.
We identify artefacts that can arise from the compression codecs used for audio and video and can be used by audio-visual models in training to artificially solve the synchronisation task.
arXiv Detail & Related papers (2022-10-13T14:25:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.