Vectorized Video Representation with Easy Editing via Hierarchical Spatio-Temporally Consistent Proxy Embedding
- URL: http://arxiv.org/abs/2510.12256v1
- Date: Tue, 14 Oct 2025 08:05:30 GMT
- Title: Vectorized Video Representation with Easy Editing via Hierarchical Spatio-Temporally Consistent Proxy Embedding
- Authors: Ye Chen, Liming Tan, Yupeng Zhu, Yuanbin Wang, Bingbing Ni
- Abstract summary: The proposed representation achieves high video reconstruction accuracy with fewer parameters. It supports complex video processing tasks, including video inpainting and temporally consistent video editing.
- Score: 45.593989778240655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current video representations heavily rely on unstable and over-grained priors for motion and appearance modelling, i.e., pixel-level matching and tracking. A tracking error of just a few pixels can lead to the collapse of the visual object representation, not to mention the occlusions and large motions that frequently occur in videos. To overcome this vulnerability, this work proposes spatio-temporally consistent proxy nodes to represent dynamically changing objects/scenes in the video. On the one hand, the hierarchical proxy nodes can stably express the multi-scale structure of visual objects, so they are not affected by accumulated tracking error, long-term motion, occlusion, and viewpoint variation. On the other hand, the dynamic representation update mechanism of the proxy nodes adequately leverages spatio-temporal priors of the video to mitigate the impact of inaccurate trackers, thereby effectively handling drastic changes in scenes and objects. Additionally, the decoupled encoding of shape and texture representations across different visual objects in the video facilitates controllable and fine-grained appearance editing. Extensive experiments demonstrate that the proposed representation achieves high video reconstruction accuracy with fewer parameters and supports complex video processing tasks, including video inpainting and keyframe-based temporally consistent video editing.
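The abstract describes the representation only at a high level. As a loose illustration of hierarchical proxy nodes with decoupled shape/texture codes and a gated temporal update, here is a minimal PyTorch sketch; every name in it (ProxyNode, update_gate, the update rule itself) is a hypothetical reading of the abstract, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ProxyNode(nn.Module):
    """Hypothetical proxy node: one node per object (or object part).

    Shape and texture are decoupled latent codes, so texture can be
    edited without disturbing geometry (and vice versa), echoing the
    abstract's decoupled encoding of shape and texture.
    """
    def __init__(self, dim=64, num_children=0):
        super().__init__()
        self.shape = nn.Parameter(torch.randn(dim))    # geometry / contour code
        self.texture = nn.Parameter(torch.randn(dim))  # appearance code
        # Hierarchy: children model finer-scale structure of the same object.
        self.children_nodes = nn.ModuleList(
            ProxyNode(dim) for _ in range(num_children)
        )
        self.update_gate = nn.Linear(2 * dim, dim)

    @torch.no_grad()
    def update(self, observation: torch.Tensor) -> None:
        """Blend the current shape code with a per-frame observation.

        The gated-blend rule here is guessed; it damps noisy tracker
        signals instead of trusting them outright, in the spirit of the
        abstract's claim that the update mechanism mitigates inaccurate
        trackers.
        """
        gate = torch.sigmoid(self.update_gate(
            torch.cat([self.shape, observation])))
        self.shape.copy_(gate * self.shape + (1 - gate) * observation)
        for child in self.children_nodes:
            child.update(observation)

# Usage: a coarse object node with two finer-scale children.
node = ProxyNode(dim=64, num_children=2)
node.update(torch.randn(64))
```

Because each node keeps its own persistent state rather than re-deriving it from per-frame pixel matches, a single bad tracker observation only nudges the representation instead of corrupting it, which is the failure mode the abstract attributes to pixel-level tracking.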
Related papers
- PISCO: Precise Video Instance Insertion with Sparse Control [20.322731348056543]
In professional AI-assisted filmmaking, it is crucial to perform precise, targeted modifications. A cornerstone of this transition is video instance insertion, which requires inserting a specific instance into existing footage. We propose PISCO, a video model for precise video instance insertion with arbitrary sparse control.
arXiv Detail & Related papers (2026-02-09T05:15:39Z) - Tuning-free Visual Effect Transfer across Videos [91.93897438317397]
RefVFX is a framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner. We introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video. We show that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference.
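The summary spells out the triplet structure of the dataset, so a tiny sketch of what one training example might look like is straightforward; the field names and paths below are illustrative, not RefVFX's actual schema.

```python
from dataclasses import dataclass

@dataclass
class EffectTriplet:
    """One training example, mirroring the triplet structure described
    in the RefVFX summary. All names here are invented for illustration."""
    reference_effect_video: str  # path to the video demonstrating the effect
    input_media: str             # path to the target image or video
    output_video: str            # path to the ground-truth edited result

# Hypothetical example instance.
example = EffectTriplet(
    reference_effect_video="effects/reference_effect.mp4",
    input_media="inputs/target_scene.mp4",
    output_video="outputs/target_scene_with_effect.mp4",
)
```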
arXiv Detail & Related papers (2026-01-12T18:59:32Z) - Point-to-Point: Sparse Motion Guidance for Controllable Video Editing [29.888408281118846]
We propose a novel motion representation, anchor tokens, that capture the most essential motion patterns by leveraging the rich prior of a video diffusion model. In experiments, anchor tokens lead to more controllable and semantically aligned video edits, achieving superior performance in terms of edit and motion fidelity.
arXiv Detail & Related papers (2025-11-23T03:59:59Z) - Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion. We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
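As a rough sketch of the idea of attending along point tracks: sample a feature vector at each tracked point in every frame, then run self-attention over the time axis of each track. This is a guess at the mechanism from the summary, not the paper's actual Tracktention Layer; all names below are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrackAttention(nn.Module):
    """Sketch of attending along point tracks (inspired by the
    Tracktention summary; not the paper's implementation)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats, tracks):
        # feats:  (T, C, H, W) per-frame feature maps
        # tracks: (N, T, 2) point locations in [-1, 1] grid coordinates
        T, C, H, W = feats.shape
        N = tracks.shape[0]
        # Sample a feature vector at every tracked point in every frame.
        grid = tracks.permute(1, 0, 2).reshape(T, N, 1, 2)   # (T, N, 1, 2)
        sampled = F.grid_sample(feats, grid, align_corners=False)
        sampled = sampled.squeeze(-1).permute(2, 0, 1)       # (N, T, C)
        # Temporal self-attention per track: each track attends over time,
        # so information follows the moving point rather than a fixed pixel.
        out, _ = self.attn(sampled, sampled, sampled)
        return out  # (N, T, C) temporally smoothed track features

layer = TrackAttention(dim=64)
feats = torch.randn(8, 64, 32, 32)       # 8 frames of features
tracks = torch.rand(16, 8, 2) * 2 - 1    # 16 tracks, coords in [-1, 1]
print(layer(feats, tracks).shape)        # torch.Size([16, 8, 64])
```

Attending along a track rather than at a fixed spatial location is what lets such a layer stay coherent under large object motion, where plain temporal attention at fixed positions mixes features from different objects.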
arXiv Detail & Related papers (2025-03-25T17:58:48Z) - VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control [66.66226299852559]
VideoAnydoor is a zero-shot video object insertion framework with high-fidelity detail preservation and precise motion control. To preserve detailed appearance while supporting fine-grained motion control, we design a pixel warper.
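The summary names a pixel warper but gives no detail. Below is a generic backward-warping sketch in PyTorch (displace a sampling grid by a flow field and resample with grid_sample), which is one plausible shape such a component could take; it is not VideoAnydoor's actual module.

```python
import torch
import torch.nn.functional as F

def warp_pixels(reference, flow):
    """Warp a reference image with a dense displacement field.

    reference: (B, C, H, W) image whose detailed appearance we preserve
    flow:      (B, 2, H, W) per-pixel displacement in pixels (dx, dy)

    Generic backward warping, invented for illustration; the paper's
    pixel warper details are not in the summary.
    """
    B, _, H, W = reference.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    base = torch.stack((xs, ys))           # (2, H, W), (x, y) order
    coords = base.unsqueeze(0) + flow      # displaced coordinates
    # Normalize to [-1, 1] as required by grid_sample.
    coords[:, 0] = coords[:, 0] / (W - 1) * 2 - 1
    coords[:, 1] = coords[:, 1] / (H - 1) * 2 - 1
    grid = coords.permute(0, 2, 3, 1)      # (B, H, W, 2)
    return F.grid_sample(reference, grid, align_corners=True)

ref = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)  # zero motion: output equals the input
out = warp_pixels(ref, flow)
```

Warping actual reference pixels, rather than regenerating them, is a common way to keep fine texture intact while trajectories dictate where that texture lands in each frame.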
arXiv Detail & Related papers (2025-01-02T18:59:54Z) - InstMove: Instance Motion for Object-centric Video Segmentation [70.16915119724757]
In this work, we study the instance-level motion and present InstMove, which stands for Instance Motion for Object-centric Video Segmentation.
In comparison to pixel-wise motion, InstMove mainly relies on instance-level motion information that is free from image feature embeddings.
With only a few lines of code, InstMove can be integrated into current SOTA methods for three different video segmentation tasks.
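To make the "few lines of code" integration claim concrete, here is a toy sketch: a mask-only motion module whose prediction biases an existing segmenter's logits. Both the module architecture and the fusion rule are invented for illustration; only the masks-in/mask-out interface, free of image feature embeddings, comes from the summary.

```python
import torch
import torch.nn as nn

class InstanceMotionPredictor(nn.Module):
    """Toy stand-in for an instance-level motion module.

    Per the InstMove summary, it consumes only past instance masks
    (no image feature embeddings) and predicts where the mask should
    be in the next frame. The architecture here is a guess.
    """
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )

    def forward(self, prev_masks):
        # prev_masks: (B, 2, H, W), masks from the two previous frames
        return torch.sigmoid(self.net(prev_masks))  # (B, 1, H, W) prior

# "Few lines of code" integration: bias a segmenter's mask logits
# with the motion prior (this fusion rule is illustrative, not the paper's).
motion = InstanceMotionPredictor()
prev_masks = torch.rand(1, 2, 64, 64)
seg_logits = torch.randn(1, 1, 64, 64)   # from any existing segmenter
fused = seg_logits + torch.log(motion(prev_masks) + 1e-6)
```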
arXiv Detail & Related papers (2023-03-14T17:58:44Z) - Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
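A minimal sketch of the separate-branches idea, assuming late fusion of a recurrent motion branch and a per-frame appearance branch; MARN's actual motion-guided and appearance-guided object-relation reasoning is more involved, and every layer choice below is a guess.

```python
import torch
import torch.nn as nn

class TwoBranchReasoner(nn.Module):
    """Minimal sketch of separate motion and appearance branches with
    late fusion, echoing the MARN summary. Not the paper's network."""
    def __init__(self, dim=128):
        super().__init__()
        self.motion_branch = nn.GRU(dim, dim, batch_first=True)  # temporal
        self.appearance_branch = nn.Linear(dim, dim)             # per-frame
        self.fuse = nn.Linear(2 * dim, 1)  # per-clip grounding score

    def forward(self, motion_feats, appearance_feats):
        # motion_feats:     (B, T, dim), e.g. optical-flow features
        # appearance_feats: (B, T, dim), e.g. RGB frame features
        m, _ = self.motion_branch(motion_feats)
        a = self.appearance_branch(appearance_feats)
        return self.fuse(torch.cat([m, a], dim=-1))  # (B, T, 1) scores

model = TwoBranchReasoner()
scores = model(torch.randn(2, 10, 128), torch.randn(2, 10, 128))
```

Keeping the two branches separate until a late fusion step is what lets each modality learn its own relation structure before the grounding decision combines them.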
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.