NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing
- URL: http://arxiv.org/abs/2603.02802v1
- Date: Tue, 03 Mar 2026 09:41:06 GMT
- Title: NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing
- Authors: Tianlin Pan, Jiayi Dai, Chenpu Yuan, Zhengyao Lv, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu, Caifeng Shan, Chenyang Si
- Abstract summary: NOVA: Sparse Control & Dense Synthesis is a new framework for unpaired video editing. Our experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.
- Score: 26.74471251505078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent video editing models have achieved impressive results, but most still require large-scale paired datasets. Collecting such naturally aligned pairs at scale remains highly challenging and constitutes a critical bottleneck, especially for local video editing data. Existing workarounds transfer image editing to video through global motion control for pair-free video editing, but such designs struggle with background and temporal consistency. In this paper, we propose NOVA: Sparse Control & Dense Synthesis, a new framework for unpaired video editing. Specifically, the sparse branch provides semantic guidance through user-edited keyframes distributed across the video, and the dense branch continuously incorporates motion and texture information from the original video to maintain high fidelity and coherence. Moreover, we introduce a degradation-simulation training strategy that enables the model to learn motion reconstruction and temporal consistency by training on artificially degraded videos, thus eliminating the need for paired data. Our extensive experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.
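The degradation-simulation idea in the abstract can be made concrete with a small sketch: the model is trained to reconstruct the original clip from an artificially degraded copy plus a few clean keyframes, so no edited/original pairs are required. The code below is an illustrative approximation based only on the abstract, not the authors' implementation; the degradation operator, the toy dual-branch module, and all names (`degrade`, `ToyDenseSparseEditor`, etc.) are assumptions introduced here.

```python
# Minimal sketch (not the NOVA code) of pair-free, degradation-simulation training:
# reconstruct the original clip from a degraded copy (dense branch input) plus a
# few clean keyframes (sparse branch input). All names here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

def degrade(video: torch.Tensor, noise_std: float = 0.1, scale: float = 0.5) -> torch.Tensor:
    """Simulate degradation: spatial down/up-sampling plus Gaussian noise.
    video: (B, T, C, H, W)"""
    b, t, c, h, w = video.shape
    flat = video.reshape(b * t, c, h, w)
    low = F.interpolate(flat, scale_factor=scale, mode="bilinear", align_corners=False)
    up = F.interpolate(low, size=(h, w), mode="bilinear", align_corners=False)
    return (up + noise_std * torch.randn_like(up)).reshape(b, t, c, h, w)

class ToyDenseSparseEditor(nn.Module):
    """Stand-in for a dual-branch model: the dense branch sees the degraded
    video, the sparse branch sees clean keyframes; features are fused per frame."""
    def __init__(self, channels: int = 3, hidden: int = 32):
        super().__init__()
        self.dense = nn.Conv2d(channels, hidden, 3, padding=1)
        self.sparse = nn.Conv2d(channels, hidden, 3, padding=1)
        self.out = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, degraded, keyframes, key_idx):
        b, t, c, h, w = degraded.shape
        # For each frame, use the temporally nearest keyframe as sparse guidance.
        nearest = torch.cdist(
            torch.arange(t, dtype=torch.float32).unsqueeze(1),
            key_idx.float().unsqueeze(1),
        ).argmin(dim=1)                      # (T,) index of nearest keyframe
        guide = keyframes[:, nearest]        # (B, T, C, H, W)
        feats = self.dense(degraded.reshape(b * t, c, h, w)) \
              + self.sparse(guide.reshape(b * t, c, h, w))
        return self.out(F.relu(feats)).reshape(b, t, c, h, w)

# One pair-free training step: the reconstruction target is simply the original clip.
model = ToyDenseSparseEditor()
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
clip = torch.rand(2, 8, 3, 64, 64)              # (B, T, C, H, W) original video
key_idx = torch.tensor([0, 4, 7])               # sparse keyframe positions
recon = model(degrade(clip), clip[:, key_idx], key_idx)
loss = F.mse_loss(recon, clip)
loss.backward()
optim.step()
```

At inference, the clean keyframes would be replaced by user-edited ones, so the reconstruction objective learned here transfers to propagating edits across the clip; that substitution is an interpretation of the abstract, not a detail stated in it.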
Related papers
- EasyV2V: A High-quality Instruction-based Video Editing Framework [108.78294392167017]
EasyV2V is a framework for instruction-based video editing. EasyV2V works with flexible inputs, e.g., video+text and video+mask+reference, and achieves state-of-the-art video editing results.
arXiv Detail & Related papers (2025-12-18T18:59:57Z) - Low-Cost Test-Time Adaptation for Robust Video Editing [4.707015344498921]
Video editing is a critical component of content creation that transforms raw footage into coherent works aligned with specific visual and narrative objectives. Existing approaches face two major challenges: temporal inconsistencies due to failure in capturing complex motion patterns, and overfitting to simple prompts arising from limitations in UNet backbone architectures. We present Vid-TTA, a lightweight test-time adaptation framework that personalizes optimization for each test video during inference through self-supervised auxiliary tasks.
arXiv Detail & Related papers (2025-07-29T14:31:17Z) - Motion-Aware Concept Alignment for Consistent Video Editing [57.08108545219043]
We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework bridging the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video. We evaluate MoCA-Video's performance using the standard SSIM, image-level LPIPS, and temporal LPIPS, and introduce a novel metric, CASS (Conceptual Alignment Shift Score), to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames.
arXiv Detail & Related papers (2025-06-01T13:28:04Z) - SketchVideo: Sketch-based Video Generation and Editing [51.99066098393491]
We aim to achieve sketch-based spatial and motion control for video generation and support fine-grained editing of real or synthetic videos. Based on the DiT video generation model, we propose a memory-efficient control structure with sketch control blocks that predict residual features of skipped DiT blocks. For sketch-based video editing, we design an additional video insertion module that maintains consistency between the newly edited content and the original video's spatial features and dynamic motion.
arXiv Detail & Related papers (2025-03-30T02:44:09Z) - HyperNVD: Accelerating Neural Video Decomposition via Hypernetworks [4.536530093400348]
Existing video-layer decomposition models rely on implicit neural representations (INRs) trained independently for each video. We propose a meta-learning strategy to learn a generic video decomposition model to speed up training on new videos. Our strategy mitigates the problem of single-video overfitting and, importantly, shortens the convergence of video decomposition on new, unseen videos.
arXiv Detail & Related papers (2025-03-21T16:24:47Z) - Edit as You See: Image-guided Video Editing via Masked Motion Modeling [18.89936405508778]
We propose a novel Image-guided Video Editing Diffusion model, termed IVEDiff. IVEDiff is built on top of image editing models and is equipped with learnable motion modules to maintain the temporal consistency of edited videos. Our method is able to generate temporally smooth edited videos while robustly dealing with various editing objects with high quality.
arXiv Detail & Related papers (2025-01-08T07:52:12Z) - Video Decomposition Prior: A Methodology to Decompose Videos into Layers [74.36790196133505]
This paper introduces a novel video decomposition prior (VDP) framework which derives inspiration from professional video editing practices. The VDP framework decomposes a video sequence into a set of multiple RGB layers and associated opacity levels. We address tasks such as video object segmentation, dehazing, and relighting.
arXiv Detail & Related papers (2024-12-06T10:35:45Z) - VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames.
Our method improves temporal coherence and reduces memory consumption in self-attention computations.
arXiv Detail & Related papers (2023-12-17T09:05:56Z) - Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z)