VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
- URL: http://arxiv.org/abs/2512.16906v1
- Date: Thu, 18 Dec 2025 18:58:42 GMT
- Title: VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
- Authors: Xiaoyan Cong, Haotian Yang, Angtian Wang, Yizhi Wang, Yiding Yang, Canyu Zhang, Chongyang Ma,
- Abstract summary: VIVA is a scalable framework for instruction-based video editing. It uses VLM-guided encoding and reward optimization. We show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods.
- Score: 31.89256250882701
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods. Website: https://viva-paper.github.io
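The core of Edit-GRPO, as the abstract describes it, is scoring a group of candidate edits with relative rewards rather than absolute ones. A minimal sketch of that group-normalization step is below; the reward-combination weights and the `1e-8` stabilizer are illustrative assumptions, not values from the paper.

```python
import numpy as np

def edit_grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: normalize each
    candidate's reward against the mean and std of its rollout group.

    rewards: iterable of scalar rewards for a group of edited videos
    sampled from the same (source video, instruction) pair.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def combined_reward(r_instr, r_preserve, r_aesthetic,
                    w=(0.5, 0.3, 0.2)):
    """Illustrative weighted sum of the three reward axes the abstract
    names: instruction faithfulness, content preservation, aesthetics.
    The weights are hypothetical, not taken from the paper."""
    return w[0] * r_instr + w[1] * r_preserve + w[2] * r_aesthetic
```

Because advantages are centered within each group, a candidate is only reinforced for being better than its siblings from the same prompt, which removes the need for a learned value baseline.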
Related papers
- Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance [55.32799307123252]
We introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets. We propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance.
arXiv Detail & Related papers (2026-03-02T18:46:28Z) - PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models [35.59605874012795]
PropFly is a training pipeline for propagation-based video editing. PropFly relies on pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets. Our pipeline enables an adapter attached to the pre-trained VDM to learn to propagate edits via a Guidance-Modulated Flow Matching (GMFM) loss.
arXiv Detail & Related papers (2026-02-24T06:11:08Z) - Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations [8.479321655643195]
We introduce reasoning video editing, a task where video editing models must interpret implicit queries through multi-hop reasoning to infer editing targets before executing modifications. RIVER decouples reasoning from generation through digital twin representations of video content that preserve spatial relationships, temporal trajectories, and semantic attributes. RIVER training uses reinforcement learning with rewards that evaluate reasoning accuracy and generation quality.
arXiv Detail & Related papers (2025-11-18T03:37:19Z) - In-Context Learning with Unpaired Clips for Instruction-based Video Editing [51.943707933717185]
We introduce a low-cost pretraining strategy for instruction-based video editing. Our framework first pretrains on approximately 1M real video clips to learn basic editing concepts. Our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity.
arXiv Detail & Related papers (2025-10-16T13:02:11Z) - FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing [2.7248421583285265]
FlowDirector is a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism.
arXiv Detail & Related papers (2025-06-05T13:54:40Z) - VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation [70.87745520234012]
We introduce VEGGIE, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model.
arXiv Detail & Related papers (2025-03-18T15:31:12Z) - Video Decomposition Prior: A Methodology to Decompose Videos into Layers [74.36790196133505]
This paper introduces a novel Video Decomposition Prior (VDP) framework which derives inspiration from professional video editing practices. The VDP framework decomposes a video sequence into a set of multiple RGB layers and associated opacity levels. We address tasks such as video object segmentation, dehazing, and relighting.
arXiv Detail & Related papers (2024-12-06T10:35:45Z) - A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model [10.736207095604414]
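The RGB-layers-plus-opacity decomposition that VDP produces implies a standard back-to-front "over" composite for reconstructing the frame. The sketch below is a generic illustration of that compositing step under the stated layer model, not the paper's actual implementation.

```python
import numpy as np

def composite_layers(layers, alphas):
    """Back-to-front 'over' compositing of RGB layers with per-pixel
    opacity, i.e. the reconstruction direction of a layer decomposition.

    layers: list of (H, W, 3) float arrays in [0, 1], ordered back to front.
    alphas: list of (H, W, 1) float arrays in [0, 1], one per layer.
    """
    out = np.zeros_like(layers[0])
    for rgb, alpha in zip(layers, alphas):
        # Each layer covers what is behind it in proportion to its opacity.
        out = alpha * rgb + (1.0 - alpha) * out
    return out
```

A half-transparent white layer over a dark background, for instance, yields the expected midpoint blend, which is a quick sanity check that a recovered decomposition reconstructs the input frame.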
We propose a two-stage scheme for general editing. First, unlike previous works that extract scene-specific features, we leverage a pre-trained Vision-Language Model (VLM).
We also propose a Reinforcement Learning (RL)-based editing framework to formulate the editing problem and train the virtual editor to make better sequential editing decisions.
arXiv Detail & Related papers (2024-11-07T18:20:28Z) - VIA: Unified Spatiotemporal Video Adaptation Framework for Global and Local Video Editing [91.60658973688996]
We introduce VIA, a unified Video Adaptation framework for global and local video editing, pushing the limits of consistently editing minute-long videos. To ensure local consistency within individual frames, we designed test-time editing adaptation to adapt a pre-trained image editing model. We show that VIA can achieve consistent long video editing in minutes, unlocking the potential for advanced video editing tasks over long video sequences.
arXiv Detail & Related papers (2024-06-18T17:51:37Z) - InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.