VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
- URL: http://arxiv.org/abs/2512.16906v1
- Date: Thu, 18 Dec 2025 18:58:42 GMT
- Title: VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
- Authors: Xiaoyan Cong, Haotian Yang, Angtian Wang, Yizhi Wang, Yiding Yang, Canyu Zhang, Chongyang Ma,
- Abstract summary: VIVA is a scalable framework for instruction-based video editing. It uses VLM-guided encoding and reward optimization. We show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods.
- Score: 31.89256250882701
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods. Website: https://viva-paper.github.io
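The core of Edit-GRPO, as the abstract describes it, is scoring a group of candidate edits with relative rewards rather than absolute ones. A minimal sketch of that group-normalization step is below; the reward-combination weights and the `1e-8` stabilizer are illustrative assumptions, not values from the paper.

```python
import numpy as np

def edit_grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: normalize each
    candidate's reward against the mean and std of its rollout group.

    rewards: iterable of scalar rewards for a group of edited videos
    sampled from the same (source video, instruction) pair.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def combined_reward(r_instr, r_preserve, r_aesthetic,
                    w=(0.5, 0.3, 0.2)):
    """Illustrative weighted sum of the three reward axes the abstract
    names: instruction faithfulness, content preservation, aesthetics.
    The weights are hypothetical, not taken from the paper."""
    return w[0] * r_instr + w[1] * r_preserve + w[2] * r_aesthetic
```

Because advantages are centered within each group, a candidate is only reinforced for being better than its siblings from the same prompt, which removes the need for a learned value baseline.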
Related papers
- Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance [55.32799307123252]
We introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets. We propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance.
arXiv Detail & Related papers (2026-03-02T18:46:28Z) - PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models [35.59605874012795]
PropFly is a training pipeline for propagation-based video editing. PropFly relies on pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets. Our pipeline enables an adapter attached to the pre-trained VDM to learn to propagate edits via a Guidance-Modulated Flow Matching (GMFM) loss.
arXiv Detail & Related papers (2026-02-24T06:11:08Z) - Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations [8.479321655643195]
We introduce reasoning video editing, a task where video editing models must interpret implicit queries through multi-hop reasoning to infer editing targets before executing modifications. RIVER decouples reasoning from generation through digital twin representations of video content that preserve spatial relationships, temporal trajectories, and semantic attributes. RIVER training uses reinforcement learning with rewards that evaluate reasoning accuracy and generation quality.
arXiv Detail & Related papers (2025-11-18T03:37:19Z) - In-Context Learning with Unpaired Clips for Instruction-based Video Editing [51.943707933717185]
We introduce a low-cost pretraining strategy for instruction-based video editing. Our framework first pretrains on approximately 1M real video clips to learn basic editing concepts. Our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity.
arXiv Detail & Related papers (2025-10-16T13:02:11Z) - FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing [2.7248421583285265]
FlowDirector is a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism.
arXiv Detail & Related papers (2025-06-05T13:54:40Z) - VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation [70.87745520234012]
We introduce VEGGIE, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model.
arXiv Detail & Related papers (2025-03-18T15:31:12Z) - Video Decomposition Prior: A Methodology to Decompose Videos into Layers [74.36790196133505]
This paper introduces a novel Video Decomposition Prior (VDP) framework which derives inspiration from professional video editing practices. The VDP framework decomposes a video sequence into a set of multiple RGB layers and associated opacity levels. We address tasks such as video object segmentation, dehazing, and relighting.
arXiv Detail & Related papers (2024-12-06T10:35:45Z) - A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model [10.736207095604414]
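The RGB-layers-plus-opacity decomposition that VDP produces implies a standard back-to-front "over" composite for reconstructing the frame. The sketch below is a generic illustration of that compositing step under the stated layer model, not the paper's actual implementation.

```python
import numpy as np

def composite_layers(layers, alphas):
    """Back-to-front 'over' compositing of RGB layers with per-pixel
    opacity, i.e. the reconstruction direction of a layer decomposition.

    layers: list of (H, W, 3) float arrays in [0, 1], ordered back to front.
    alphas: list of (H, W, 1) float arrays in [0, 1], one per layer.
    """
    out = np.zeros_like(layers[0])
    for rgb, alpha in zip(layers, alphas):
        # Each layer covers what is behind it in proportion to its opacity.
        out = alpha * rgb + (1.0 - alpha) * out
    return out
```

A half-transparent white layer over a dark background, for instance, yields the expected midpoint blend, which is a quick sanity check that a recovered decomposition reconstructs the input frame.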
We propose a two-stage scheme for general editing. First, unlike previous works that extract scene-specific features, we leverage a pre-trained Vision-Language Model (VLM).
We also propose a Reinforcement Learning (RL)-based editing framework to formulate the editing problem and train the virtual editor to make better sequential editing decisions.
arXiv Detail & Related papers (2024-11-07T18:20:28Z) - VIA: Unified Spatiotemporal Video Adaptation Framework for Global and Local Video Editing [91.60658973688996]
We introduce VIA, a unified Video Adaptation framework for global and local video editing, pushing the limits of consistently editing minute-long videos. To ensure local consistency within individual frames, we designed test-time editing adaptation to adapt a pre-trained image editing model. We show that VIA can achieve consistent long video editing in minutes, unlocking the potential for advanced video editing tasks over long video sequences.
arXiv Detail & Related papers (2024-06-18T17:51:37Z) - InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.