ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning
- URL: http://arxiv.org/abs/2512.09924v2
- Date: Thu, 11 Dec 2025 02:30:12 GMT
- Title: ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning
- Authors: Xinyu Liu, Hangjie Yuan, Yujie Wei, Jiazheng Xing, Yujin Han, Jiahao Pan, Yanbiao Ma, Chi-Min Chan, Kang Zhao, Shiwei Zhang, Wenhan Luo, Yike Guo
- Abstract summary: Unified video models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing. We introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. We propose ReViSE, a framework that unifies generation and evaluation within a single architecture.
- Score: 57.08352504712699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unified video models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: 1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and 2) an inherent disconnect between the models' reasoning and editing capabilities, which prevents their rich understanding from effectively instructing the editing process. Bridging this gap requires an integrated framework that connects reasoning with visual transformation. To this end, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets: Reasoning-Informed Video Editing and In-Context Video Generation. These subsets cover diverse reasoning dimensions and real-world editing scenarios. Building upon this foundation, we propose ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation within a single architecture. The model's internal VLM provides intrinsic feedback by assessing whether the edited video logically satisfies the given instruction, and this differential feedback refines the generator's reasoning behavior during training. Extensive experiments on RVE-Bench demonstrate that ReViSE significantly enhances editing accuracy and visual fidelity, achieving a 32% improvement in the Overall score on the reasoning-informed video editing subset over state-of-the-art methods.
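To make the self-reflective loop described above concrete, here is a minimal sketch of one training step in the spirit of ReViSE, assuming a unified model that exposes both an editing head and an internal VLM critic. All names here (`model.edit`, `model.vlm_assess`, the REINFORCE-style surrogate loss) are hypothetical illustrations of the mechanism the abstract describes, not the paper's actual API or exact objective.

```python
# A minimal sketch of one self-reflective training step, in the spirit of
# the ReViSE loop described above. `model.edit`, `model.vlm_assess`, and the
# REINFORCE-style surrogate loss are hypothetical illustrations, not the
# paper's API or exact objective.
import torch


def self_reflective_step(model, video, instruction, optimizer):
    """Generate an edit, let the model's own internal VLM judge it, and use
    that judgment as the training signal for the generator."""
    # 1) Generation: the unified model edits the video per the instruction,
    #    returning the edited clip and the log-probabilities of its actions.
    edited_video, log_probs = model.edit(video, instruction)

    # 2) Self-reflection: the same model's internal VLM scores whether the
    #    edit logically satisfies the instruction (intrinsic feedback).
    with torch.no_grad():
        score = model.vlm_assess(edited_video, instruction)  # scalar in [0, 1]

    # 3) Feedback: reinforce edits the internal critic rates highly.
    #    A plain REINFORCE surrogate is used here purely for illustration.
    loss = -score * log_probs.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(score)
```

Because the critic is the model's own internal VLM rather than an external reward model, the feedback is intrinsic: the same architecture both generates and evaluates, with no additional supervision required.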
Related papers
- Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance [55.32799307123252]
We introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets. We propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance.
arXiv Detail & Related papers (2026-03-02T18:46:28Z) - ReasonEdit: Towards Reasoning-Enhanced Image Editing Models [60.902953259781675]
A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder. We show that unlocking the reasoning capabilities of MLLMs can push the boundaries of editing models. Our proposed framework enables image editing in a thinking-editing-reflection loop.
arXiv Detail & Related papers (2025-11-27T17:02:48Z) - Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations [8.479321655643195]
We introduce reasoning video editing, a task where video editing models must interpret implicit queries through multi-hop reasoning to infer editing targets before executing modifications. RIVER decouples reasoning from generation through digital twin representations of video content that preserve spatial relationships, temporal trajectories, and semantic attributes. RIVER training uses reinforcement learning with rewards that evaluate reasoning accuracy and generation quality.
arXiv Detail & Related papers (2025-11-18T03:37:19Z) - REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding [23.684146245231457]
Long-form video understanding involves richer and more dynamic visual input, yet purely text-based reflection mechanisms lack cross-modal interaction capabilities. We propose REVISOR, a novel framework for tool-augmented multimodal reflection.
arXiv Detail & Related papers (2025-11-17T06:25:12Z) - Taming Flow-based I2V Models for Creative Video Editing [64.67801702413122]
Video editing, which aims to manipulate videos according to user intent, remains an emerging challenge. Most existing image-conditioned video editing methods require inversion with model-specific designs or extensive optimization. We propose IF-V2V, an inversion-free method that adapts off-the-shelf flow-matching-based I2V models for video editing without significant computational overhead.
arXiv Detail & Related papers (2025-09-26T05:57:04Z) - EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning [58.53074381801114]
We introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning. We present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions.
arXiv Detail & Related papers (2025-09-24T17:59:30Z) - Low-Cost Test-Time Adaptation for Robust Video Editing [4.707015344498921]
Video editing is a critical component of content creation that transforms raw footage into coherent works aligned with specific visual and narrative objectives. Existing approaches face two major challenges: temporal inconsistencies due to failure in capturing complex motion patterns, and overfitting to simple prompts arising from limitations in UNet backbone architectures. We present Vid-TTA, a lightweight test-time adaptation framework that personalizes optimization for each test video during inference through self-supervised auxiliary tasks.
arXiv Detail & Related papers (2025-07-29T14:31:17Z) - StableV2V: Stablizing Shape Consistency in Video-to-Video Editing [11.09708780767668]
We present StableV2V, a shape-consistent video editing method.
Our method decomposes the editing pipeline into sequential procedures: it edits the first video frame, establishes an alignment between the delivered motions and the user prompts, and finally propagates the edited contents to all other frames based on this alignment.
Experimental results and analyses demonstrate the superior performance, visual consistency, and inference efficiency of our method compared to existing state-of-the-art approaches.
arXiv Detail & Related papers (2024-11-17T11:48:01Z) - EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models [16.045012576543474]
Text-based video editing has emerged as a promising field, enabling precise modifications to videos based on text prompts. Existing evaluations are limited and inconsistent, typically summarizing overall performance with a single score. We propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models.
arXiv Detail & Related papers (2024-09-15T08:43:18Z) - Zero-Shot Video Editing through Adaptive Sliding Score Distillation [51.57440923362033]
This study proposes a novel paradigm of video-based score distillation, facilitating direct manipulation of original video content.
We propose an Adaptive Sliding Score Distillation strategy, which incorporates both global and local video guidance to reduce the impact of editing errors (the standard score-distillation gradient that such methods build on is sketched after this list).
arXiv Detail & Related papers (2024-06-07T12:33:59Z) - In-N-Out: Faithful 3D GAN Inversion with Volumetric Decomposition for Face Editing [28.790900756506833]
3D-aware GANs offer new capabilities for view synthesis while preserving the editing functionalities of their 2D counterparts.
GAN inversion is a crucial step that seeks the latent code to reconstruct input images or videos, subsequently enabling diverse editing tasks through manipulation of this latent code.
We address the difficulty of inverting inputs that contain out-of-distribution (OOD) objects by explicitly modeling those objects from the input in 3D-aware GANs.
arXiv Detail & Related papers (2023-02-09T18:59:56Z)
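As background for the Adaptive Sliding Score Distillation entry above, the standard score-distillation (SDS) gradient that such methods adapt is reproduced below. This is the well-known baseline formulation, not the paper's adaptive sliding variant, which additionally incorporates global and local video guidance.

```latex
% Standard score-distillation (SDS) gradient: a pretrained diffusion model
% \hat{\epsilon}_\phi scores a generated sample x = g(\theta), and the
% mismatch with the injected noise \epsilon drives updates to \theta.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta),\quad x_t = \alpha_t x + \sigma_t \epsilon,
\quad \epsilon \sim \mathcal{N}(0, I).
```

Here $w(t)$ is a timestep weighting, $y$ is the conditioning prompt, and $(\alpha_t, \sigma_t)$ define the forward noising schedule.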