Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations
- URL: http://arxiv.org/abs/2511.14100v1
- Date: Tue, 18 Nov 2025 03:37:19 GMT
- Title: Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations
- Authors: Yiqing Shen, Chenjia Li, Mathias Unberath
- Abstract summary: We introduce reasoning video editing, a task where video editing models must interpret implicit queries through multi-hop reasoning to infer editing targets before executing modifications. RIVER decouples reasoning from generation through digital twin representations of video content that preserve spatial relationships, temporal trajectories, and semantic attributes. RIVER training uses reinforcement learning with rewards that evaluate reasoning accuracy and generation quality.
- Score: 8.479321655643195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-driven video editing enables users to modify video content using only text queries. While existing methods can modify video content when explicit descriptions of editing targets with precise spatial locations and temporal boundaries are provided, these requirements become impractical when users conceptualize edits through implicit queries referencing semantic properties or object relationships. We introduce reasoning video editing, a task where video editing models must interpret implicit queries through multi-hop reasoning to infer editing targets before executing modifications, along with RIVER (Reasoning-based Implicit Video Editor), a first model attempting to solve this complex task. RIVER decouples reasoning from generation through digital twin representations of video content that preserve spatial relationships, temporal trajectories, and semantic attributes. A large language model then processes this representation jointly with the implicit query, performing multi-hop reasoning to determine the required modifications, and outputs structured instructions that guide a diffusion-based editor to execute pixel-level changes. RIVER training uses reinforcement learning with rewards that evaluate reasoning accuracy and generation quality. Finally, we introduce RVEBenchmark, a benchmark of 100 videos with 519 implicit queries spanning three levels and categories of reasoning complexity, designed specifically for reasoning video editing. RIVER demonstrates the best performance on the proposed RVEBenchmark and also achieves state-of-the-art performance on two additional video editing benchmarks (VEGGIE and FiVE), where it surpasses six baseline methods.
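As a rough illustration of the decoupled design the abstract describes, the sketch below mocks the three stages (digital twin extraction, LLM-based multi-hop reasoning, diffusion-based execution) with plain Python stand-ins. Every name here (`DigitalTwin`, `reason_over_twin`, `diffusion_edit`) is a hypothetical placeholder, not RIVER's published interface.

```python
# Minimal sketch of the decoupled reasoning-then-editing flow described in the
# abstract. All names are hypothetical stand-ins, not RIVER's actual API.
from dataclasses import dataclass, field

@dataclass
class DigitalTwin:
    """Structured, pixel-free description of the video content."""
    objects: dict                                   # object id -> semantic attributes
    trajectories: dict                              # object id -> per-frame boxes
    relations: list = field(default_factory=list)   # (subject, predicate, object)

def reason_over_twin(twin: DigitalTwin, implicit_query: str) -> dict:
    """Stand-in for the LLM's multi-hop reasoning: map an implicit query to a
    structured edit instruction grounded in the twin."""
    # A real system would prompt an LLM with the serialized twin plus the query.
    target = next(iter(twin.objects))               # placeholder target selection
    return {"target_id": target,
            "operation": "replace",
            "frames": sorted(twin.trajectories[target]),
            "prompt": f"apply edit implied by: {implicit_query}"}

def diffusion_edit(video_frames: list, instruction: dict) -> list:
    """Stand-in for the diffusion-based editor executing pixel-level changes."""
    return video_frames  # a real editor would modify the instructed region

# Toy usage: one object tracked over three frames.
twin = DigitalTwin(objects={"obj_1": {"class": "cup", "color": "red"}},
                   trajectories={"obj_1": {0: (10, 10, 40, 40),
                                           1: (12, 10, 42, 40),
                                           2: (14, 10, 44, 40)}},
                   relations=[("obj_1", "on", "table")])
instruction = reason_over_twin(twin, "recolor the object on the table")
edited = diffusion_edit([None, None, None], instruction)
print(instruction)
```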
Related papers
- VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization [31.89256250882701]
VIVA is a scalable framework for instruction-based video editing. It uses VLM-guided encoding and reward optimization. We show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods.
arXiv Detail & Related papers (2025-12-18T18:58:42Z) - ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning [57.08352504712699]
Unified video models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing. We introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. We propose ReViSE, a framework that unifies generation and evaluation within a single architecture.
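The summary's "unifies generation and evaluation within a single architecture" suggests a self-reflective loop: generate an edit, score it with the model acting as its own critic, and retry when the score falls short. The sketch below is a hypothetical rendering of that loop; `generate_edit`, `self_evaluate`, and the threshold are illustrative stand-ins, not ReViSE's published procedure.

```python
# Hypothetical self-reflective edit loop: the same model both generates an
# edit and scores its plausibility, retrying until the critic accepts.
import random

def generate_edit(video, instruction, seed):
    return {"video": video, "instruction": instruction, "seed": seed}

def self_evaluate(edit) -> float:
    """Stand-in critic head: returns a plausibility score in [0, 1]."""
    random.seed(edit["seed"])
    return random.random()

def self_reflective_edit(video, instruction, threshold=0.7, max_tries=5):
    best, best_score = None, -1.0
    for attempt in range(max_tries):
        candidate = generate_edit(video, instruction, seed=attempt)
        score = self_evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
        if score >= threshold:          # critic accepts: stop reflecting
            break
    return best, best_score

edit, score = self_reflective_edit("clip.mp4", "make the ball bounce higher")
print(f"accepted edit with plausibility {score:.2f}")
```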
arXiv Detail & Related papers (2025-12-10T18:57:09Z) - Beyond Simple Edits: Composed Video Retrieval with Dense Modifications [96.46069692338645]
We introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments. Dense-WebVid-CoVR consists of 1.6 million samples with modification text around seven times denser than in its existing counterpart. We develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion.
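The Cross-Attention (CA) fusion named above is a standard pattern in which modification-text tokens attend over visual tokens. A minimal PyTorch sketch follows, with dimensions chosen arbitrarily; it illustrates the general mechanism, not the paper's exact architecture.

```python
# Minimal cross-attention fusion block: text tokens query the visual tokens,
# yielding text features enriched with visual context. Dimensions and layer
# choices are illustrative only.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        # query = modification text, key/value = visual features
        fused, _ = self.attn(query=text_tokens,
                             key=visual_tokens,
                             value=visual_tokens)
        return self.norm(text_tokens + fused)   # residual + norm

fusion = CrossAttentionFusion()
text = torch.randn(2, 12, 256)    # 12 modification-text tokens per sample
video = torch.randn(2, 64, 256)   # 64 visual tokens per sample
print(fusion(text, video).shape)  # torch.Size([2, 12, 256])
```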
arXiv Detail & Related papers (2025-08-19T17:59:39Z) - UNIC: Unified In-Context Video Editing [76.76077875564526]
UNified In-Context Video Editing (UNIC) is a framework that unifies diverse video editing tasks within a single model in an in-context manner. We introduce task-aware RoPE to facilitate consistent temporal positional encoding, and a condition bias that enables the model to clearly differentiate between editing tasks. Results demonstrate that our unified approach achieves superior performance on each task and exhibits emergent task composition abilities.
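As a speculative reading of "task-aware RoPE" and "condition bias", the sketch below shifts rotary position indices by a per-task offset and adds a learned per-task embedding to the condition tokens. Both mechanisms, and all names and dimensions, are assumptions inferred from the abstract, not UNIC's actual implementation.

```python
# Speculative sketch: rotary position indices shifted by a per-task offset
# ("task-aware RoPE") plus a learned per-task embedding ("condition bias").
# This interprets the abstract; it is not UNIC's published code.
import torch
import torch.nn as nn

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (batch, seq, dim)."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = positions[:, :, None] * freqs          # (batch, seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class TaskAwareConditioning(nn.Module):
    def __init__(self, num_tasks, dim, task_offset=1000):
        super().__init__()
        self.bias = nn.Embedding(num_tasks, dim)    # per-task condition bias
        self.task_offset = task_offset              # separates tasks positionally

    def forward(self, cond_tokens, task_id):
        b, s, _ = cond_tokens.shape
        # Shift positions so each task occupies a distinct positional range.
        pos = torch.arange(s, dtype=torch.float32).expand(b, s) \
              + task_id * self.task_offset
        x = cond_tokens + self.bias(torch.tensor(task_id))
        return rope(x, pos)

mod = TaskAwareConditioning(num_tasks=4, dim=64)
tokens = torch.randn(2, 16, 64)
print(mod(tokens, task_id=2).shape)  # torch.Size([2, 16, 64])
```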
arXiv Detail & Related papers (2025-06-04T17:57:43Z) - VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation [70.87745520234012]
We introduce VEGGIE, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. VEGGIE shows strong performance across different instructional video editing skills, outperforming the best instructional baseline as a versatile model.
arXiv Detail & Related papers (2025-03-18T15:31:12Z) - VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement [63.4357918830628]
VideoRepair is a model-agnostic, training-free video refinement framework. It identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback. VideoRepair substantially outperforms recent baselines across various text-video alignment metrics.
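The misalignment-evaluation-plus-localized-refinement idea reads naturally as a training-free loop: score text-video alignment per region, pick the worst region, and regenerate only there. The sketch below mocks that loop; `alignment_scores` and `regenerate_region` are placeholder stand-ins, not VideoRepair's real components.

```python
# Hypothetical evaluate-and-refine loop: score per-region alignment, then
# regenerate only the worst-aligned region. All components are stand-ins.
import random

def alignment_scores(video, prompt, regions):
    """Stand-in evaluator: per-region text-video alignment in [0, 1]."""
    random.seed(hash((video, prompt)) % (2 ** 32))
    return {r: random.random() for r in regions}

def regenerate_region(video, region, feedback):
    """Stand-in localized editor: re-synthesizes one region from feedback."""
    return f"{video}+fixed({region})"

def video_repair(video, prompt, regions, threshold=0.6, max_rounds=3):
    for _ in range(max_rounds):
        scores = alignment_scores(video, prompt, regions)
        worst = min(scores, key=scores.get)
        if scores[worst] >= threshold:   # everything aligned well enough
            break
        feedback = f"region {worst} misaligned with: {prompt}"
        video = regenerate_region(video, worst, feedback)
    return video

print(video_repair("clip.mp4", "a red car turning left",
                   regions=["background", "car", "road"]))
```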
arXiv Detail & Related papers (2024-11-22T18:31:47Z) - StableV2V: Stablizing Shape Consistency in Video-to-Video Editing [11.09708780767668]
In this paper, we present StableV2V, a shape-consistent video editing method.
Our method decomposes the editing pipeline into several sequential procedures: it edits the first video frame, establishes an alignment between the delivered motions and the user prompt, and finally propagates the edited content to all other frames based on that alignment.
Experimental results and analyses demonstrate the superior performance, visual consistency, and inference efficiency of our method compared to existing state-of-the-art studies.
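Rendered as code, the three-stage decomposition described above (edit the first frame, align motion with the prompt, propagate) looks roughly like the sketch below; the function bodies are placeholders, not StableV2V's actual modules.

```python
# Rough sketch of the sequential decomposition: edit frame 0, derive a motion
# alignment, then propagate the edit along it. Bodies are placeholders.
def edit_first_frame(frame, prompt):
    return f"edited({frame})"

def align_motion(frames, edited_first, prompt):
    """Stand-in for motion/prompt alignment (e.g. flow warped to the edit)."""
    return [f"motion({i})" for i in range(len(frames))]

def propagate(frames, edited_first, alignment):
    """Warp the edited first frame to every later frame along the alignment."""
    out = [edited_first]
    for motion in alignment[1:]:
        out.append(f"warp({edited_first}, {motion})")
    return out

frames = [f"frame_{i}" for i in range(4)]
prompt = "turn the dog into a fox"
first = edit_first_frame(frames[0], prompt)
alignment = align_motion(frames, first, prompt)
print(propagate(frames, first, alignment))
```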
arXiv Detail & Related papers (2024-11-17T11:48:01Z) - RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives [74.01707548681405]
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework. Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. The proposed framework demonstrates impressive versatility in video-to-paragraph generation and video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
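A hedged sketch of the video-to-paragraph-to-video loop described above: auto-caption the input video, apply the user's textual revision to the paragraph, and regenerate conditioned on the revised text. All function names are illustrative assumptions, not RACCooN's interface.

```python
# Illustrative video-to-paragraph-to-video loop: caption the input video,
# apply the user's textual revision, regenerate from the revised paragraph.
def video_to_paragraph(video):
    """Stand-in captioner producing an editable narrative."""
    return "A dog runs across a sunny park."

def revise(paragraph, user_edit):
    # The user edits the auto-generated narrative instead of pixels.
    return paragraph.replace(*user_edit)

def paragraph_to_video(paragraph, source_video):
    """Stand-in generator conditioned on the revised narrative."""
    return f"generate({paragraph!r}, conditioned_on={source_video})"

story = video_to_paragraph("park.mp4")
story = revise(story, ("dog", "corgi"))
print(paragraph_to_video(story, "park.mp4"))
```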
arXiv Detail & Related papers (2024-05-28T17:46:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided (including this list) and is not responsible for any consequences of its use.