Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
- URL: http://arxiv.org/abs/2603.02175v2
- Date: Thu, 05 Mar 2026 17:36:07 GMT
- Title: Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
- Authors: Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou,
- Abstract summary: We introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets. We propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance.
- Score: 55.32799307123252
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.
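The abstract sketches two ideas: converting existing (source video, instruction, edited video) pairs into training quadruplets by synthesizing a reference image, and conditioning the editor on that reference through learnable queries fused with latent visual features. A minimal PyTorch sketch of what such a quadruplet record and reference-guidance module could look like is given below; the field names, class names, and the cross-attention-plus-concatenation fusion are illustrative assumptions, not the released Kiwi-Edit implementation.

```python
# Illustrative sketch only -- field names and the fusion scheme are assumptions,
# not the released Kiwi-Edit implementation.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class EditingQuadruplet:
    """One assumed training example: source clip, instruction, synthesized reference, target clip."""
    source_video: torch.Tensor     # (T, C, H, W) frames of the original clip
    instruction: str               # natural-language edit instruction
    reference_image: torch.Tensor  # (C, H, W) synthesized reference scaffold
    edited_video: torch.Tensor     # (T, C, H, W) ground-truth edited clip


class ReferenceGuidance(nn.Module):
    """Hypothetical module: learnable queries attend over latent reference features,
    producing a compact set of guidance tokens alongside the raw latents."""

    def __init__(self, dim: int = 768, num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ref_latents: torch.Tensor) -> torch.Tensor:
        # ref_latents: (B, N, dim) latent visual features of the reference image
        b = ref_latents.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        guidance, _ = self.attn(q, ref_latents, ref_latents)
        # Concatenate query-derived summaries with the raw latents so downstream
        # layers see both semantic guidance and fine-grained visual detail.
        return torch.cat([self.norm(guidance), ref_latents], dim=1)


if __name__ == "__main__":
    module = ReferenceGuidance()
    ref = torch.randn(2, 256, 768)   # e.g. 16x16 patch latents per reference image
    print(module(ref).shape)         # torch.Size([2, 272, 768])
```

In practice the guidance tokens would be injected into the video editing backbone (for example via cross-attention), and the quadruplets would come from the paper's data generation pipeline rather than being constructed by hand.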
Related papers
- NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing [26.74471251505078]
NOVA: Sparse Control & Dense Synthesis is a new framework for unpaired video editing. Our experiments demonstrate that NOVA outperforms existing approaches in edit fidelity, motion preservation, and temporal coherence.
arXiv Detail & Related papers (2026-03-03T09:41:06Z) - EasyV2V: A High-quality Instruction-based Video Editing Framework [108.78294392167017]
EasyV2V is a framework for instruction-based video editing. EasyV2V works with flexible inputs, e.g., video+text or video+mask+reference, and achieves state-of-the-art video editing results.
arXiv Detail & Related papers (2025-12-18T18:59:57Z) - VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization [31.89256250882701]
VIVA is a scalable framework for instruction-based video editing. It uses VLM-guided encoding and reward optimization. We show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods.
arXiv Detail & Related papers (2025-12-18T18:58:42Z) - Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset [103.36732993526545]
We develop Ditto, a framework for instruction-based video editing. We build a new dataset of one million high-fidelity video editing examples. We train our model, Editto, on Ditto-1M with a curriculum learning strategy.
arXiv Detail & Related papers (2025-10-17T15:31:40Z) - EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning [58.53074381801114]
We introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning (a generic sketch of this idea appears after this list). We present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions.
arXiv Detail & Related papers (2025-09-24T17:59:30Z) - VidCLearn: A Continual Learning Approach for Text-to-Video Generation [11.861060763379236]
VidCLearn is a continual learning framework for text-to-video generation. We introduce a novel temporal consistency loss to enhance motion smoothness and a video retrieval module to provide structural guidance at inference (a generic sketch of such a consistency loss also appears after this list). Our architecture is also designed to be more computationally efficient than existing models while retaining satisfactory generation performance.
arXiv Detail & Related papers (2025-09-21T07:34:19Z) - InstructVEdit: A Holistic Approach for Instructional Video Editing [28.13673601495108]
InstructVEdit is a full-cycle instructional video editing approach that establishes a reliable dataset curation workflow. It incorporates two model architectural improvements to enhance edit quality while preserving temporal consistency. It also proposes an iterative refinement strategy leveraging real-world data to enhance generalization and minimize train-test discrepancies.
arXiv Detail & Related papers (2025-03-22T04:12:20Z) - VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation [70.87745520234012]
We introduce VEGGIE, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model.
arXiv Detail & Related papers (2025-03-18T15:31:12Z) - InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
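The EditVerse entry above describes representing text, image, and video as one token sequence processed with self-attention. The sketch below illustrates that general idea under assumed embedding and patch dimensions; the module name, projections, and sizes are placeholders, not EditVerse's actual architecture.

```python
# Illustrative sketch of a "unified token sequence" -- module names and
# dimensions are assumptions, not EditVerse's code.
import torch
import torch.nn as nn


class UnifiedSequenceModel(nn.Module):
    """Embed text, image, and video tokens into one sequence and apply self-attention."""

    def __init__(self, dim: int = 512, vocab_size: int = 32000, num_layers: int = 4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.image_proj = nn.Linear(3 * 16 * 16, dim)   # flattened 16x16 RGB patches
        self.video_proj = nn.Linear(3 * 16 * 16, dim)   # per-frame patches, same layout
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, text_ids, image_patches, video_patches):
        # text_ids: (B, Lt); image_patches: (B, Li, 768); video_patches: (B, Lv, 768)
        tokens = torch.cat(
            [self.text_embed(text_ids),
             self.image_proj(image_patches),
             self.video_proj(video_patches)],
            dim=1,
        )
        # Self-attention over the joint sequence lets every modality condition on the others.
        return self.encoder(tokens)


if __name__ == "__main__":
    model = UnifiedSequenceModel()
    out = model(
        torch.randint(0, 32000, (1, 12)),   # 12 text tokens
        torch.randn(1, 64, 768),            # 64 image patches
        torch.randn(1, 128, 768),           # 128 video patches
    )
    print(out.shape)                        # torch.Size([1, 204, 512])
```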
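The VidCLearn entry mentions a temporal consistency loss for motion smoothness. A common form of such an objective penalizes differences between adjacent generated frames; the sketch below shows that generic version, which is an assumption rather than VidCLearn's exact formulation.

```python
# Generic temporal consistency loss -- an assumed, standard formulation,
# not VidCLearn's actual loss.
import torch


def temporal_consistency_loss(frames: torch.Tensor) -> torch.Tensor:
    """Penalize large changes between consecutive generated frames.

    frames: (B, T, C, H, W) generated video clip.
    Returns a scalar: mean squared difference between adjacent frames.
    """
    diffs = frames[:, 1:] - frames[:, :-1]
    return diffs.pow(2).mean()


if __name__ == "__main__":
    clip = torch.randn(2, 8, 3, 64, 64)
    print(temporal_consistency_loss(clip).item())
```

Such a term is typically scaled by a small weight and added to the main generation loss so that smoothness does not override fidelity to the instruction.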