In-Context Learning with Unpaired Clips for Instruction-based Video Editing
- URL: http://arxiv.org/abs/2510.14648v1
- Date: Thu, 16 Oct 2025 13:02:11 GMT
- Title: In-Context Learning with Unpaired Clips for Instruction-based Video Editing
- Authors: Xinyao Liao, Xianfang Zeng, Ziye Song, Zhoujie Fu, Gang Yu, Guosheng Lin
- Abstract summary: We introduce a low-cost pretraining strategy for instruction-based video editing. Our framework first pretrains on approximately 1M real video clips to learn basic editing concepts. Our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the rapid progress of instruction-based image editing, its extension to video remains underexplored, primarily due to the prohibitive cost and complexity of constructing large-scale paired video editing datasets. To address this challenge, we introduce a low-cost pretraining strategy for instruction-based video editing that leverages in-context learning from unpaired video clips. We show that pretraining a foundation video generation model with this strategy endows it with general editing capabilities, such as adding, replacing, or deleting operations, according to input editing instructions. The pretrained model can then be efficiently refined with a small amount of high-quality paired editing data. Built upon HunyuanVideoT2V, our framework first pretrains on approximately 1M real video clips to learn basic editing concepts, and subsequently fine-tunes on fewer than 150k curated editing pairs to extend more editing tasks and improve the editing quality. Comparative experiments show that our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity, achieving a 12% improvement in editing instruction following and a 15% improvement in editing quality.
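The abstract describes a two-stage recipe: in-context pretraining on unpaired clips, then fine-tuning on a small paired editing set. Below is a minimal PyTorch sketch of how such in-context conditioning might look; the module, tensor layout, and loss are illustrative assumptions of mine, not the authors' implementation.

```python
import torch
import torch.nn as nn

class InContextVideoEditor(nn.Module):
    """Toy stand-in for a video generation backbone (e.g. HunyuanVideoT2V)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

    def forward(self, instruction, source_clip, noisy_target):
        # In-context conditioning: instruction tokens and source-clip tokens
        # are concatenated with the noisy target tokens along the sequence
        # axis, so self-attention can relate "before" and "after" directly.
        seq = torch.cat([instruction, source_clip, noisy_target], dim=1)
        out = self.block(seq)
        return out[:, -noisy_target.size(1):]  # predictions for target tokens only

def training_step(model, batch):
    # Stage 1: source and target come from two *unpaired* clips linked only
    # by a text instruction, teaching coarse edit concepts cheaply.
    # Stage 2: source/target are a real before/after pair from the curated set.
    pred = model(batch["instruction"], batch["source"], batch["noisy_target"])
    return nn.functional.mse_loss(pred, batch["target"])
```

The same forward pass serves both stages; only the data changes, which is what makes the unpaired pretraining low-cost.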
Related papers
- Region-Constraint In-Context Generation for Instructional Video Editing (2025-12-19)
We present ReCo, a new instructional video editing paradigm that models constraints between editing and non-editing regions during in-context generation. We also propose ReCo-Data, a large-scale, high-quality video editing dataset comprising 500K instruction-video pairs for model training.
- EasyV2V: A High-quality Instruction-based Video Editing Framework (2025-12-18)
EasyV2V is a framework for instruction-based video editing. It works with flexible inputs (e.g., video+text, video+mask+reference) and achieves state-of-the-art video editing results.
- EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning (2025-09-24)
We introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities (text, image, and video) as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning. We also present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions.
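As a rough sketch of the "unified token sequence" idea (my reading, with invented names): each modality's tokens receive a learned modality tag and are packed into one sequence that a standard Transformer then attends over.

```python
import torch
import torch.nn as nn

MODALITY_IDS = {"text": 0, "image": 1, "video": 2}

class SequencePacker(nn.Module):
    """Packs heterogeneous token streams into one self-attention sequence."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.modality_emb = nn.Embedding(len(MODALITY_IDS), dim)

    def forward(self, parts):
        """parts: list of (modality_name, tokens[B, T, D]) in document order."""
        tagged = []
        for name, tokens in parts:
            tag = self.modality_emb(torch.tensor(MODALITY_IDS[name]))
            tagged.append(tokens + tag)   # broadcast tag over batch and time
        return torch.cat(tagged, dim=1)   # one sequence for self-attention
```

A call like `packer([("text", t), ("video", v)])` yields a single sequence that any encoder can consume, which is what lets one model cover both image and video editing.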
- InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction (2025-03-26)
We present InsViE-1M, a high-quality instruction-based video editing dataset with 1M triplets. We generate and filter a variety of video editing triplets from high-quality images. Experiments demonstrate the advantages of the InsViE-1M dataset and the trained model over state-of-the-art works.
- InstructVEdit: A Holistic Approach for Instructional Video Editing (2025-03-22)
InstructVEdit is a full-cycle instructional video editing approach that establishes a reliable dataset curation workflow. It incorporates two model architectural improvements that enhance edit quality while preserving temporal consistency, and an iterative refinement strategy that leverages real-world data to improve generalization and minimize train-test discrepancies.
- VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation (2025-03-18)
We introduce VEGGIE, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. VEGGIE shows strong performance in instructional video editing across different editing skills, outperforming the best instructional baseline as a versatile model.
- Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists (2025-02-10)
We introduce Señorita-2M, a high-quality video editing dataset built by crafting four high-quality, specialized video editing models. We also propose a filtering pipeline to eliminate poorly edited video pairs.
- A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model (2024-11-07)
We propose a two-stage scheme for general video editing. First, unlike previous works that extract scene-specific features, we leverage a pre-trained Vision-Language Model (VLM) to obtain more general, editing-relevant representations. We then propose a Reinforcement Learning (RL)-based editing framework to formulate the editing problem and train a virtual editor to make better sequential editing decisions.
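The summary describes training a "virtual editor" to make sequential editing decisions with RL. A hedged REINFORCE-style sketch of such a loop follows; the policy head, state encoding, and reward (e.g. a VLM scoring the assembled edit) are placeholders of my own, not the paper's formulation.

```python
import torch
import torch.nn as nn

class VirtualEditor(nn.Module):
    """Policy that picks one of n candidate shots at each editing step."""
    def __init__(self, state_dim: int = 128, n_candidates: int = 8):
        super().__init__()
        self.policy = nn.Linear(state_dim, n_candidates)

    def act(self, state):
        dist = torch.distributions.Categorical(logits=self.policy(state))
        action = dist.sample()                 # next shot / cut decision
        return action, dist.log_prob(action)

def episode_loss(editor, states, reward_fn):
    # One editing decision per step; the reward arrives at the end of the
    # episode (e.g. a VLM judging coherence of the finished edit).
    actions, log_probs = [], []
    for state in states:
        action, log_prob = editor.act(state)
        actions.append(action)
        log_probs.append(log_prob)
    reward = reward_fn(actions)
    return -(torch.stack(log_probs).sum() * reward)   # REINFORCE objective
```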
- InstructBrush: Learning Attention-based Instruction Optimization for Image Editing (2024-03-27)
InstructBrush is an inversion method for instruction-based image editing. It extracts editing effects from image pairs as editing instructions, which are then applied to edit new images. The approach achieves superior editing performance and is more semantically consistent with the target editing effects.
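A hedged reading of the inversion idea above: optimize a soft instruction embedding so that a frozen editor maps the source image to the edited target, then reuse the recovered embedding on new images. The `editor` callable and dimensions below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def invert_instruction(editor, source, target, dim=768, steps=200, lr=1e-2):
    """editor(image, instr_embedding) -> edited image; kept frozen here."""
    instr = torch.zeros(1, dim, requires_grad=True)   # learnable soft instruction
    optimizer = torch.optim.Adam([instr], lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(editor(source, instr), target)  # match the edit effect
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return instr.detach()   # reuse via editor(new_image, instr)
```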