EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
- URL: http://arxiv.org/abs/2509.20360v2
- Date: Thu, 25 Sep 2025 22:11:13 GMT
- Title: EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
- Authors: Xuan Ju, Tianyu Wang, Yuqian Zhou, He Zhang, Qing Liu, Nanxuan Zhao, Zhifei Zhang, Yijun Li, Yuanhao Cai, Shaoteng Liu, Daniil Pakhomov, Zhe Lin, Soo Ye Kim, Qiang Xu
- Abstract summary: We introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning. We present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions.
- Score: 58.53074381801114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in foundation models highlight a clear trend toward unification and scaling, showing emergent capabilities across diverse domains. While image generation and editing have rapidly transitioned from task-specific to unified frameworks, video generation and editing remain fragmented due to architectural limitations and data scarcity. In this work, we introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning, natural cross-modal knowledge transfer, and flexible handling of inputs and outputs with arbitrary resolutions and durations. To address the lack of video editing training data, we design a scalable data pipeline that curates 232K video editing samples and combines them with large-scale image and video datasets for joint training. Furthermore, we present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions. Extensive experiments and user studies demonstrate that EditVerse achieves state-of-the-art performance, surpassing existing open-source and commercial models, while exhibiting emergent editing and generation abilities across modalities.
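The abstract's core architectural idea, flattening text, image, and video into a single token sequence so that full self-attention can mix information across modalities, can be sketched as a toy example. This is a minimal illustration and not the authors' code: the dimensions, random projections, and single-head attention are assumptions made for clarity.

```python
# Toy sketch of a unified token sequence with cross-modal self-attention,
# in the spirit of EditVerse's design. Not the paper's implementation:
# shapes, projections, and the single attention head are illustrative only.
import numpy as np

def self_attention(x: np.ndarray, seed: int = 0) -> np.ndarray:
    """Single-head scaled dot-product self-attention over a token sequence."""
    d = x.shape[-1]
    rng = np.random.default_rng(seed)
    # Random projections stand in for learned weight matrices.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # every token attends to every other, across modalities

d_model = 64
text_tokens  = np.random.randn(12, d_model)  # e.g. an editing instruction
image_tokens = np.random.randn(16, d_model)  # flattened image patches
video_tokens = np.random.randn(48, d_model)  # flattened spatio-temporal patches

# The unified sequence: modalities are simply concatenated, so attention can
# transfer knowledge between them without modality-specific branches, and the
# sequence length is free to vary with input resolution and duration.
sequence = np.concatenate([text_tokens, image_tokens, video_tokens], axis=0)
out = self_attention(sequence)
print(out.shape)  # (76, 64): one output token per input token
```

Because the sequence is just a concatenation, inputs of arbitrary resolution and duration only change the sequence length, which is the flexibility the abstract highlights.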
Related papers
- Instruction-based Image Editing with Planning, Reasoning, and Generation [52.0364486403062]
Prior work utilizes a chain of large language models, object segmentation models, and editing models for this task. We aim to bridge understanding and generation via a new multi-modality model that brings planning and reasoning abilities to instruction-based image editing models. Our method has competitive editing abilities on complex real-world images.
arXiv Detail & Related papers (2026-02-26T04:56:02Z) - In-Context Learning with Unpaired Clips for Instruction-based Video Editing [51.943707933717185]
We introduce a low-cost pretraining strategy for instruction-based video editing. Our framework first pretrains on approximately 1M real video clips to learn basic editing concepts. Our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity.
arXiv Detail & Related papers (2025-10-16T13:02:11Z) - DreamVE: Unified Instruction-based Image and Video Editing [48.59380808274814]
We introduce DreamVE, a unified model for instruction-based image and video editing. We propose a two-stage training strategy: first image editing, then video editing. We present comprehensive training data pipelines, including collage-based and generative model-based data synthesis.
arXiv Detail & Related papers (2025-08-08T07:20:30Z) - VINCIE: Unlocking In-context Image Editing from Video [62.88977098700917]
In this work, we explore whether an in-context image editing model can be learned directly from videos. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks. Our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks.
arXiv Detail & Related papers (2025-06-12T17:46:54Z) - InstructVEdit: A Holistic Approach for Instructional Video Editing [28.13673601495108]
InstructVEdit is a full-cycle instructional video editing approach that establishes a reliable dataset curation workflow. It incorporates two model architectural improvements to enhance edit quality while preserving temporal consistency. It also proposes an iterative refinement strategy leveraging real-world data to enhance generalization and minimize train-test discrepancies.
arXiv Detail & Related papers (2025-03-22T04:12:20Z) - VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation [67.31149310468801]
We introduce VEGGIE, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model.
arXiv Detail & Related papers (2025-03-18T15:31:12Z) - I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models [18.36472998650704]
We introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model.
Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits.
arXiv Detail & Related papers (2024-05-26T11:47:40Z) - EffiVED:Efficient Video Editing via Text-instruction Diffusion Models [9.287394166165424]
We introduce EffiVED, an efficient diffusion-based model that supports instruction-guided video editing.
We transform vast image editing datasets and open-world videos into a high-quality dataset for training EffiVED.
arXiv Detail & Related papers (2024-03-18T08:42:08Z) - Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts [116.05656635044357]
We propose a generic video editing framework called Make-A-Protagonist.
Specifically, we leverage multiple experts to parse source video, target visual and textual clues, and propose a visual-textual-based video generation model.
Results demonstrate the versatile and remarkable editing capabilities of Make-A-Protagonist.
arXiv Detail & Related papers (2023-05-15T17:59:03Z) - Structure and Content-Guided Video Synthesis with Diffusion Models [13.464501385061032]
We present a structure and content-guided video diffusion model that edits videos based on visual or textual descriptions of the desired output.
Our model is trained jointly on images and videos which also exposes explicit control of temporal consistency through a novel guidance method.
arXiv Detail & Related papers (2023-02-06T18:50:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.