Video4Edit: Viewing Image Editing as a Degenerate Temporal Process
- URL: http://arxiv.org/abs/2511.18131v1
- Date: Sat, 22 Nov 2025 17:30:55 GMT
- Title: Video4Edit: Viewing Image Editing as a Degenerate Temporal Process
- Authors: Xiaofan Li, Yanpeng Sun, Chenming Wu, Fan Duan, YuAn Wang, Weihao Bo, Yumeng Zhang, Dingkang Liang,
- Abstract summary: Recent advances in multimodal foundation models have propelled instruction-driven image generation and editing into a genuinely cross-modal, cooperative regime. We revisit this challenge through the lens of temporal modeling. This perspective allows us to transfer single-frame evolution priors from video pre-training, enabling a highly data-efficient fine-tuning regime.
- Score: 24.8621496006791
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We observe that recent advances in multimodal foundation models have propelled instruction-driven image generation and editing into a genuinely cross-modal, cooperative regime. Nevertheless, state-of-the-art editing pipelines remain costly: beyond training large diffusion/flow models, they require curating massive high-quality triplets of {instruction, source image, edited image} to cover diverse user intents. Moreover, the fidelity of visual replacements hinges on how precisely the instruction references the target semantics. We revisit this challenge through the lens of temporal modeling: if video can be regarded as a full temporal process, then image editing can be seen as a degenerate temporal process. This perspective allows us to transfer single-frame evolution priors from video pre-training, enabling a highly data-efficient fine-tuning regime. Empirically, our approach matches the performance of leading open-source baselines while using only about one percent of the supervision demanded by mainstream editing models.
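To make the framing concrete, here is a minimal sketch of how an {instruction, source image, edited image} triplet can be recast as a two-frame clip for a video diffusion backbone. The `ToyVideoDenoiser` class, the `edit_pair_loss` function, and the conditioning interface are hypothetical illustrations assumed for this sketch, not the paper's released code.

```python
# Minimal sketch of the "degenerate temporal process" idea, assuming a
# pretrained video diffusion backbone; everything below is a hypothetical
# placeholder, not the paper's actual implementation.
import torch
import torch.nn.functional as F


class ToyVideoDenoiser(torch.nn.Module):
    """Hypothetical stand-in for a pretrained video diffusion model that
    denoises clips shaped (batch, frames, channels, height, width)."""

    def __init__(self) -> None:
        super().__init__()
        self.net = torch.nn.Conv3d(3, 3, kernel_size=3, padding=1)

    def forward(self, clip: torch.Tensor, t: torch.Tensor,
                instr_emb: torch.Tensor) -> torch.Tensor:
        # A real backbone would condition on the timestep t and the
        # instruction embedding; this toy layer only shows tensor shapes.
        return self.net(clip.transpose(1, 2)).transpose(1, 2)


def edit_pair_loss(model, source, edited, instr_emb, alphas_cumprod):
    """Treat {source, edited} as a two-frame clip: frame 0 is the source,
    frame 1 is the edit, so editing degenerates to next-frame prediction."""
    clip = torch.stack([source, edited], dim=1)           # (B, 2, C, H, W)
    t = torch.randint(0, alphas_cumprod.numel(), (clip.size(0),))
    noise = torch.randn_like(clip)
    a = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    noisy = a.sqrt() * clip + (1.0 - a).sqrt() * noise
    # Keep the source frame clean so the model only has to "evolve" the
    # edited frame; one plausible way to realize the degenerate process.
    noisy[:, 0] = clip[:, 0]
    pred = model(noisy, t, instr_emb)
    return F.mse_loss(pred[:, 1], noise[:, 1])
```

Under this framing, the only task-specific supervision is the edit pair itself, which is consistent with the abstract's claim of using roughly one percent of the triplet data demanded by mainstream editing models.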
Related papers
- iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation [60.66986667921744]
iMontage is a unified framework designed to repurpose a powerful video model into an all-in-one image generator. We propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors.
arXiv Detail & Related papers (2025-11-25T18:54:16Z)
- EditInfinity: Image Editing with Binary-Quantized Generative Models [64.05135380710749]
We investigate the parameter-efficient adaptation of binary-quantized generative models for image editing. Specifically, we propose EditInfinity, which adapts Infinity, a binary-quantized generative model, for image editing. We propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation.
arXiv Detail & Related papers (2025-10-23T05:06:24Z)
- Training-Free Reward-Guided Image Editing via Trajectory Optimal Control [55.64204232819136]
We introduce a novel framework for training-free, reward-guided image editing. We demonstrate that our approach significantly outperforms existing inversion-based training-free baselines.
arXiv Detail & Related papers (2025-09-30T06:34:37Z)
- EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning [58.53074381801114]
We introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning. We present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions.
arXiv Detail & Related papers (2025-09-24T17:59:30Z)
- Inverse-and-Edit: Effective and Fast Image Editing by Cycle Consistency Models [1.9389881806157316]
In this work, we propose a novel framework that enhances image inversion using consistency models. Our method introduces a cycle-consistency optimization strategy that significantly improves reconstruction accuracy. We achieve state-of-the-art performance across various image editing tasks and datasets.
arXiv Detail & Related papers (2025-06-23T20:34:43Z)
- AttentionDrag: Exploiting Latent Correlation Knowledge in Pre-trained Diffusion Models for Image Editing [33.74477787349966]
We propose a novel one-step point-based image editing method, named AttentionDrag. This framework enables semantic consistency and high-quality manipulation without the need for extensive re-optimization or retraining. Our results demonstrate performance that surpasses most state-of-the-art methods at significantly faster speeds.
arXiv Detail & Related papers (2025-06-16T09:42:38Z)
- VINCIE: Unlocking In-context Image Editing from Video [62.88977098700917]
In this work, we explore whether an in-context image editing model can be learned directly from videos. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks. Our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks.
arXiv Detail & Related papers (2025-06-12T17:46:54Z)
- Pathways on the Image Manifold: Image Editing via Video Generation [11.891831122571995]
We reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. Our approach achieves state-of-the-art results on text-based image editing, demonstrating significant improvements in both edit accuracy and image preservation.
arXiv Detail & Related papers (2024-11-25T16:41:45Z)
- Pix2Video: Video Editing using Image Diffusion [43.07444438561277]
We investigate how to use pre-trained image models for text-guided video editing.
Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame; then, we progressively propagate the changes to the remaining frames via self-attention feature injection (a schematic sketch of this two-step loop appears after this list).
We demonstrate that realistic text-guided video edits are possible, without any compute-intensive preprocessing or video-specific finetuning.
arXiv Detail & Related papers (2023-03-22T16:36:10Z)
- Task-agnostic Temporally Consistent Facial Video Editing [84.62351915301795]
We propose a task-agnostic, temporally consistent facial video editing framework.
Based on a 3D reconstruction model, our framework is designed to handle several editing tasks in a more unified and disentangled manner.
Compared with the state-of-the-art facial image editing methods, our framework generates video portraits that are more photo-realistic and temporally smooth.
arXiv Detail & Related papers (2020-07-03T02:49:20Z)
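As referenced in the Pix2Video entry above, the following is a schematic sketch of its two-step anchor-and-propagate loop. The `pix2video_style_edit` function and its callables are hypothetical placeholders standing in for a structure-guided image diffusion model; the paper's actual implementation differs in detail.

```python
# Schematic sketch of a Pix2Video-style two-step edit: edit one anchor
# frame, then propagate the edit to the remaining frames. The callables
# are hypothetical stand-ins for a depth-conditioned diffusion model.
from typing import Any, Callable, List


def pix2video_style_edit(
    frames: List[Any],
    prompt: str,
    edit_anchor: Callable[[Any, str], Any],
    propagate: Callable[[Any, Any, str], Any],
) -> List[Any]:
    # Step 1: text-guided edit of a single anchor frame using a
    # structure-guided (e.g., depth-conditioned) image diffusion model.
    anchor = edit_anchor(frames[0], prompt)
    edited = [anchor]
    # Step 2: propagate the edit frame by frame; per the paper, each
    # frame's denoising attends to the anchor's self-attention features,
    # which keeps the edits temporally consistent.
    for frame in frames[1:]:
        edited.append(propagate(frame, anchor, prompt))
    return edited
```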
This list is automatically generated from the titles and abstracts of the papers on this site.