Video4Edit: Viewing Image Editing as a Degenerate Temporal Process
- URL: http://arxiv.org/abs/2511.18131v1
- Date: Sat, 22 Nov 2025 17:30:55 GMT
- Title: Video4Edit: Viewing Image Editing as a Degenerate Temporal Process
- Authors: Xiaofan Li, Yanpeng Sun, Chenming Wu, Fan Duan, YuAn Wang, Weihao Bo, Yumeng Zhang, Dingkang Liang,
- Abstract summary: Recent advances in multimodal foundation models have propelled instruction-driven image generation and editing into a genuinely cross-modal, cooperative regime. We revisit this challenge through the lens of temporal modeling. This perspective allows us to transfer single-frame evolution priors from video pre-training, enabling a highly data-efficient fine-tuning regime.
- Score: 24.8621496006791
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We observe that recent advances in multimodal foundation models have propelled instruction-driven image generation and editing into a genuinely cross-modal, cooperative regime. Nevertheless, state-of-the-art editing pipelines remain costly: beyond training large diffusion/flow models, they require curating massive high-quality triplets of {instruction, source image, edited image} to cover diverse user intents. Moreover, the fidelity of visual replacements hinges on how precisely the instruction references the target semantics. We revisit this challenge through the lens of temporal modeling: if video can be regarded as a full temporal process, then image editing can be seen as a degenerate temporal process. This perspective allows us to transfer single-frame evolution priors from video pre-training, enabling a highly data-efficient fine-tuning regime. Empirically, our approach matches the performance of leading open-source baselines while using only about one percent of the supervision demanded by mainstream editing models.
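To make the framing concrete, here is a minimal sketch of how an {instruction, source image, edited image} triplet can be recast as a two-frame clip for a video diffusion backbone. The `ToyVideoDenoiser` class, the `edit_pair_loss` function, and the conditioning interface are hypothetical illustrations assumed for this sketch, not the paper's released code.

```python
# Minimal sketch of the "degenerate temporal process" idea, assuming a
# pretrained video diffusion backbone; everything below is a hypothetical
# placeholder, not the paper's actual implementation.
import torch
import torch.nn.functional as F


class ToyVideoDenoiser(torch.nn.Module):
    """Hypothetical stand-in for a pretrained video diffusion model that
    denoises clips shaped (batch, frames, channels, height, width)."""

    def __init__(self) -> None:
        super().__init__()
        self.net = torch.nn.Conv3d(3, 3, kernel_size=3, padding=1)

    def forward(self, clip: torch.Tensor, t: torch.Tensor,
                instr_emb: torch.Tensor) -> torch.Tensor:
        # A real backbone would condition on the timestep t and the
        # instruction embedding; this toy layer only shows tensor shapes.
        return self.net(clip.transpose(1, 2)).transpose(1, 2)


def edit_pair_loss(model, source, edited, instr_emb, alphas_cumprod):
    """Treat {source, edited} as a two-frame clip: frame 0 is the source,
    frame 1 is the edit, so editing degenerates to next-frame prediction."""
    clip = torch.stack([source, edited], dim=1)           # (B, 2, C, H, W)
    t = torch.randint(0, alphas_cumprod.numel(), (clip.size(0),))
    noise = torch.randn_like(clip)
    a = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    noisy = a.sqrt() * clip + (1.0 - a).sqrt() * noise
    # Keep the source frame clean so the model only has to "evolve" the
    # edited frame; one plausible way to realize the degenerate process.
    noisy[:, 0] = clip[:, 0]
    pred = model(noisy, t, instr_emb)
    return F.mse_loss(pred[:, 1], noise[:, 1])
```

Under this framing, the only task-specific supervision is the edit pair itself, which is consistent with the abstract's claim of using roughly one percent of the triplet data demanded by mainstream editing models.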
Related papers
- iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation [60.66986667921744]
iMontage is a unified framework designed to repurpose a powerful video model into an all-in-one image generator. We propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors.
arXiv Detail & Related papers (2025-11-25T18:54:16Z)
- EditInfinity: Image Editing with Binary-Quantized Generative Models [64.05135380710749]
We investigate the parameter-efficient adaptation of binary-quantized generative models for image editing. Specifically, we propose EditInfinity, which adapts Infinity, a binary-quantized generative model, for image editing. We propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation.
arXiv Detail & Related papers (2025-10-23T05:06:24Z)
- Training-Free Reward-Guided Image Editing via Trajectory Optimal Control [55.64204232819136]
We introduce a novel framework for training-free, reward-guided image editing. We demonstrate that our approach significantly outperforms existing inversion-based training-free baselines.
arXiv Detail & Related papers (2025-09-30T06:34:37Z)
- EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning [58.53074381801114]
We introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning. We present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions.
arXiv Detail & Related papers (2025-09-24T17:59:30Z)
- Inverse-and-Edit: Effective and Fast Image Editing by Cycle Consistency Models [1.9389881806157316]
In this work, we propose a novel framework that enhances image inversion using consistency models. Our method introduces a cycle-consistency optimization strategy that significantly improves reconstruction accuracy. We achieve state-of-the-art performance across various image editing tasks and datasets.
arXiv Detail & Related papers (2025-06-23T20:34:43Z)
- AttentionDrag: Exploiting Latent Correlation Knowledge in Pre-trained Diffusion Models for Image Editing [33.74477787349966]
We propose a novel one-step point-based image editing method, named AttentionDrag. This framework enables semantic consistency and high-quality manipulation without the need for extensive re-optimization or retraining. Our results demonstrate performance that surpasses most state-of-the-art methods at significantly faster speeds.
arXiv Detail & Related papers (2025-06-16T09:42:38Z)
- VINCIE: Unlocking In-context Image Editing from Video [62.88977098700917]
In this work, we explore whether an in-context image editing model can be learned directly from videos. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks. Our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks.
arXiv Detail & Related papers (2025-06-12T17:46:54Z)
- Pathways on the Image Manifold: Image Editing via Video Generation [11.891831122571995]
We reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. Our approach achieves state-of-the-art results on text-based image editing, demonstrating significant improvements in both edit accuracy and image preservation.
arXiv Detail & Related papers (2024-11-25T16:41:45Z)
- Pix2Video: Video Editing using Image Diffusion [43.07444438561277]
We investigate how to use pre-trained image models for text-guided video editing.
Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame; then, we progressively propagate the changes to the remaining frames via self-attention feature injection (a schematic sketch of this two-step loop appears after this list).
We demonstrate that realistic text-guided video edits are possible, without any compute-intensive preprocessing or video-specific finetuning.
arXiv Detail & Related papers (2023-03-22T16:36:10Z)
- Task-agnostic Temporally Consistent Facial Video Editing [84.62351915301795]
We propose a task-agnostic, temporally consistent facial video editing framework.
Based on a 3D reconstruction model, our framework is designed to handle several editing tasks in a more unified and disentangled manner.
Compared with the state-of-the-art facial image editing methods, our framework generates video portraits that are more photo-realistic and temporally smooth.
arXiv Detail & Related papers (2020-07-03T02:49:20Z)
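As referenced in the Pix2Video entry above, the following is a schematic sketch of its two-step anchor-and-propagate loop. The `pix2video_style_edit` function and its callables are hypothetical placeholders standing in for a structure-guided image diffusion model; the paper's actual implementation differs in detail.

```python
# Schematic sketch of a Pix2Video-style two-step edit: edit one anchor
# frame, then propagate the edit to the remaining frames. The callables
# are hypothetical stand-ins for a depth-conditioned diffusion model.
from typing import Any, Callable, List


def pix2video_style_edit(
    frames: List[Any],
    prompt: str,
    edit_anchor: Callable[[Any, str], Any],
    propagate: Callable[[Any, Any, str], Any],
) -> List[Any]:
    # Step 1: text-guided edit of a single anchor frame using a
    # structure-guided (e.g., depth-conditioned) image diffusion model.
    anchor = edit_anchor(frames[0], prompt)
    edited = [anchor]
    # Step 2: propagate the edit frame by frame; per the paper, each
    # frame's denoising attends to the anchor's self-attention features,
    # which keeps the edits temporally consistent.
    for frame in frames[1:]:
        edited.append(propagate(frame, anchor, prompt))
    return edited
```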
This list is automatically generated from the titles and abstracts of the papers on this site.