From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors
- URL: http://arxiv.org/abs/2602.21778v2
- Date: Fri, 27 Feb 2026 13:42:32 GMT
- Title: From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors
- Authors: Liangbing Zhao, Le Zhuo, Sayak Paul, Hongsheng Li, Mohamed Elhoseiny
- Abstract summary: We introduce PhysicEdit, an end-to-end framework equipped with a textual-visual dual-thinking mechanism. Experiments show that PhysicEdit improves over Qwen-Image-Edit by 5.9% in physical realism and 10.1% in knowledge-grounded editing.
- Score: 62.96515611323478
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-based image editing has achieved remarkable success in semantic alignment, yet state-of-the-art models frequently fail to render physically plausible results when editing involves complex causal dynamics, such as refraction or material deformation. We attribute this limitation to the dominant paradigm that treats editing as a discrete mapping between image pairs, which provides only boundary conditions and leaves transition dynamics underspecified. To address this, we reformulate physics-aware editing as predictive physical state transitions and introduce PhysicTran38K, a large-scale video-based dataset comprising 38K transition trajectories across five physical domains, constructed via a two-stage filtering and constraint-aware annotation pipeline. Building on this supervision, we propose PhysicEdit, an end-to-end framework equipped with a textual-visual dual-thinking mechanism. It combines a frozen Qwen2.5-VL for physically grounded reasoning with learnable transition queries that provide timestep-adaptive visual guidance to a diffusion backbone. Experiments show that PhysicEdit improves over Qwen-Image-Edit by 5.9% in physical realism and 10.1% in knowledge-grounded editing, setting a new state-of-the-art for open-source methods, while remaining competitive with leading proprietary models.
Related papers
- InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models [17.680767010203308]
We introduce InEdit-Bench, the first evaluation benchmark dedicated to reasoning over intermediate pathways in image editing. InEdit-Bench comprises meticulously annotated test cases covering four fundamental task categories: state transition, dynamic process, temporal sequence, and scientific simulation. Our comprehensive evaluation of 14 representative image editing models on InEdit-Bench reveals significant and widespread shortcomings in this domain.
arXiv Detail & Related papers (2026-03-04T02:24:43Z)
- ChordEdit: One-Step Low-Energy Transport for Image Editing [8.517302920663932]
ChordEdit is a model-agnostic, training-free, and inversion-free method that facilitates high-fidelity one-step editing. We recast editing as a transport problem between the source and target distributions defined by the source and target text prompts. A theoretically grounded and experimentally validated approach allows ChordEdit to deliver fast, lightweight, and precise edits.
arXiv Detail & Related papers (2026-02-22T07:40:50Z)
- I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing [59.434028565445885]
I2E is a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions. I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
arXiv Detail & Related papers (2026-01-07T09:29:57Z)
- MotionEdit: Benchmarking and Learning Motion-Centric Image Editing [81.28392925790568]
We introduce MotionEdit, a novel dataset for motion-centric image editing. MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted from continuous videos. We propose MotionNFT to compute motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion.
arXiv Detail & Related papers (2025-12-11T04:53:58Z)
- Are Image-to-Video Models Good Zero-Shot Image Editors? [39.10187156757937]
We introduce IF-Edit, a tuning-free framework that repurposes pretrained image-to-video diffusion models for instruction-driven image editing. IF-Edit addresses three key challenges: prompt misalignment, redundant temporal latents, and blurry late-stage frames.
arXiv Detail & Related papers (2025-11-24T18:59:54Z)
- Training-free Geometric Image Editing on Diffusion Models [53.38549950608886]
We tackle the task of geometric image editing, where an object within an image is repositioned, reoriented, or reshaped. We propose a decoupled pipeline that separates object transformation, source region inpainting, and target region refinement. Both inpainting and refinement are implemented using a training-free diffusion approach, FreeFine.
arXiv Detail & Related papers (2025-07-31T07:36:00Z)
- Follow Your Motion: A Generic Temporal Consistency Portrait Editing Framework with Trajectory Guidance [27.1886214162329]
Follow Your Motion is a generic framework for maintaining temporal consistency in portrait editing. To maintain fine-grained expression temporal consistency in talking head editing, we propose a dynamic re-weighted attention mechanism.
arXiv Detail & Related papers (2025-03-28T08:18:05Z)
- Stable Flow: Vital Layers for Training-Free Image Editing [74.52248787189302]
Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT). We propose an automatic method to identify "vital layers" within DiT, crucial for image formation. Next, to enable real-image editing, we introduce an improved image inversion method for flow models.
arXiv Detail & Related papers (2024-11-21T18:59:51Z)
- HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness [57.18183962641015]
We present HOI-Swap, a video editing framework trained in a self-supervised manner. The first stage focuses on object swapping in a single frame with HOI awareness. The second stage extends the single-frame edit across the entire sequence.
arXiv Detail & Related papers (2024-06-11T22:31:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.