Related papers: Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning

URL: http://arxiv.org/abs/2507.01908v1
Date: Wed, 02 Jul 2025 17:22:21 GMT
Title: Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning
Authors: Qingdong He, Xueqin Chen, Chaoyi Wang, Yanjie Pan, Xiaobin Hu, Zhenye Gan, Yabiao Wang, Chengjie Wang, Xiangtai Li, Jiangning Zhang
Abstract summary: Reason50K is a large-scale dataset curated for training and evaluating hypothetical instruction reasoning image editing. ReasonBrain is a novel framework designed to reason over and execute implicit hypothetical instructions across diverse scenarios. Our dataset and code will be released publicly.
Score: 52.873405027439794
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Instruction-based image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and user intent. Additionally, current datasets provide limited support for training and evaluating reasoning-aware editing capabilities. Architecturally, these methods also lack mechanisms for fine-grained detail extraction that support such reasoning. To address these limitations, we propose Reason50K, a large-scale dataset specifically curated for training and evaluating hypothetical instruction reasoning image editing, along with ReasonBrain, a novel framework designed to reason over and execute implicit hypothetical instructions across diverse scenarios. Reason50K includes over 50K samples spanning four key reasoning scenarios: Physical, Temporal, Causal, and Story reasoning. ReasonBrain leverages Multimodal Large Language Models (MLLMs) for editing guidance generation and a diffusion model for image synthesis, incorporating a Fine-grained Reasoning Cue Extraction (FRCE) module to capture detailed visual and textual semantics essential for supporting instruction reasoning. To mitigate the semantic loss, we further introduce a Cross-Modal Enhancer (CME) that enables rich interactions between the fine-grained cues and MLLM-derived features. Extensive experiments demonstrate that ReasonBrain consistently outperforms state-of-the-art baselines on reasoning scenarios while exhibiting strong zero-shot generalization to conventional IIE tasks. Our dataset and code will be released publicly.

Related papers

ReasonEdit: Editing Vision-Language Models using Human Reasoning [11.662011379565795]
We propose ReasonEdit, the first vision-language model editor to let users explain their reasoning during editing. ReasonEdit stores human reasoning in a codebook, and retrieves only relevant facts during inference. We show that using human reasoning during editing greatly improves edit generalization.
arXiv Detail & Related papers (2026-02-02T18:06:14Z)
Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images [53.373427633330515]
We propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern.
arXiv Detail & Related papers (2025-12-19T07:44:43Z)
Decoupling Reasoning and Perception: An LLM-LMM Framework for Faithful Visual Reasoning [34.940968264459805]
We introduce a training-free visual-reasoning pipeline for Large Language Models (LLMs). A powerful LLM orchestrates the high-level reasoning, strategically interrogating an LMM to extract specific visual information required for its logical chain. Our framework effectively governs the visual reasoning process, leading to a significant reduction in visually unfounded reasoning steps and a substantial improvement in reasoning fidelity.
arXiv Detail & Related papers (2025-09-27T14:13:41Z)
Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning [78.17782197231325]
We propose a reasoning-guided reinforcement learning strategy that aligns the extractor's captioning behavior with the reasoning objective. Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance.
arXiv Detail & Related papers (2025-06-05T02:28:07Z)
Decoupled Visual Interpretation and Linguistic Reasoning for Math Problem Solving [57.22004912994658]
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs). This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework.
arXiv Detail & Related papers (2025-05-23T08:18:00Z)
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models [27.142703756752997]
We introduce MathIF, a benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability. We show that even simple interventions can partially recover obedience, though at the cost of reasoning performance.
arXiv Detail & Related papers (2025-05-20T18:18:01Z)
EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing [19.019168402650457]
Editing complex visual content from ambiguous or partially specified instructions remains a core challenge in vision-language modeling. We introduce the Editing Vision-Language Model (EVLM), a system that interprets ambiguous instructions in conjunction with reference visuals to produce precise, context-aware editing prompts.
arXiv Detail & Related papers (2024-12-13T21:15:01Z)
ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing [77.12834553200632]
We introduce ReasonPix2Pix, a comprehensive reasoning-attentive instruction editing dataset. The dataset is characterized by 1) reasoning instruction, 2) more realistic images from fine-grained categories, and 3) increased variances between input and edited images. When fine-tuned with our dataset under supervised conditions, the model demonstrates superior performance in instructional editing tasks, independent of whether the tasks require reasoning or not.
arXiv Detail & Related papers (2024-05-18T06:03:42Z)
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning [10.80288566599934]
HYDRA is a compositional visual reasoning framework for reliable and incrementally progressive general reasoning. Our framework demonstrates state-of-the-art performance in various VR tasks on four different widely-used datasets.
arXiv Detail & Related papers (2024-03-19T16:31:30Z)
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models [91.22477798288003]
This paper introduces SmartEdit, a novel approach to instruction-based image editing. It exploits Multimodal Large Language Models (MLLMs) to enhance their understanding and reasoning capabilities. We show that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions.
arXiv Detail & Related papers (2023-12-11T17:54:11Z)
Towards Counterfactual Image Manipulation via CLIP [106.94502632502194]
Existing methods can achieve realistic editing of different visual attributes such as age and gender of facial images. We investigate this problem in a text-driven manner with Contrastive-Language-Image-Pretraining (CLIP). We design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward desired directions from different perspectives.
arXiv Detail & Related papers (2022-07-06T17:02:25Z)
