Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling
- URL: http://arxiv.org/abs/2602.09084v1
- Date: Mon, 09 Feb 2026 18:59:18 GMT
- Title: Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling
- Authors: Ruijie Ye, Jiayi Zhang, Zhuoxin Liu, Zihao Zhu, Siyuan Yang, Li Li, Tianfu Fu, Franck Dernoncourt, Yue Zhao, Jiacheng Zhu, Ryan Rossi, Wenhao Chai, Zhengzhong Tu
- Abstract summary: Agent Banana is a hierarchical agentic planner-executor framework for high-fidelity, object-aware, deliberative editing. Context Folding compresses long interaction histories into structured memory for stable long-horizon control. Image Layer Decomposition performs localized layer-based edits to preserve non-target regions.
- Score: 69.36546486569146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study instruction-based image editing under professional workflows and identify three persistent challenges: (i) editors often over-edit, modifying content beyond the user's intent; (ii) existing models are largely single-turn, while multi-turn edits can alter object faithfulness; and (iii) evaluation at around 1K resolution is misaligned with real workflows that often operate on ultra high-definition images (e.g., 4K). We propose Agent Banana, a hierarchical agentic planner-executor framework for high-fidelity, object-aware, deliberative editing. Agent Banana introduces two key mechanisms: (1) Context Folding, which compresses long interaction histories into structured memory for stable long-horizon control; and (2) Image Layer Decomposition, which performs localized layer-based edits to preserve non-target regions while enabling native-resolution outputs. To support rigorous evaluation, we build HDD-Bench, a high-definition, dialogue-based benchmark featuring verifiable stepwise targets and native 4K images (11.8M pixels) for diagnosing long-horizon failures. On HDD-Bench, Agent Banana achieves the best multi-turn consistency and background fidelity (e.g., IC 0.871, SSIM-OM 0.84, LPIPS-OM 0.12) while remaining competitive on instruction following, and also attains strong performance on standard single-turn editing benchmarks. We hope this work advances reliable, professional-grade agentic image editing and its integration into real workflows.
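The abstract's idea of localized, layer-based edits that preserve non-target regions can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `composite_edit` and `changed_outside_mask`, the boolean edit mask, and the toy 8x8 image are all assumptions made for illustration. The sketch only shows the general principle that edited pixels are written back inside a target mask while every pixel outside the mask is copied from the original at native resolution.

```python
import numpy as np


def composite_edit(original: np.ndarray, edited: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Take edited pixels inside the mask and original pixels everywhere else.

    Hypothetical helper: `mask` is an HxW array, True where the edit applies.
    """
    if original.shape != edited.shape:
        raise ValueError("original and edited must have the same shape")
    if mask.shape != original.shape[:2]:
        raise ValueError("mask must match the image height and width")
    mask = mask.astype(bool)
    # Broadcast the HxW mask over the channel axis.
    return np.where(mask[..., None], edited, original)


def changed_outside_mask(original: np.ndarray, result: np.ndarray, mask: np.ndarray) -> int:
    """Count pixels outside the mask whose value differs from the original."""
    differs = np.any(original != result, axis=-1)
    return int(np.count_nonzero(differs & ~mask.astype(bool)))


# Toy data standing in for a real image and a real editor output.
rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
edited = 255 - original  # pretend the editor changed every pixel
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:6] = True  # the only region the user asked to edit

result = composite_edit(original, edited, mask)
print(changed_outside_mask(original, result, mask))  # prints 0
```

Because the background is copied rather than regenerated, any outside-mask fidelity measure computed on `result` against `original` is perfect by construction in this toy setting; a real system would also need to handle soft mask boundaries and blending, which this sketch omits.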
Related papers
- PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning [26.368648607025676]
  PhotoAgent is a system that advances image editing through explicit aesthetic planning. It reasons over user aesthetic intent, plans multi-step editing actions via tree search, and iteratively refines results through closed-loop execution. In experiments, PhotoAgent consistently improves both instruction adherence and visual quality compared with baseline methods.
  Date: 2026-02-26T09:46:06Z
- MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing [67.28337411397062]
  We introduce the Multi-Layer Document Editing Agent (MiLDEAgent). MiLDEAgent is a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and attaining performance comparable to closed-source models.
  Date: 2026-01-08T04:38:07Z
- I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing [59.434028565445885]
  I2E is a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions. I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
  Date: 2026-01-07T09:29:57Z
- I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models [78.62380562116135]
  Existing image editing benchmarks suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations. We propose I2I-Bench, a comprehensive benchmark for image-to-image editing models, which features 10 task categories across both single-image and multi-image editing tasks. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions.
  Date: 2025-12-04T10:44:07Z
- An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing [5.192553173010677]
  RefineEdit-Agent is a novel, training-free intelligent agent framework for complex, iterative, and context-aware image editing. Our framework comprises an LVI-driven instruction and scene understanding module, a multi-level editing planner, an iterative image editing module, and a crucial LVLM-driven feedback and evaluation loop.
  Date: 2025-08-24T16:28:18Z
- Image Editing As Programs with Diffusion Models [69.05164729625052]
  We introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions.
  Date: 2025-06-04T16:57:24Z
- Marmot: Object-Level Self-Correction via Multi-Agent Reasoning [55.74860093731475]
  Marmot is a novel and generalizable framework that leverages Multi-Agent Reasoning for Multi-Object Self-Correcting. Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships for image generation tasks.
  Date: 2025-04-10T16:54:28Z
- GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing [60.09562648953926]
  GenArtist is a unified image generation and editing system coordinated by a multimodal large language model (MLLM) agent. We integrate a comprehensive range of existing models into the tool library and utilize the agent for tool selection and execution. Experiments demonstrate that GenArtist can perform various generation and editing tasks, achieving state-of-the-art performance.
  Date: 2024-07-08T04:30:53Z
This list is automatically generated from the titles and abstracts of the papers in this site.