PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning
- URL: http://arxiv.org/abs/2602.22809v2
- Date: Sun, 01 Mar 2026 10:54:03 GMT
- Title: PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning
- Authors: Mingde Yao, Zhiyuan You, King-Man Tam, Menglu Wang, Tianfan Xue,
- Abstract summary: PhotoAgent is a system that advances image editing through explicit aesthetic planning.<n>It reasons over user aesthetic intent, plans multi-step editing actions via tree search, and iteratively refines results through closed-loop execution.<n>In experiments, PhotoAgent consistently improves both instruction adherence and visual quality compared with baseline methods.
- Score: 26.368648607025676
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the recent fast development of generative models, instruction-based image editing has shown great potential in generating high-quality images. However, the quality of editing highly depends on carefully designed instructions, placing the burden of task decomposition and sequencing entirely on the user. To achieve autonomous image editing, we present PhotoAgent, a system that advances image editing through explicit aesthetic planning. Specifically, PhotoAgent formulates autonomous image editing as a long-horizon decision-making problem. It reasons over user aesthetic intent, plans multi-step editing actions via tree search, and iteratively refines results through closed-loop execution with memory and visual feedback, without requiring step-by-step user prompts. To support reliable evaluation in real-world scenarios, we introduce UGC-Edit, an aesthetic evaluation benchmark consisting of 7,000 photos and a learned aesthetic reward model. We also construct a test set containing 1,017 photos to systematically assess autonomous photo editing performance. Extensive experiments demonstrate that PhotoAgent consistently improves both instruction adherence and visual quality compared with baseline methods. The project page is https://mdyao.github.io/PhotoAgent/.
Related papers
- Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling [69.36546486569146]
Agent Banana is a hierarchical agentic planner-executor framework for high-fidelity, object-aware, deliberative editing.<n> Context Folding compresses long interaction histories into structured memory for stable long-horizon control.<n>Image Layer Decomposition performs localized layer-based edits to preserve non-target regions.
arXiv Detail & Related papers (2026-02-09T18:59:18Z) - How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing [56.60465182650588]
We introduce three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning.<n>We propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment.<n>We find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models.
arXiv Detail & Related papers (2026-02-02T09:24:45Z) - EditThinker: Unlocking Iterative Reasoning for Any Image Editor [72.28251670314451]
We propose a deliberative editing framework to 'think' while they edit.<n>We train a single MLLM, EditThinker, to act as the reasoning engine of this framework.<n>We employ reinforcement learning to align the EditThinker's thinking with its editing, thereby generating more targeted instruction improvements.
arXiv Detail & Related papers (2025-12-05T18:58:09Z) - Image Editing As Programs with Diffusion Models [69.05164729625052]
We introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture.<n>IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations.<n>Our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions.
arXiv Detail & Related papers (2025-06-04T16:57:24Z) - GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing [60.66800567924348]
We introduce a new benchmark designed to evaluate text-guided image editing models.<n>The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories.<n>We conduct a large-scale study comparing GPT-Image-1 against several state-of-the-art editing models.
arXiv Detail & Related papers (2025-05-16T17:55:54Z) - SPICE: A Synergistic, Precise, Iterative, and Customizable Image Editing Workflow [13.815228931600236]
We introduce SPICE, a training-free workflow that accepts arbitrary resolutions and aspect ratios, accurately follows user requirements, and consistently improves image quality during more than 100 editing steps.<n>On a challenging realistic image-editing dataset, SPICE quantitatively outperforms state-of-the-art baselines and is consistently preferred by human annotators.
arXiv Detail & Related papers (2025-04-13T19:13:04Z) - INRetouch: Context Aware Implicit Neural Representation for Photography Retouching [54.17599183365242]
We propose a novel retouch transfer approach that learns from professional edits through before-after image pairs.<n>We develop a context-aware Implicit Neural Representation that learns to apply edits adaptively based on image content and context.<n>Our method extracts implicit transformations from reference edits and adaptively applies them to new images.
arXiv Detail & Related papers (2024-12-05T03:31:48Z) - PixLens: A Novel Framework for Disentangled Evaluation in Diffusion-Based Image Editing with Object Detection + SAM [17.89238060470998]
evaluating diffusion-based image-editing models is a crucial task in the field of Generative AI.
Our benchmark, PixLens, provides a comprehensive evaluation of both edit quality and latent representation disentanglement.
arXiv Detail & Related papers (2024-10-08T06:05:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.