Related papers: JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent

URL: http://arxiv.org/abs/2506.17612v1
Date: Sat, 21 Jun 2025 06:36:00 GMT
Title: JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent
Authors: Yunlong Lin, Zixu Lin, Kunjie Lin, Jinbin Bai, Panwang Pan, Chenxin Li, Haoyu Chen, Zhongdao Wang, Xinghao Ding, Wenbo Li, Shuicheng Yan
Abstract summary: Photo retouching has become integral to contemporary visual storytelling, enabling users to capture aesthetics and express creativity. We introduce JarvisArt, a multi-modal large language model (MLLM)-driven agent that understands user intent, mimics the reasoning process of professional artists, and intelligently coordinates over 200 retouching tools within Lightroom. To evaluate performance, we develop MMArt-Bench, a novel benchmark constructed from real-world user edits. JarvisArt outperforms GPT-4o with a 60% improvement in average pixel-level metrics on MMArt-Bench for content fidelity, while maintaining comparable instruction-following capabilities.
Score: 74.64342043677975
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Photo retouching has become integral to contemporary visual storytelling, enabling users to capture aesthetics and express creativity. While professional tools such as Adobe Lightroom offer powerful capabilities, they demand substantial expertise and manual effort. In contrast, existing AI-based solutions provide automation but often suffer from limited adjustability and poor generalization, failing to meet diverse and personalized editing needs. To bridge this gap, we introduce JarvisArt, a multi-modal large language model (MLLM)-driven agent that understands user intent, mimics the reasoning process of professional artists, and intelligently coordinates over 200 retouching tools within Lightroom. JarvisArt undergoes a two-stage training process: an initial Chain-of-Thought supervised fine-tuning to establish basic reasoning and tool-use skills, followed by Group Relative Policy Optimization for Retouching (GRPO-R) to further enhance its decision-making and tool proficiency. We also propose the Agent-to-Lightroom Protocol to facilitate seamless integration with Lightroom. To evaluate performance, we develop MMArt-Bench, a novel benchmark constructed from real-world user edits. JarvisArt demonstrates user-friendly interaction, superior generalization, and fine-grained control over both global and local adjustments, paving a new avenue for intelligent photo retouching. Notably, it outperforms GPT-4o with a 60% improvement in average pixel-level metrics on MMArt-Bench for content fidelity, while maintaining comparable instruction-following capabilities. Project Page: https://jarvisart.vercel.app/.

Related papers

LensCraft: Your Professional Virtual Cinematographer [12.512681517449868]
Digital creators, from indie filmmakers to animation studios, face a persistent bottleneck: translating their creative vision into precise camera movements. LensCraft solves this problem by mimicking the expertise of a professional cinematographer, using a data-driven approach. LensCraft achieves markedly lower computational complexity and faster inference while maintaining high output quality.
arXiv Detail & Related papers (2025-06-01T12:43:55Z)
PhotoArtAgent: Intelligent Photo Retouching with Language Model-Based Artist Agents [28.44728600512551]
PhotoArtAgent is an intelligent interpretative system that emulates the creative process of a professional artist. PhotoArtAgent provides transparent, text-based explanations of its creative rationale, fostering meaningful interaction and user control. Experimental results show that PhotoArtAgent not only surpasses existing automated tools in user studies but also achieves results comparable to those of professional human artists.
arXiv Detail & Related papers (2025-05-29T06:00:51Z)
MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills [37.48977077142813]
We show that a multimodal large language model (MLLM) can be taught to critique raw photographs. We demonstrate that MLLMs can be first made aware of the underlying image processing operations. We then synthesize a reasoning dataset by procedurally manipulating expert-edited photos.
arXiv Detail & Related papers (2025-05-09T16:38:27Z)
Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment [55.74860093731475]
Marmot is a novel framework that employs Multi-Agent Reasoning for Multi-Object Self-Correcting. We construct a multi-agent self-correcting system featuring a decision-execution-verification mechanism. Experiments demonstrate that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships.
arXiv Detail & Related papers (2025-04-10T16:54:28Z)
WorldCraft: Photo-Realistic 3D World Creation and Customization via LLM Agents [67.31920821192323]
We introduce WorldCraft, a system where large language model (LLM) agents leverage procedural generation to create scenes populated with objects. In our framework, a coordinator agent manages the overall process and works with two specialized LLM agents to complete the scene creation. Our pipeline incorporates a trajectory control agent, allowing users to animate the scene and operate the camera through natural language interactions.
arXiv Detail & Related papers (2025-02-21T17:18:30Z)
INRetouch: Context Aware Implicit Neural Representation for Photography Retouching [54.17599183365242]
We propose a novel retouch transfer approach that learns from professional edits through before-after image pairs. We develop a context-aware Implicit Neural Representation that learns to apply edits adaptively based on image content and context. Our method extracts implicit transformations from reference edits and adaptively applies them to new images.
arXiv Detail & Related papers (2024-12-05T03:31:48Z)
SPIRE: Semantic Prompt-Driven Image Restoration [66.26165625929747]
We develop SPIRE, a Semantic and restoration Prompt-driven Image Restoration framework. Our approach is the first framework that supports fine-level instruction through language-based quantitative specification of the restoration strength. Our experiments demonstrate the superior restoration performance of SPIRE compared to the state of the art.
arXiv Detail & Related papers (2023-12-18T17:02:30Z)
LightPainter: Interactive Portrait Relighting with Freehand Scribble [79.95574780974103]
We introduce LightPainter, a scribble-based relighting system that allows users to interactively manipulate portrait lighting effects with ease. To train the relighting module, we propose a novel scribble simulation procedure to mimic real user scribbles. We demonstrate high-quality and flexible portrait lighting editing capability with both quantitative and qualitative experiments.
arXiv Detail & Related papers (2023-03-22T23:17:11Z)
NICER: Aesthetic Image Enhancement with Humans in the Loop [0.7756211500979312]
This work proposes a neural network-based approach to no-reference image enhancement in a fully automatic, semi-automatic, or fully manual process. We show that NICER can improve image aesthetics without user interaction and that allowing user interaction leads to diverse enhancement outcomes.
arXiv Detail & Related papers (2020-12-03T09:14:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.