PerTouch: VLM-Driven Agent for Personalized and Semantic Image Retouching
- URL: http://arxiv.org/abs/2511.12998v1
- Date: Mon, 17 Nov 2025 05:39:15 GMT
- Title: PerTouch: VLM-Driven Agent for Personalized and Semantic Image Retouching
- Authors: Zewei Chang, Zheng-Peng Duan, Jianxing Zhang, Chun-Le Guo, Siyu Liu, Hyungju Chun, Hyunhee Park, Zikun Liu, Chongyi Li
- Abstract summary: We propose a unified diffusion-based image retouching framework called PerTouch. Our method supports semantic-level image retouching while maintaining global aesthetics. We develop a VLM-driven agent that can handle both strong and weak user instructions.
- Score: 54.3683137773426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image retouching aims to enhance visual quality while aligning with users' personalized aesthetic preferences. To address the challenge of balancing controllability and subjectivity, we propose a unified diffusion-based image retouching framework called PerTouch. Our method supports semantic-level image retouching while maintaining global aesthetics. Using parameter maps containing attribute values in specific semantic regions as input, PerTouch constructs an explicit parameter-to-image mapping for fine-grained image retouching. To improve semantic boundary perception, we introduce semantic replacement and parameter perturbation mechanisms in the training process. To connect natural language instructions with visual control, we develop a VLM-driven agent that can handle both strong and weak user instructions. Equipped with mechanisms of feedback-driven rethinking and scene-aware memory, PerTouch better aligns with user intent and captures long-term preferences. Extensive experiments demonstrate each component's effectiveness and the superior performance of PerTouch in personalized image retouching. Code is available at: https://github.com/Auroral703/PerTouch.
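As a rough illustration of the parameter-map input described in the abstract, the sketch below builds a per-region conditioning map from a semantic segmentation mask and per-region attribute values. The attribute names, value ranges, and channel layout are assumptions for illustration only and are not taken from the PerTouch code.

```python
# Minimal sketch (not the official PerTouch implementation): construct a
# per-region parameter map that could condition a diffusion-based retoucher.
# Attribute set, value range [-1, 1], and one-channel-per-attribute layout
# are illustrative assumptions.
import numpy as np

ATTRIBUTES = ["exposure", "contrast", "saturation"]  # assumed attribute set

def build_parameter_map(seg_mask: np.ndarray,
                        region_params: dict,
                        default: float = 0.0) -> np.ndarray:
    """seg_mask: (H, W) integer label map from any semantic segmenter.
    region_params: {label: {attribute: value in [-1, 1]}}.
    Returns a (len(ATTRIBUTES), H, W) map with one channel per attribute."""
    h, w = seg_mask.shape
    pmap = np.full((len(ATTRIBUTES), h, w), default, dtype=np.float32)
    for label, params in region_params.items():
        region = seg_mask == label
        for c, attr in enumerate(ATTRIBUTES):
            pmap[c][region] = params.get(attr, default)
    return pmap

# Example: brighten the "sky" region (label 1), desaturate the "person" (label 2).
mask = np.zeros((256, 256), dtype=np.int64)
mask[:128] = 1   # top half: sky
mask[128:] = 2   # bottom half: person
cond = build_parameter_map(mask, {1: {"exposure": 0.5}, 2: {"saturation": -0.3}})
print(cond.shape)  # (3, 256, 256)
```

A map like this can be concatenated with the diffusion model's input so that each semantic region carries its own target attribute values, giving an explicit parameter-to-image mapping.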
Related papers
- BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling [29.77085426345252]
Face retouching requires removing subtle imperfections while preserving unique facial identity features, in order to enhance overall aesthetic appeal. Existing methods suffer from a fundamental trade-off: supervised learning on labeled data is constrained to pixel-level label mimicry, failing to capture complex subjective human aesthetic preferences. We propose BeautyGRPO, a reinforcement learning framework that aligns face retouching with human aesthetic preferences.
arXiv Detail & Related papers (2026-03-01T15:59:31Z) - ProxyImg: Towards Highly-Controllable Image Representation via Hierarchical Disentangled Proxy Embedding [44.20713526887855]
We propose a hierarchical proxy-based parametric image representation that disentangles semantic, geometric, and textural attributes into independent parameter spaces. Our method achieves state-of-the-art rendering fidelity with significantly fewer parameters, while enabling intuitive, interactive, and physically plausible manipulation.
arXiv Detail & Related papers (2026-02-02T09:53:45Z) - RetouchLLM: Training-free Code-based Image Retouching with Vision Language Models [76.79706360982162]
We propose RetouchLLM, a training-free white-box image retouching system. It performs interpretable, code-based retouching directly on high-resolution images. Our framework progressively enhances the image in a manner similar to how humans perform multi-step retouching; a minimal sketch of this idea follows.
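The sketch below illustrates the general code-based, training-free idea under stated assumptions: a vision-language model is assumed to emit a small JSON plan of named operations, which is then executed by a whitelist of simple image adjustments. The plan schema and operation names are illustrative, not RetouchLLM's actual interface.

```python
# Minimal sketch of code-based, multi-step retouching (illustrative only).
import json
import numpy as np

def adjust_exposure(img, value):   # img in [0, 1]; value in EV-like units
    return np.clip(img * (2.0 ** value), 0.0, 1.0)

def adjust_contrast(img, value):   # value > 1 increases contrast around 0.5
    return np.clip((img - 0.5) * value + 0.5, 0.0, 1.0)

OPS = {"exposure": adjust_exposure, "contrast": adjust_contrast}

def apply_plan(img: np.ndarray, plan_json: str) -> np.ndarray:
    """Apply a multi-step retouching plan, one interpretable step at a time."""
    for step in json.loads(plan_json):
        img = OPS[step["op"]](img, step["value"])
    return img

# A plan a VLM might emit for "make it brighter and punchier" (hypothetical):
plan = '[{"op": "exposure", "value": 0.4}, {"op": "contrast", "value": 1.15}]'
out = apply_plan(np.random.rand(512, 512, 3).astype(np.float32), plan)
```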
arXiv Detail & Related papers (2025-10-09T10:40:49Z) - ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning [76.2503352325492]
ControlThinker is a novel framework that employs a "comprehend-then-generate" paradigm. Latent semantics from control images are mined to enrich text prompts. This enriched semantic understanding then seamlessly aids image generation without the need for additional complex modifications.
arXiv Detail & Related papers (2025-06-04T05:56:19Z) - DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition [69.10628479553709]
We introduce DRC, a novel personalized image generation framework that enhances Large Multimodal Models (LMMs). DRC explicitly extracts user style preferences and semantic intentions from history images and the reference image, respectively. It involves two critical learning stages: 1) disentanglement learning, which employs a dual-tower disentangler to explicitly separate style and semantic features, optimized via a reconstruction-driven paradigm with difficulty-aware importance sampling; and 2) personalized modeling, which applies semantic-preserving augmentations to effectively adapt the disentangled representations for robust personalized generation.
arXiv Detail & Related papers (2025-04-24T08:10:10Z) - SPF-Portrait: Towards Pure Text-to-Portrait Customization with Semantic Pollution-Free Fine-Tuning [33.709835660394305]
SPF-Portrait is a pioneering work that purely understands customized target semantics while minimizing disruption to the original model. In SPF-Portrait, we design a dual-path contrastive learning pipeline, which introduces the original model as a behavioral alignment reference. It adaptively balances the behavioral alignment across different regions and the responsiveness of the target semantics.
arXiv Detail & Related papers (2025-04-01T03:37:30Z) - DiffRetouch: Using Diffusion to Retouch on the Shoulder of Experts [45.730449182899754]
We propose a diffusion-based retouching method named DiffRetouch.
Four image attributes are made adjustable to provide a user-friendly editing mechanism.
An affine bilateral grid and a contrastive learning scheme are introduced to handle texture distortion and control insensitivity, respectively; the bilateral-grid idea is sketched below.
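For context, the sketch below shows the generic affine bilateral grid technique (HDRNet-style slicing), not DiffRetouch's exact formulation: a coarse grid stores one 3x4 affine color transform per spatial cell and luminance bin, and the grid is sliced at full resolution using the input luminance as guidance, which keeps adjustments edge-aware and texture-preserving.

```python
# Minimal sketch of slicing and applying an affine bilateral grid (generic idea).
import numpy as np

def slice_and_apply(grid: np.ndarray, img: np.ndarray) -> np.ndarray:
    """grid: (GH, GW, GL, 3, 4) affine transforms; img: (H, W, 3) in [0, 1]."""
    gh, gw, gl = grid.shape[:3]
    h, w = img.shape[:2]
    lum = img.mean(axis=-1)                                    # guidance: luminance
    # Nearest-neighbour lookup keeps the sketch short; real systems trilinearly
    # interpolate for smooth, edge-aware behaviour.
    yi = np.clip(np.arange(h) * gh // h, 0, gh - 1)
    xi = np.clip(np.arange(w) * gw // w, 0, gw - 1)
    li = np.clip((lum * gl).astype(int), 0, gl - 1)
    A = grid[yi[:, None], xi[None, :], li]                     # (H, W, 3, 4)
    rgb1 = np.concatenate([img, np.ones((h, w, 1))], axis=-1)  # homogeneous colors
    out = np.einsum('hwij,hwj->hwi', A, rgb1)
    return np.clip(out, 0.0, 1.0)

# Identity grid: the output equals the input regardless of where we slice.
grid = np.tile(np.eye(3, 4, dtype=np.float32), (16, 16, 8, 1, 1))
img = np.random.rand(256, 256, 3).astype(np.float32)
assert np.allclose(slice_and_apply(grid, img), img, atol=1e-6)
```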
arXiv Detail & Related papers (2024-07-04T09:09:42Z) - Enjoy Your Editing: Controllable GANs for Image Editing via Latent Space Navigation [136.53288628437355]
Controllable semantic image editing enables a user to change entire image attributes with a few clicks.
Current approaches often suffer from attribute edits that are entangled, global image identity changes, and diminished photo-realism.
We propose quantitative evaluation strategies for measuring controllable editing performance, unlike prior work which primarily focuses on qualitative evaluation.
arXiv Detail & Related papers (2021-02-01T21:38:36Z) - Controllable Image Synthesis via SegVAE [89.04391680233493]
A semantic map is a commonly used intermediate representation for conditional image generation.
In this work, we specifically target generating semantic maps given a label-set consisting of desired categories.
The proposed framework, SegVAE, synthesizes semantic maps in an iterative manner using a conditional variational autoencoder.
arXiv Detail & Related papers (2020-07-16T15:18:53Z)