Region in Context: Text-conditioned Image Editing with Human-like Semantic Reasoning
- URL: http://arxiv.org/abs/2510.16772v1
- Date: Sun, 19 Oct 2025 09:36:02 GMT
- Title: Region in Context: Text-conditioned Image Editing with Human-like Semantic Reasoning
- Authors: Thuy Phuong Vu, Dinh-Cuong Hoang, Minhhuy Le, Phan Xuan Tan
- Abstract summary: Region in Context is a novel framework for text-conditioned image editing. It performs multilevel semantic alignment between vision and language. Our method encourages each region to understand its role within the global image context.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent research has made significant progress in localizing and editing image regions based on text. However, most approaches treat these regions in isolation, relying solely on local cues without accounting for how each part contributes to the overall visual and semantic composition. This often results in inconsistent edits, unnatural transitions, or loss of coherence across the image. In this work, we propose Region in Context, a novel framework for text-conditioned image editing that performs multilevel semantic alignment between vision and language, inspired by the human ability to reason about edits in relation to the whole scene. Our method encourages each region to understand its role within the global image context, enabling precise and harmonized changes. At its core, the framework introduces a dual-level guidance mechanism: regions are represented with full-image context and aligned with detailed region-level descriptions, while the entire image is simultaneously matched to a comprehensive scene-level description generated by a large vision-language model. These descriptions serve as explicit verbal references of the intended content, guiding both local modifications and global structure. Experiments show that it produces more coherent and instruction-aligned results. Code is available at: https://github.com/thuyvuphuong/Region-in-Context.git
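The dual-level guidance mechanism described in the abstract (region features aligned with region-level descriptions, plus the whole image matched to a scene-level description) can be sketched as a combined alignment objective over CLIP-style embeddings. The following is a minimal illustration, not the authors' implementation: the embedding vectors, the weighting parameter `lam`, and the function names are all hypothetical assumptions for the sketch.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors;
    # small epsilon guards against division by zero.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def dual_level_alignment_loss(region_embs, region_text_embs,
                              image_emb, scene_text_emb, lam=0.5):
    """Combine region-level and scene-level text-image alignment.

    region_embs      -- region feature vectors (extracted with global context)
    region_text_embs -- matching region-level description embeddings
    image_emb        -- whole-image embedding
    scene_text_emb   -- scene-level description embedding
    lam              -- weight of the global term (hypothetical hyperparameter)
    """
    # Local term: each region should match its own description.
    region_loss = np.mean([1.0 - cosine_sim(r, t)
                           for r, t in zip(region_embs, region_text_embs)])
    # Global term: the full image should match the scene description.
    global_loss = 1.0 - cosine_sim(image_emb, scene_text_emb)
    return region_loss + lam * global_loss
```

A guidance score of this shape is zero when every region and the full image are perfectly aligned with their descriptions, and grows as either local or global alignment degrades, which is the intuition behind editing each region "in context" rather than in isolation.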
Related papers
- Global-Local Aware Scene Text Editing [18.390088100986286]
Scene Text Editing (STE) involves replacing text in a scene image with new target text while preserving the original text style and background texture. Existing methods suffer from two major challenges: inconsistency and length-insensitivity. We propose an end-to-end framework called Global-Local Aware Scene Text Editing (GLASTE).
arXiv Detail & Related papers (2025-12-03T08:56:01Z)
- TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models [16.64400658301794]
TextRegion is a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities.
arXiv Detail & Related papers (2025-05-29T17:59:59Z)
- Text-Driven Image Editing via Learnable Regions [74.45313434129005]
We introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches.
We show that this simple approach enables flexible editing that is compatible with current image generation models.
Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions.
arXiv Detail & Related papers (2023-11-28T02:27:31Z)
- Region-Aware Diffusion for Zero-shot Text-driven Image Editing [78.58917623854079]
We propose a novel region-aware diffusion model (RDM) for entity-level image editing.
To strike a balance between image fidelity and inference speed, we design the intensive diffusion pipeline.
The results show that RDM outperforms the previous approaches in terms of visual quality, overall harmonization, non-editing region content preservation, and text-image semantic consistency.
arXiv Detail & Related papers (2023-02-23T06:20:29Z)
- Exploring Stroke-Level Modifications for Scene Text Editing [86.33216648792964]
Scene text editing (STE) aims to replace text with the desired one while preserving background and styles of the original text.
Previous methods of editing the whole image have to learn different translation rules of background and text regions simultaneously.
We propose a novel network by MOdifying Scene Text image at strokE Level (MOSTEL).
arXiv Detail & Related papers (2022-12-05T02:10:59Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation [97.36550187238177]
We study a novel task on text-guided image manipulation on the entity level in the real world.
The task imposes three basic requirements, (1) to edit the entity consistent with the text descriptions, (2) to preserve the text-irrelevant regions, and (3) to merge the manipulated entity into the image naturally.
Our framework incorporates a semantic alignment module to locate the image regions to be manipulated, and a semantic loss to help align the relationship between the vision and language.
arXiv Detail & Related papers (2022-04-09T09:01:19Z)
- Talk-to-Edit: Fine-Grained Facial Editing via Dialog [79.8726256912376]
Talk-to-Edit is an interactive facial editing framework that performs fine-grained attribute manipulation through dialog between the user and the system.
Our key insight is to model a continual "semantic field" in the GAN latent space.
Our system generates language feedback by considering both the user request and the current state of the semantic field.
arXiv Detail & Related papers (2021-09-09T17:17:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.