Related papers: Exploring Text-Guided Single Image Editing for Remote Sensing Images

Related papers

Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling [54.54513714247062]
Recent advancements in unified image generation models, such as OmniGen, have enabled the handling of diverse image generation and editing tasks within a single framework.<n>We found that it suffers from text instruction neglect, especially when the text instruction contains multiple sub-instructions.<n>We propose Self-Adaptive Attention Scaling to dynamically scale the attention activation for each sub-instruction.
arXiv Detail & Related papers (2025-07-22T05:25:38Z)
Training-Free Text-Guided Image Editing with Visual Autoregressive Model [46.201510044410995]
We propose a novel text-guided image editing framework based on Visual AutoRegressive modeling. Our method eliminates the need for explicit inversion while ensuring precise and controlled modifications. Our framework operates in a training-free manner and achieves high-fidelity editing with faster inference speeds.
arXiv Detail & Related papers (2025-03-31T09:46:56Z)
DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics [71.78350994830885]
We present a novel approach to improving text-guided image editing using diffusion-based models. Our method uses visual and textual self-attention to enhance the cross-attention map, which can serve as a regional cues to improve editing performance. To fully compare our methods with other DiT-based approaches, we construct the RW-800 benchmark, featuring high resolution images, long descriptive texts, real-world images, and a new text editing task.
arXiv Detail & Related papers (2025-03-21T02:14:03Z)
Data-Efficient Generalization for Zero-shot Composed Image Retrieval [67.46975191141928]
ZS-CIR aims to retrieve the target image based on a reference image and a text description without requiring in-distribution triplets for training. One prevalent approach follows the vision-language pretraining paradigm that employs a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space. We propose a Data-efficient Generalization (DeG) framework, including two novel designs, namely, Textual Supplement (TS) module and Semantic-Set (S-Set)
arXiv Detail & Related papers (2025-03-07T07:49:31Z)
UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency [69.33072075580483]
We propose an unsupervised model for instruction-based image editing that eliminates the need for ground-truth edited images during training. Our method addresses these challenges by introducing a novel editing mechanism called Cycle Edit Consistency ( CEC) CEC applies forward and backward edits in one training step and enforces consistency in image and attention spaces.
arXiv Detail & Related papers (2024-12-19T18:59:58Z)
HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads [39.94688771600168]
Head is a training-free image editing framework that edits the source image by adaptively routing the text guidance to different attention heads in MM-DiTs. We present a dual-token refinement module to refine text/image token representations for precise semantic guidance and accurate region expression.
arXiv Detail & Related papers (2024-11-22T16:08:03Z)
Multi-task SAR Image Processing via GAN-based Unsupervised Manipulation [6.154796320245652]
Generative Adversarial Networks (GANs) have shown tremendous potential in synthesizing a large number of realistic SAR images. This paper proposes a novel SAR image processing framework called GAN-based Unsupervised Editing (GUE)
arXiv Detail & Related papers (2024-08-02T19:49:30Z)
Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts. We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z)
Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval [43.47770490199544]
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query, which is configured with an image and a caption. We introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations. We also introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed.
arXiv Detail & Related papers (2024-05-01T15:19:54Z)
Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification. Current techniques rely on supervised learning for CIR models using labeled triplets of the reference image, text, target image. We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training [61.984277261016146]
We propose a CustomNeRF model that unifies a text description or a reference image as the editing prompt. To tackle the first challenge, we propose a Local-Global Iterative Editing (LGIE) training scheme that alternates between foreground region editing and full-image editing. For the second challenge, we also design a class-guided regularization that exploits class priors within the generation model to alleviate the inconsistency problem.
arXiv Detail & Related papers (2023-12-04T06:25:06Z)
Recognition-Guided Diffusion Model for Scene Text Image Super-Resolution [15.391125077873745]
Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and legibility of text within low-resolution (LR) images. Previous methods predominantly employ discriminative Convolutional Neural Networks (CNNs) augmented with diverse forms of text guidance. We introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image Super-Resolution, which exhibits great generative diversity and fidelity even in challenging scenarios.
arXiv Detail & Related papers (2023-11-22T11:10:45Z)
Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval [14.986283867293048]
Zero-shot composed image retrieval (ZS-CIR) takes a textual modification and a reference image as a query to retrieve a target image without triplet labeling.<n>Current ZS-CIR research mainly relies on the generalization ability of pre-trained vision-language models.<n>We introduce a novel unlabeled and pre-trained masked tuning approach, which reduces the gap between the pre-trained vision-language model and the downstream CIR task.
arXiv Detail & Related papers (2023-11-13T02:49:57Z)
Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database. Recent research sidesteps this need by using large-scale vision-language models (VLMs) We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL)
arXiv Detail & Related papers (2023-10-13T17:59:38Z)
iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing. It generates images conditioned on a source image and a textual edit prompt. It shows favourable results against its counterparts in terms of image fidelity, CLIP alignment score and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z)
ResiDualGAN: Resize-Residual DualGAN for Cross-Domain Remote Sensing Images Semantic Segmentation [15.177834801688979]
The performance of a semantic segmentation model for remote sensing (RS) images pretrained on an annotated dataset would greatly decrease when testing on another unannotated dataset because of the domain gap. Adversarial generative methods, e.g., DualGAN, are utilized for unpaired image-to-image translation to minimize the pixel-level domain gap. In this paper, ResiDualGAN is proposed for RS images translation, where a resizer module is used for addressing the scale discrepancy of RS datasets.
arXiv Detail & Related papers (2022-01-27T13:56:54Z)
Unleashing the Potential of Unsupervised Pre-Training with Intra-Identity Regularization for Person Re-Identification [10.045028405219641]
We design an Unsupervised Pre-training framework for ReID based on the contrastive learning (CL) pipeline, dubbed UP-ReID. We introduce an intra-identity (I$2$-)regularization in the UP-ReID, which is instantiated as two constraints coming from global image aspect and local patch aspect. Our UP-ReID pre-trained model can significantly benefit the downstream ReID fine-tuning and achieve state-of-the-art performance.
arXiv Detail & Related papers (2021-12-01T07:16:37Z)
Learning by Planning: Language-Guided Global Image Editing [53.72807421111136]
We develop a text-to-operation model to map the vague editing language request into a series of editing operations. The only supervision in the task is the target image, which is insufficient for a stable training of sequential decisions. We propose a novel operation planning algorithm to generate possible editing sequences from the target image as pseudo ground truth.
arXiv Detail & Related papers (2021-06-24T16:30:03Z)
SSCR: Iterative Language-Based Image Editing via Self-Supervised Counterfactual Reasoning [79.30956389694184]
Iterative Language-Based Image Editing (IL-BIE) tasks follow iterative instructions to edit images step by step. Data scarcity is a significant issue for ILBIE as it is challenging to collect large-scale examples of images before and after instruction-based changes. We introduce a Self-Supervised Counterfactual Reasoning framework that incorporates counterfactual thinking to overcome data scarcity.
arXiv Detail & Related papers (2020-09-21T01:45:58Z)
Scene Text Image Super-Resolution in the Wild [112.90416737357141]
Low-resolution text images are often seen in natural scenes such as documents captured by mobile phones. Previous single image super-resolution (SISR) methods are trained on synthetic low-resolution images. We pro-pose a real scene text SR dataset, termed TextZoom. It contains paired real low-resolution and high-resolution images captured by cameras with different focal length in the wild.
arXiv Detail & Related papers (2020-05-07T09:18:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.