Language Guided Local Infiltration for Interactive Image Retrieval
- URL: http://arxiv.org/abs/2304.07747v1
- Date: Sun, 16 Apr 2023 10:33:08 GMT
- Title: Language Guided Local Infiltration for Interactive Image Retrieval
- Authors: Fuxiang Huang and Lei Zhang
- Abstract summary: Interactive Image Retrieval (IIR) aims to retrieve images that are generally similar to the reference image but under the requested text modification.
We propose a Language Guided Local Infiltration (LGLI) system, which fully utilizes the text information and penetrates text features into image features.
Our method outperforms most state-of-the-art IIR approaches.
- Score: 12.324893780690918
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Interactive Image Retrieval (IIR) aims to retrieve images that are generally similar to the reference image but under the requested text modification. Existing methods usually concatenate or sum the image and text features in a simple, coarse manner, which makes it difficult to precisely change the local semantics of the image that the text intends to modify. To solve this problem, we propose a Language Guided Local Infiltration (LGLI) system, which fully utilizes the text information and infiltrates text features into the image features as much as possible. Specifically, we first propose a Language Prompt Visual Localization (LPVL) module to generate a localization mask which explicitly locates the region (semantics) intended to be modified. Then we introduce a Text Infiltration with Local Awareness (TILA) module, deployed in the network to precisely modify the reference image and generate an image-text infiltrated representation. Extensive experiments on various benchmark databases validate that our method outperforms most state-of-the-art IIR approaches.
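The abstract describes a two-stage design: LPVL predicts a localization mask from the modification text, and TILA uses that mask to inject text features into the corresponding image regions. The paper's implementation is not shown here, so the following is only a minimal PyTorch sketch under assumed tensor shapes and illustrative module internals (the classes `LPVL` and `TILA` borrow the abstract's names, but everything inside them is our own stand-in).
```python
import torch
import torch.nn as nn

class LPVL(nn.Module):
    """Language Prompt Visual Localization (sketch): predicts a spatial mask
    over the image feature map from the modification-text feature."""
    def __init__(self, dim):
        super().__init__()
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, C, H, W), txt_feat: (B, C) -- shapes are assumptions.
        q = self.text_proj(txt_feat)                       # (B, C)
        attn = torch.einsum("bchw,bc->bhw", img_feat, q)   # text-image affinity map
        return torch.sigmoid(attn).unsqueeze(1)            # (B, 1, H, W), values in [0, 1]

class TILA(nn.Module):
    """Text Infiltration with Local Awareness (sketch): fuses text into the
    image features only where the localization mask is active."""
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, img_feat, txt_feat, mask):
        B, C, H, W = img_feat.shape
        txt_map = txt_feat[:, :, None, None].expand(B, C, H, W)
        fused = self.fuse(torch.cat([img_feat, txt_map], dim=1))
        # Masked regions take the fused (text-infiltrated) features;
        # the rest keeps the original reference-image features.
        return mask * fused + (1.0 - mask) * img_feat

# Toy usage with random tensors standing in for image/text encoder outputs.
img_feat, txt_feat = torch.randn(2, 256, 14, 14), torch.randn(2, 256)
mask = LPVL(256)(img_feat, txt_feat)
out = TILA(256)(img_feat, txt_feat, mask)        # (2, 256, 14, 14)
```
The point the sketch tries to capture is that the text modifies the image representation only where the mask is active, leaving the rest of the reference features intact.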
Related papers
- Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation [27.95875467352853]
We propose a new referring remote sensing image segmentation method, FIANet, that fully exploits the visual and linguistic representations.
The proposed fine-grained image-text alignment module (FIAM) simultaneously leverages the features of the input image and the corresponding texts.
We evaluate the effectiveness of the proposed methods on two public referring remote sensing datasets including RefSegRS and RRSIS-D.
arXiv Detail & Related papers (2024-09-20T16:45:32Z)
- CLIM: Contrastive Language-Image Mosaic for Region Representation [58.05870131126816]
Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations.
CLIM consistently improves different open-vocabulary object detection methods.
It can effectively enhance the region representation of vision-language models.
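The summary suggests a mosaic-and-contrast recipe: several images are tiled into one large image, each tile is treated as a pseudo region, and the pooled region features are contrasted against the corresponding text embeddings. The sketch below is our own illustrative reading of that idea (toy backbone, simple average pooling per cell, InfoNCE-style loss), not CLIM's actual code.
```python
import torch
import torch.nn.functional as F

def mosaic_region_text_loss(images, text_emb, backbone):
    """Mosaic-style region-text contrastive loss (illustrative sketch).
    images: (4, 3, H, W) tiles; text_emb: (4, D) caption embeddings."""
    # Tile the four images into a single 2x2 mosaic.
    top = torch.cat([images[0], images[1]], dim=2)          # concat along width
    bottom = torch.cat([images[2], images[3]], dim=2)
    mosaic = torch.cat([top, bottom], dim=1).unsqueeze(0)   # (1, 3, 2H, 2W)

    feat = backbone(mosaic)                                  # (1, D, h, w), assumed
    h, w = feat.shape[-2:]
    # One pooled region embedding per mosaic cell (i.e. per original image).
    cells = [feat[..., :h // 2, :w // 2], feat[..., :h // 2, w // 2:],
             feat[..., h // 2:, :w // 2], feat[..., h // 2:, w // 2:]]
    region_emb = torch.stack([c.mean(dim=(-2, -1)).squeeze(0) for c in cells])

    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = region_emb @ text_emb.t() / 0.07                # temperature-scaled similarity
    return F.cross_entropy(logits, torch.arange(4))          # match cell i with caption i

# Toy usage: a strided conv stands in for a vision encoder.
backbone = torch.nn.Conv2d(3, 512, kernel_size=32, stride=32)
loss = mosaic_region_text_loss(torch.randn(4, 3, 64, 64), torch.randn(4, 512), backbone)
```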
arXiv Detail & Related papers (2023-12-18T17:39:47Z)
- Text-Driven Image Editing via Learnable Regions [74.45313434129005]
We introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches.
We show that this simple approach enables flexible editing that is compatible with current image generation models.
Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions.
arXiv Detail & Related papers (2023-11-28T02:27:31Z)
- Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement [59.66539728681453]
Scene text image super-resolution (STISR) aims to improve image quality while boosting downstream scene text recognition accuracy.
Most existing methods treat the foreground (character regions) and background (non-character regions) equally in the forward process.
We propose a novel method LEMMA that explicitly models character regions to produce high-level text-specific guidance for super-resolution.
arXiv Detail & Related papers (2023-07-19T05:08:47Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search [61.24539128142504]
Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text.
Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities.
We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
arXiv Detail & Related papers (2022-08-30T16:14:18Z)
- FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869]
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms.
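Read literally, the summary describes two steps: form a single target point in CLIP space from the input image and the instruction, then iteratively optimize the image toward it under regularization. The sketch below illustrates that loop in pixel space with the openai `clip` package and a single proximity regularizer; FlexIT's actual auto-encoder latent space and its specific regularization terms are not reproduced here.
```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 so pixel gradients are well-behaved

def edit_toward_text(image, instruction, alpha=0.6, steps=100, lr=0.05):
    """Push `image` (1, 3, 224, 224, CLIP-normalized, on `device`) toward a
    single target point mixing its own CLIP embedding with the text embedding."""
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(clip.tokenize([instruction]).to(device))
        target = (1 - alpha) * img_emb + alpha * txt_emb       # the "target point"
        target = target / target.norm(dim=-1, keepdim=True)

    x = image.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        emb = model.encode_image(x)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        loss = 1 - (emb * target).sum()                # cosine distance to the target
        loss = loss + 0.1 * (x - image).pow(2).mean()  # crude "stay close" regularizer
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```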
arXiv Detail & Related papers (2022-03-09T13:34:38Z)
- Blended Diffusion for Text-driven Editing of Natural Images [18.664733153082146]
We introduce the first solution for performing local (region-based) edits in generic natural images.
We achieve our goal by leveraging and combining a pretrained language-image model (CLIP) with a denoising diffusion probabilistic model (DDPM).
To seamlessly fuse the edited region with the unchanged parts of the image, we spatially blend noised versions of the input image with the local text-guided diffusion latent.
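The blending step described above can be written as a simple per-timestep operation: inside the edited region keep the text-guided diffusion latent, outside it keep a noised copy of the original image so that both parts share the same noise level. The snippet below is an illustrative stand-in, with `q_sample` a hypothetical helper for the forward noising process rather than code from the paper.
```python
import torch

def q_sample(x0, t, alphas_cumprod):
    """Hypothetical forward-noising helper: sample x_t ~ q(x_t | x_0)."""
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * torch.randn_like(x0)

def blend_step(latent_t, source_image, mask, t, alphas_cumprod):
    """Inside the mask keep the text-guided diffusion latent; outside it keep a
    noised copy of the input so the background is preserved."""
    noised_source = q_sample(source_image, t, alphas_cumprod)
    return mask * latent_t + (1.0 - mask) * noised_source

# Toy usage: linear beta schedule and random tensors standing in for real data.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
latent_t = torch.randn(1, 3, 64, 64)
source = torch.randn(1, 3, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1.0                  # edit only the central region
x_t = blend_step(latent_t, source, mask, t=500, alphas_cumprod=alphas_cumprod)
```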
arXiv Detail & Related papers (2021-11-29T18:58:49Z)
- Integrating Image Captioning with Rule-based Entity Masking [23.79124007406315]
We propose a novel framework for image captioning with an explicit object (e.g., knowledge graph entity) selection process.
The model first explicitly selects which local entities to include in the caption according to a human-interpretable mask, then generates proper captions by attending to the selected entities.
arXiv Detail & Related papers (2020-07-22T21:27:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.