Related papers: Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation

URL: http://arxiv.org/abs/2409.13637v2
Date: Fri, 27 Dec 2024 02:57:13 GMT
Title: Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation
Authors: Sen Lei, Xinyu Xiao, Tianlin Zhang, Heng-Chao Li, Zhenwei Shi, Qing Zhu
Abstract summary: We propose a new referring remote sensing image segmentation method to fully exploit the visual and linguistic representations. The proposed fine-grained image-text alignment module (FIAM) would simultaneously leverage the features of the input image and the corresponding texts. We evaluate the effectiveness of the proposed method on two public referring remote sensing datasets including RefSegRS and RRSIS-D.
Score: 27.13782704236074
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Given a language expression, referring remote sensing image segmentation (RRSIS) aims to identify ground objects and assign pixel-wise labels within the imagery. One of the key challenges for this task is to capture discriminative multi-modal features via text-image alignment. However, the existing RRSIS methods use one vanilla and coarse alignment, where the language expression is directly extracted to be fused with the visual features. In this paper, we argue that a "fine-grained image-text alignment" can improve the extraction of multi-modal information. To this end, we propose a new referring remote sensing image segmentation method to fully exploit the visual and linguistic representations. Specifically, the original referring expression is regarded as context text, which is further decoupled into the ground object and spatial position texts. The proposed fine-grained image-text alignment module (FIAM) would simultaneously leverage the features of the input image and the corresponding texts, obtaining better discriminative multi-modal representation. Meanwhile, to handle the various scales of ground objects in remote sensing, we introduce a Text-aware Multi-scale Enhancement Module (TMEM) to adaptively perform cross-scale fusion and intersections. We evaluate the effectiveness of the proposed method on two public referring remote sensing datasets including RefSegRS and RRSIS-D, and our method obtains superior performance over several state-of-the-art methods. The code will be publicly available at https://github.com/Shaosifan/FIANet.
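
Illustrative sketch (not the authors' code): the abstract describes aligning image features with three text branches (context, ground object, spatial position). The toy example below shows one generic way such a multi-branch alignment could look, using plain scaled dot-product cross-attention. The function names (`cross_attend`, `fuse_branches`), the averaging fusion rule, and the 2-D toy features are all assumptions made for illustration. The actual FIAM design may differ substantially; see the FIANet repository for the real implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(pixels, tokens):
    """Each pixel feature attends over the token features of one text branch."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    attended = []
    for p in pixels:
        weights = softmax([dot(p, t) / scale for t in tokens])
        attended.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return attended


def fuse_branches(pixels, branches):
    """Hypothetical fusion: average the per-branch attended text features
    and add the result to each pixel feature (a residual connection)."""
    per_branch = [cross_attend(pixels, toks) for toks in branches.values()]
    fused = []
    for i, p in enumerate(pixels):
        mean_text = [
            sum(b[i][d] for b in per_branch) / len(per_branch)
            for d in range(len(p))
        ]
        fused.append([x + y for x, y in zip(p, mean_text)])
    return fused


# Toy 2-D features: two "pixels" and three text branches (assumed values).
pixels = [[1.0, 0.0], [0.0, 1.0]]
branches = {
    "context": [[1.0, 1.0]],               # whole referring expression
    "object": [[1.0, 0.0], [0.0, 1.0]],    # ground-object words
    "position": [[0.5, 0.5]],              # spatial-position words
}
fused = fuse_branches(pixels, branches)
```

In this toy setup each pixel is pulled toward the object token it matches best, while the context and position branches contribute the same offset to every pixel.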

Related papers

Zero-Shot Chinese Character Recognition with Hierarchical Multi-Granularity Image-Text Aligning [52.92837273570818]
Chinese characters exhibit unique structures and compositional rules, allowing for the use of fine-grained semantic information in representation. We propose a Hierarchical Multi-Granularity Image-Text Aligning (Hi-GITA) framework based on a contrastive paradigm. Our proposed Hi-GITA outperforms existing zero-shot CCR methods.
arXiv Detail & Related papers (2025-05-30T17:39:14Z)
RSRefSeg: Referring Remote Sensing Image Segmentation with Foundation Models [24.67117013862316]
Referring remote sensing image segmentation is crucial for achieving fine-grained visual understanding. We introduce a referring remote sensing image segmentation foundational model, RSRefSeg. Experimental results on the RRSIS-D dataset demonstrate that RSRefSeg outperforms existing methods.
arXiv Detail & Related papers (2025-01-12T13:22:35Z)
Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval [37.775529830620016]
Remote Sensing Image-Text Retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain. Current multi-scale RSITR approaches typically align multi-scale fused image features with text features, but overlook aligning image-text pairs at distinct scales separately. We introduce a novel Multi-Scale Alignment (MSA) method to overcome this limitation.
arXiv Detail & Related papers (2024-05-29T10:19:11Z)
CLIM: Contrastive Language-Image Mosaic for Region Representation [58.05870131126816]
Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations. CLIM consistently improves different open-vocabulary object detection methods. It can effectively enhance the region representation of vision-language models.
arXiv Detail & Related papers (2023-12-18T17:39:47Z)
Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning [49.48946808024608]
We propose a novel two-stage vision-language pre-training-based approach to bootstrap interactive image-text alignment for remote sensing image captioning, called BITA. Specifically, the first stage involves preliminary alignment through image-text contrastive learning. In the second stage, the interactive Fourier Transformer connects the frozen image encoder with a large language model.
arXiv Detail & Related papers (2023-12-02T17:32:17Z)
Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred by a natural language expression. We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches. In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target. Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement [59.66539728681453]
Scene text image super-resolution (STISR) aims to improve image quality while boosting downstream scene text recognition accuracy. Most existing methods treat the foreground (character regions) and background (non-character regions) equally in the forward process. We propose a novel method LEMMA that explicitly models character regions to produce high-level text-specific guidance for super-resolution.
arXiv Detail & Related papers (2023-07-19T05:08:47Z)
Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions. We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
Language Guided Local Infiltration for Interactive Image Retrieval [12.324893780690918]
Interactive Image Retrieval (IIR) aims to retrieve images that are generally similar to the reference image but under requested text modification. We propose a Language Guided Local Infiltration (LGLI) system, which fully utilizes the text information and penetrates text features into image features. Our method outperforms most state-of-the-art IIR approaches.
arXiv Detail & Related papers (2023-04-16T10:33:08Z)
SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the user provides a segmentation map. We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment. Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions. The StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN. Visual-linguistic similarity learns the text-image matching by mapping the image and text into a common embedding space. Instance-level optimization is for identity preservation in manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.