Related papers: Boosting Weakly-Supervised Referring Image Segmentation via Progressive Comprehension

Boosting Weakly-Supervised Referring Image Segmentation via Progressive Comprehension

URL: http://arxiv.org/abs/2410.01544v2
Date: Fri, 22 Nov 2024 04:32:55 GMT
Title: Boosting Weakly-Supervised Referring Image Segmentation via Progressive Comprehension
Authors: Zaiquan Yang, Yuhao Liu, Jiaying Lin, Gerhard Hancke, Rynson W. H. Lau,
Abstract summary: This paper focuses on a challenging setup where target localization is learned directly from image-text pairs. We propose a novel Progressive Network (PCNet) to leverage target-related textual cues for progressively localizing the target object. Our method outperforms SOTA methods on three common benchmarks.
Score: 40.21084218601082
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper explores the weakly-supervised referring image segmentation (WRIS) problem, and focuses on a challenging setup where target localization is learned directly from image-text pairs. We note that the input text description typically already contains detailed information on how to localize the target object, and we also observe that humans often follow a step-by-step comprehension process (\ie, progressively utilizing target-related attributes and relations as cues) to identify the target object. Hence, we propose a novel Progressive Comprehension Network (PCNet) to leverage target-related textual cues from the input description for progressively localizing the target object. Specifically, we first use a Large Language Model (LLM) to decompose the input text description into short phrases. These short phrases are taken as target-related cues and fed into a Conditional Referring Module (CRM) in multiple stages, to allow updating the referring text embedding and enhance the response map for target localization in a multi-stage manner. Based on the CRM, we then propose a Region-aware Shrinking (RaS) loss to constrain the visual localization to be conducted progressively in a coarse-to-fine manner across different stages. Finally, we introduce an Instance-aware Disambiguation (IaD) loss to suppress instance localization ambiguity by differentiating overlapping response maps generated by different referring texts on the same image. Extensive experiments show that our method outperforms SOTA methods on three common benchmarks.

Related papers

Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration [64.12127577975696]
Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications.<n>Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively.<n>We propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration.
arXiv Detail & Related papers (2026-01-20T15:17:14Z)
Dual-Granularity Semantic Prompting for Language Guidance Infrared Small Target Detection [102.1314414263959]
Infrared small target detection remains challenging due to limited feature representation and severe background interference.<n>We propose DGSPNet, an end-to-end language prompt-driven framework.<n>Our method significantly improves detection accuracy and achieves state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2025-11-24T16:58:23Z)
Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization [22.58434223222062]
We propose a new few-shot temporal action localization method by Chain-of-Thought textual reasoning to improve localization performance. Specifically, we design a novel few-shot learning framework that leverages textual semantic information to enhance the model's ability to capture action commonalities and variations. We conduct extensive experiments on the publicly available ActivityNet1.3 and THUMOS14 datasets.
arXiv Detail & Related papers (2025-04-18T04:35:35Z)
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning [77.2852342808769]
In this paper, we introduce a detailed caption benchmark, termed as CompreCap, to evaluate the visual context from a directed scene graph view. We first manually segment the image into semantically meaningful regions according to common-object vocabulary, while also distinguishing attributes of objects within all those regions. Then directional relation labels of these objects are annotated to compose a directed scene graph that can well encode rich compositional information of the image.
arXiv Detail & Related papers (2024-12-11T18:37:42Z)
DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation. We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity. Our approach, in particular, allows for more efficient end-to-end process as a single-stage method.
arXiv Detail & Related papers (2024-09-24T06:51:49Z)
Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation [27.95875467352853]
We propose a new referring remote sensing image segmentation method, FIANet, that fully exploits the visual and linguistic representations. The proposed fine-grained image-text alignment module (FIAM) would simultaneously leverage the features of the input image and the corresponding texts. We evaluate the effectiveness of the proposed methods on two public referring remote sensing datasets including RefSegRS and RRSIS-D.
arXiv Detail & Related papers (2024-09-20T16:45:32Z)
SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer. Our method replaces original language-independent encoding with cross-modal encoding in visual analysis. Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z)
Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation [37.15828464616587]
Class Activation Map (CAM) has emerged as a popular tool for weakly supervised semantic segmentation. We propose a novel Question-Answer Cross-Language-Image Matching framework for WSSS (QA-CLIMS)
arXiv Detail & Related papers (2024-01-18T10:55:13Z)
Referring Image Segmentation Using Text Supervision [44.27304699305985]
Existing Referring Image (RIS) methods typically require expensive pixel-level or box-level annotations for supervision. We propose a novel weakly-supervised RIS framework to formulate the target localization problem as a classification process. Our framework achieves promising performances to existing fully-supervised RIS methods while outperforming state-of-the-art weakly-supervised methods adapted from related areas.
arXiv Detail & Related papers (2023-08-28T13:40:47Z)
Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred by a natural language expression. We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches. In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target. Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
CLIP the Gap: A Single Domain Generalization Approach for Object Detection [60.20931827772482]
Single Domain Generalization tackles the problem of training a model on a single source domain so that it generalizes to any unseen target domain. We propose to leverage a pre-trained vision-language model to introduce semantic domain concepts via textual prompts. We achieve this via a semantic augmentation strategy acting on the features extracted by the detector backbone, as well as a text-based classification loss.
arXiv Detail & Related papers (2023-01-13T12:01:18Z)
Fully and Weakly Supervised Referring Expression Segmentation with End-to-End Learning [50.40482222266927]
Referring Expression (RES) is aimed at localizing and segmenting the target according to the given language expression. We propose a parallel position- kernel-segmentation pipeline to better isolate and then interact with the localization and segmentation steps. Our method is simple but surprisingly effective, outperforming all previous state-of-the-art RES methods on fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-12-17T08:29:33Z)
Weakly-supervised segmentation of referring expressions [81.73850439141374]
Text grounded semantic SEGmentation learns segmentation masks directly from image-level referring expressions without pixel-level annotations. Our approach demonstrates promising results for weakly-supervised referring expression segmentation on the PhraseCut and RefCOCO datasets.
arXiv Detail & Related papers (2022-05-10T07:52:24Z)
Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred by a natural language expression. Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask. We present a "Then-Then-Segment" scheme to tackle these problems. Our framework is simple but surprisingly effective.
arXiv Detail & Related papers (2021-03-30T12:25:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.