Related papers: Zero-shot Referring Image Segmentation with Global-Local Context Features

URL: http://arxiv.org/abs/2303.17811v2
Date: Mon, 3 Apr 2023 08:58:36 GMT
Title: Zero-shot Referring Image Segmentation with Global-Local Context Features
Authors: Seonghoon Yu, Paul Hongsuck Seo, Jeany Son
Abstract summary: Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image. We propose a simple yet effective zero-shot referring image segmentation method by leveraging the pre-trained cross-modal knowledge from CLIP. In our experiments, the proposed method outperforms several zero-shot baselines of the task and even the weakly supervised referring expression segmentation method with substantial margins.
Score: 8.77461711080319
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image. Collecting labelled datasets for this task, however, is notoriously costly and labor-intensive. To overcome this issue, we propose a simple yet effective zero-shot referring image segmentation method by leveraging the pre-trained cross-modal knowledge from CLIP. In order to obtain segmentation masks grounded to the input text, we propose a mask-guided visual encoder that captures global and local contextual information of an input image. By utilizing instance masks obtained from off-the-shelf mask proposal techniques, our method is able to segment fine-detailed instance-level groundings. We also introduce a global-local text encoder where the global feature captures complex sentence-level semantics of the entire input expression while the local feature focuses on the target noun phrase extracted by a dependency parser. In our experiments, the proposed method outperforms several zero-shot baselines of the task and even the weakly supervised referring expression segmentation method with substantial margins. Our code is available at https://github.com/Seonghoon-Yu/Zero-shot-RIS.
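
The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see their repository for that). It assumes that each mask proposal already has a global-context and a local-context visual feature, and that the expression already has a sentence-level and a noun-phrase feature, all in one shared embedding space such as CLIP's. The function names `blend` and `select_mask`, the mixing weights `alpha` and `beta`, and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def blend(first, second, weight):
    """Mix two unit-normalized features: weight * first + (1 - weight) * second."""
    first, second = normalize(first), normalize(second)
    mixed = [weight * a + (1.0 - weight) * b for a, b in zip(first, second)]
    return normalize(mixed)


def select_mask(global_visual, local_visual, sentence_feat, phrase_feat,
                alpha=0.5, beta=0.5):
    """Return (index of best mask proposal, list of cosine scores).

    global_visual / local_visual: one feature vector per mask proposal
    (assumed to come from some pre-trained image encoder).
    sentence_feat / phrase_feat: features of the full expression and of its
    target noun phrase (assumed to come from the matching text encoder).
    alpha, beta: hypothetical mixing weights, not values from the paper.
    """
    text = blend(sentence_feat, phrase_feat, beta)
    scores = []
    for g_feat, l_feat in zip(global_visual, local_visual):
        visual = blend(g_feat, l_feat, alpha)
        scores.append(sum(v * t for v, t in zip(visual, text)))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy usage with made-up 2-D features for two mask proposals.
best_index, all_scores = select_mask(
    global_visual=[[1.0, 0.0], [0.0, 1.0]],
    local_visual=[[1.0, 0.0], [0.0, 1.0]],
    sentence_feat=[0.0, 1.0],
    phrase_feat=[0.0, 1.0],
)
```

In this sketch the mask whose blended visual feature has the highest cosine similarity to the blended text feature is taken as the prediction. In practice the features would come from a real encoder and the mask proposals from an off-the-shelf instance segmentation model.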

Related papers

Text4Seg++: Advancing Image Segmentation via Generative Language Modeling [52.07442359419673]
We propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem. Key innovation is semantic descriptors, a new textual representation of segmentation masks. Experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models.
arXiv Detail & Related papers (2025-09-08T04:07:14Z)
SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model [61.97017867656831]
We introduce a new task, i.e., geospatial pixel reasoning, which allows implicit querying and reasoning and generates the mask of the target region. We construct and release the first large-scale benchmark dataset called EarthReason, which comprises 5,434 manually annotated image masks with over 30,000 implicit question-answer pairs. SegEarth-R1 achieves state-of-the-art performance on both reasoning and referring segmentation tasks, significantly outperforming traditional and LLM-based segmentation methods.
arXiv Detail & Related papers (2025-04-13T16:36:47Z)
HARIS: Human-Like Attention for Reference Image Segmentation [5.808325471170541]
We propose a referring image segmentation method called HARIS, which introduces the Human-Like Attention mechanism. Our method achieves state-of-the-art performance and great zero-shot ability.
arXiv Detail & Related papers (2024-05-17T11:29:23Z)
Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework. TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing. The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
arXiv Detail & Related papers (2023-10-27T10:52:50Z)
Learning Semantic Segmentation with Query Points Supervision on Aerial Images [57.09251327650334]
We present a weakly supervised learning algorithm to train semantic segmentation algorithms. Our proposed approach performs accurate semantic segmentation and improves efficiency by significantly reducing the cost and time required for manual annotation.
arXiv Detail & Related papers (2023-09-11T14:32:04Z)
Self-supervised Scene Text Segmentation with Object-centric Layered Representations Augmented by Text Regions [22.090074821554754]
We propose a self-supervised scene text segmentation algorithm with layered decoupling of representations derived from the object-centric manner to segment images into texts and background. On several public scene text datasets, our method outperforms the state-of-the-art unsupervised segmentation algorithms.
arXiv Detail & Related papers (2023-08-25T05:00:05Z)
Fully and Weakly Supervised Referring Expression Segmentation with End-to-End Learning [50.40482222266927]
Referring Expression Segmentation (RES) aims to localize and segment the target according to the given language expression. We propose a parallel position-kernel-segmentation pipeline to better isolate the localization and segmentation steps and then let them interact. Our method is simple but surprisingly effective, outperforming all previous state-of-the-art RES methods on fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-12-17T08:29:33Z)
Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs [10.484851004093919]
We tackle open-world semantic segmentation, which aims at learning to segment arbitrary visual concepts in images. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts. We propose a novel Text-grounded Contrastive Learning framework that enables a model to directly learn region-text alignment.
arXiv Detail & Related papers (2022-12-01T18:59:03Z)
Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation [75.00151934315967]
MaskDistill is a novel framework for unsupervised semantic segmentation. Our framework does not latch onto low-level image cues and is not limited to object-centric datasets.
arXiv Detail & Related papers (2022-06-13T17:59:43Z)
Weakly-supervised segmentation of referring expressions [81.73850439141374]
Text grounded semantic SEGmentation (TSEG) learns segmentation masks directly from image-level referring expressions without pixel-level annotations. Our approach demonstrates promising results for weakly-supervised referring expression segmentation on the PhraseCut and RefCOCO datasets.
arXiv Detail & Related papers (2022-05-10T07:52:24Z)
Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred to by a natural language expression. Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask. We present a "Locate-Then-Segment" scheme to tackle these problems. Our framework is simple but surprisingly effective.
arXiv Detail & Related papers (2021-03-30T12:25:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.