DiSa: Saliency-Aware Foreground-Background Disentangled Framework for Open-Vocabulary Semantic Segmentation
- URL: http://arxiv.org/abs/2601.20064v1
- Date: Tue, 27 Jan 2026 21:15:10 GMT
- Title: DiSa: Saliency-Aware Foreground-Background Disentangled Framework for Open-Vocabulary Semantic Segmentation
- Authors: Zhen Yao, Xin Li, Taotao Jing, Shuai Zhang, Mooi Choo Chuah
- Abstract summary: Open-vocabulary semantic segmentation aims to assign labels to every pixel in an image based on text labels. Existing approaches typically utilize vision-language models (VLMs), such as CLIP, for dense prediction. We introduce DiSa, a novel saliency-aware foreground-background disentangled framework.
- Score: 16.57245702815661
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Open-vocabulary semantic segmentation aims to assign labels to every pixel in an image based on text labels. Existing approaches typically utilize vision-language models (VLMs), such as CLIP, for dense prediction. However, VLMs, pre-trained on image-text pairs, are biased toward salient, object-centric regions and exhibit two critical limitations when adapted to segmentation: (i) Foreground Bias, which tends to ignore background regions, and (ii) Limited Spatial Localization, resulting in blurred object boundaries. To address these limitations, we introduce DiSa, a novel saliency-aware foreground-background disentangled framework. By explicitly incorporating saliency cues in our designed Saliency-aware Disentanglement Module (SDM), DiSa separately models foreground and background ensemble features in a divide-and-conquer manner. Additionally, we propose a Hierarchical Refinement Module (HRM) that leverages pixel-wise spatial contexts and enables channel-wise feature refinement through multi-level updates. Extensive experiments on six benchmarks demonstrate that DiSa consistently outperforms state-of-the-art methods.
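The abstract names the mechanism but not its details, so the following is a minimal, hypothetical sketch of what saliency-guided foreground-background disentanglement could look like: per-patch saliency softly routes CLIP patch features to a foreground or a background branch, each branch pools an ensemble prototype, and the two branches' logits are fused back per patch. The function name, shapes, residual refinement, and fusion rule are illustrative assumptions, not the authors' SDM.

```python
# Hypothetical sketch of saliency-guided foreground/background
# disentanglement in the spirit of DiSa's SDM; all names and shapes
# are assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def disentangled_logits(patch_feats, text_embeds, saliency, tau=0.07):
    """patch_feats: (N, D) CLIP patch embeddings for one image.
    text_embeds:  (C, D) class-name embeddings from the text encoder.
    saliency:     (N,)  per-patch saliency scores in [0, 1].
    Returns per-patch class logits of shape (N, C)."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Soft masks route each patch to a foreground or background "expert".
    w_fg = saliency.unsqueeze(-1)          # (N, 1)
    w_bg = 1.0 - w_fg

    # Pool an ensemble prototype per branch (divide-and-conquer over regions).
    fg_proto = F.normalize((w_fg * patch_feats).sum(0), dim=-1)
    bg_proto = F.normalize((w_bg * patch_feats).sum(0), dim=-1)

    # Each branch pulls patches toward its prototype before matching text.
    fg_feats = F.normalize(patch_feats + 0.5 * fg_proto, dim=-1)
    bg_feats = F.normalize(patch_feats + 0.5 * bg_proto, dim=-1)

    logits_fg = fg_feats @ text_embeds.T / tau
    logits_bg = bg_feats @ text_embeds.T / tau

    # Saliency decides which branch dominates each patch's prediction.
    return w_fg * logits_fg + w_bg * logits_bg

# Toy usage with random tensors standing in for real CLIP features.
logits = disentangled_logits(torch.randn(196, 512), torch.randn(8, 512),
                             torch.rand(196))
print(logits.shape)  # torch.Size([196, 8])
```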
Related papers
- GS: Generative Segmentation via Label Diffusion [59.380173266566715]
Language-driven image segmentation is a fundamental task in vision-language understanding, requiring models to segment regions of an image corresponding to natural language expressions. Recent diffusion models have been introduced to this domain, but existing approaches remain image-centric. We propose GS (Generative Segmentation), a novel framework that formulates segmentation itself as a generative task via label diffusion. Experimental results show that GS significantly outperforms existing discriminative and diffusion-based methods, setting a new state-of-the-art for language-driven segmentation.
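As a rough illustration of treating segmentation as label diffusion, the sketch below noises a one-hot label map under a standard DDPM schedule and trains a tiny conditional denoiser to predict the noise. The conv stack, schedule, and shapes are placeholders, not the GS architecture.

```python
# Hypothetical sketch of "segmentation as label diffusion": a one-hot
# label map is noised and a conditional denoiser learns to recover it.
import torch
import torch.nn as nn
import torch.nn.functional as F

C, H, W = 4, 32, 32                      # classes and label-map size
denoiser = nn.Sequential(                # stand-in for a conditional U-Net
    nn.Conv2d(C + 3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, C, 3, padding=1),
)

betas = torch.linspace(1e-4, 0.02, 100)  # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

image = torch.randn(1, 3, H, W)                        # conditioning image
labels = F.one_hot(torch.randint(0, C, (1, H, W)), C)  # ground-truth map
x0 = labels.permute(0, 3, 1, 2).float() * 2 - 1        # scale to [-1, 1]

t = torch.randint(0, 100, (1,))
noise = torch.randn_like(x0)
xt = alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * noise

pred = denoiser(torch.cat([xt, image], dim=1))         # predict the noise
loss = F.mse_loss(pred, noise)                         # standard DDPM loss
loss.backward()

# At inference, iterative denoising yields a label map; argmax over the
# class channels gives the final segmentation.
```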
arXiv Detail & Related papers (2025-08-27T16:28:15Z)
- Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
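A compact way to picture the decoupling, under the assumption that "content" comes from query-query attention and "context" from key-key attention (the paper's exact pairing may differ), is the following sketch.

```python
# Illustrative sketch of decoupling a final CLIP self-attention layer
# into "content" (q-q) and "context" (k-k) branches; an assumption in
# the spirit of DeCLIP, not its exact design.
import torch
import torch.nn.functional as F

def decoupled_attention(q, k, v, scale):
    """q, k, v: (N, D) projected patch tokens from the last CLIP block."""
    attn_content = F.softmax(q @ q.T * scale, dim=-1)  # self-similar queries
    attn_context = F.softmax(k @ k.T * scale, dim=-1)  # self-similar keys
    content = attn_content @ v   # discriminative, object-centric features
    context = attn_context @ v   # spatially consistent correlation features
    return content, context

q, k, v = (torch.randn(196, 64) for _ in range(3))
content, context = decoupled_attention(q, k, v, scale=64 ** -0.5)
print(content.shape, context.shape)  # (196, 64) each
```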
arXiv Detail & Related papers (2025-08-15T06:43:51Z)
- FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation [63.31007867379312]
Open-vocabulary segmentation aims to identify and segment specific regions and objects based on text-based descriptions. A common solution is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between vision and text information. However, segmentation tasks require fine-grained pixel-level alignment and detailed category boundary information. We propose FGAseg, a model designed for fine-grained pixel-text alignment and category boundary supplementation.
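For intuition, a minimal sketch of pixel-text alignment: dense visual features are upsampled to pixel resolution and scored against class-name embeddings by cosine similarity. The shapes and bilinear upsampling are assumptions, not FGAseg's actual alignment modules.

```python
# Minimal sketch of fine-grained pixel-text alignment; shapes and the
# upsampling strategy are illustrative assumptions.
import torch
import torch.nn.functional as F

def pixel_text_align(patch_feats, text_embeds, out_size):
    """patch_feats: (D, h, w) dense visual features; text_embeds: (C, D)."""
    feats = F.interpolate(patch_feats[None], size=out_size,
                          mode="bilinear", align_corners=False)[0]
    feats = F.normalize(feats, dim=0)                 # (D, H, W)
    text = F.normalize(text_embeds, dim=-1)           # (C, D)
    sim = torch.einsum("cd,dhw->chw", text, feats)    # per-pixel similarity
    return sim.argmax(0)                              # (H, W) label map

labels = pixel_text_align(torch.randn(512, 14, 14), torch.randn(8, 512),
                          out_size=(224, 224))
print(labels.shape)  # torch.Size([224, 224])
```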
arXiv Detail & Related papers (2025-01-01T15:47:04Z)
- Scale-wise Bidirectional Alignment Network for Referring Remote Sensing Image Segmentation [12.893224628061516]
The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within an aerial image via a natural language expression. We propose an innovative framework called Scale-wise Bidirectional Alignment Network (SBANet) to address the challenges of this task. Our proposed method achieves superior performance in comparison to previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets.
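The bidirectional-alignment idea can be sketched at a single scale as two cross-attention passes, language attending to vision and vision attending to language. Dimensions, head count, and the single-scale setting are simplifying assumptions.

```python
# Illustrative bidirectional cross-attention: each modality is refined by
# the other. A single-scale simplification, not SBANet's full design.
import torch
import torch.nn as nn

D = 256
v2l = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
l2v = nn.MultiheadAttention(D, num_heads=8, batch_first=True)

vis = torch.randn(1, 196, D)    # visual tokens at one scale
txt = torch.randn(1, 12, D)     # word tokens of the referring expression

txt_refined, _ = v2l(query=txt, key=vis, value=vis)  # language <- vision
vis_refined, _ = l2v(query=vis, key=txt, value=txt)  # vision <- language
print(vis_refined.shape, txt_refined.shape)
```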
arXiv Detail & Related papers (2025-01-01T14:24:04Z)
- HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior [62.04939047885834]
We present HoliSDiP, a framework that leverages semantic segmentation to provide both precise textual and spatial guidance for Real-ISR. Our method employs semantic labels as concise text prompts while introducing dense semantic guidance through segmentation masks and our proposed spatial-CLIP Map.
arXiv Detail & Related papers (2024-11-27T15:22:44Z)
- DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation.
We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity.
In particular, our approach enables a more efficient end-to-end process as a single-stage method.
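As a hedged sketch of text-driven seed generation in this spirit, a class-name embedding can score dense image features to yield a CAM-like localization map; the names and shapes below are illustrative, not DALNet's design.

```python
# Sketch of a text-driven class activation map: dense features scored
# against a class-name embedding give coarse localization cues.
import torch
import torch.nn.functional as F

def text_cam(dense_feats, class_embed):
    """dense_feats: (D, h, w) backbone features; class_embed: (D,)."""
    f = F.normalize(dense_feats, dim=0)
    t = F.normalize(class_embed, dim=0)
    cam = torch.einsum("dhw,d->hw", f, t)       # cosine score per location
    cam = cam.clamp(min=0)                      # keep positive evidence
    return cam / cam.max().clamp(min=1e-6)      # normalize to [0, 1]

cam = text_cam(torch.randn(512, 28, 28), torch.randn(512))
print(cam.shape, float(cam.max()))  # torch.Size([28, 28]) 1.0
```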
arXiv Detail & Related papers (2024-09-24T06:51:49Z)
- MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation [26.667974865352708]
MROVSeg is a multi-resolution training framework for open-vocabulary image segmentation with a single pretrained CLIP backbone. It uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder.
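The sliding-window mechanism itself is easy to sketch: slice the high-resolution image into encoder-sized crops, encode each crop independently, and stitch the per-crop feature maps back into a full grid. The stand-in encoder and strides below are assumptions.

```python
# Sketch of sliding-window encoding of a high-resolution input; the
# encoder here is a stand-in, not MROVSeg's CLIP backbone.
import torch

def windowed_encode(image, encoder, win=224, feat_stride=16):
    """image: (3, H, W) with H and W divisible by `win`."""
    _, H, W = image.shape
    fh, fw = win // feat_stride, win // feat_stride
    out = torch.zeros(64, H // feat_stride, W // feat_stride)
    for top in range(0, H, win):
        for left in range(0, W, win):
            crop = image[:, top:top + win, left:left + win]
            feat = encoder(crop[None])[0]             # (64, fh, fw)
            ft, fl = top // feat_stride, left // feat_stride
            out[:, ft:ft + fh, fl:fl + fw] = feat     # stitch back in place
    return out

# Stand-in encoder: any module mapping (1, 3, 224, 224) -> (1, 64, 14, 14).
encoder = torch.nn.Conv2d(3, 64, kernel_size=16, stride=16)
feats = windowed_encode(torch.randn(3, 448, 448), encoder)
print(feats.shape)  # torch.Size([64, 28, 28])
```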
arXiv Detail & Related papers (2024-08-27T04:45:53Z)
- A Simple Framework for Open-Vocabulary Zero-Shot Segmentation [50.58626342189163]
SimZSS is a framework for open-vocabulary zero-shot segmentation. It exploits the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions. SimZSS achieves state-of-the-art results on 7 out of 8 benchmark datasets in less than 15 minutes.
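A toy version of pinpointing local concepts: noun tokens parsed from a caption are matched to their best-aligned image patches by cosine similarity. The noun list and random embeddings are placeholders for real parser and encoder outputs.

```python
# Toy sketch of caption-concept grounding; all embeddings are random
# placeholders for real text/image encoder outputs.
import torch
import torch.nn.functional as F

nouns = ["dog", "frisbee"]                  # concepts parsed from a caption
noun_embeds = F.normalize(torch.randn(2, 512), dim=-1)    # text encoder out
region_feats = F.normalize(torch.randn(196, 512), dim=-1) # patch features

sim = noun_embeds @ region_feats.T          # (2, 196) concept-region scores
best_region = sim.argmax(dim=-1)            # index of best patch per noun
for noun, idx in zip(nouns, best_region.tolist()):
    print(f"{noun!r} grounds to patch {idx}")
```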
arXiv Detail & Related papers (2024-06-23T11:57:08Z)
- Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation [44.008094698200026]
FreeDA is a training-free diffusion-augmented method for open-vocabulary semantic segmentation.
FreeDA achieves state-of-the-art performance on five datasets.
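The training-free matching step can be sketched as nearest-prototype classification: region features are scored against an offline bank of per-class prototypes (which FreeDA builds with diffusion-generated imagery; here the bank is random) and each region takes the label of its best class.

```python
# Sketch of training-free nearest-prototype matching; the prototype bank
# is random here, standing in for FreeDA's offline diffusion-built bank.
import torch
import torch.nn.functional as F

prototypes = F.normalize(torch.randn(8, 20, 512), dim=-1)  # (classes, K, D)
regions = F.normalize(torch.randn(50, 512), dim=-1)        # region features

# Score each region against every prototype, keep the best per class,
# then label the region with its highest-scoring class.
sim = torch.einsum("rd,ckd->rck", regions, prototypes)     # (50, 8, 20)
class_scores = sim.max(dim=-1).values                      # best prototype
labels = class_scores.argmax(dim=-1)                       # (50,) class ids
print(labels.shape)
```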
arXiv Detail & Related papers (2024-04-09T18:00:25Z)
- EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models [52.3015009878545]
We develop an image segmentor capable of generating fine-grained segmentation maps without any additional training.
Our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps.
In extensive experiments, the produced segmentation maps are demonstrated to be well delineated and capture detailed parts of the images.
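A heavily simplified sketch of the correspondence idea: cluster the low-resolution feature map into segments with k-means, then assign each pixel to the centroid that best matches its upsampled feature. EmerDiff's actual procedure differs; shapes and k are illustrative.

```python
# Simplified sketch of pixel-to-feature correspondence via k-means over
# low-resolution features; not EmerDiff's actual procedure.
import torch
import torch.nn.functional as F

def segment(low_res_feats, out_size, k=5, iters=10):
    """low_res_feats: (D, h, w), e.g. a diffusion model's bottleneck map."""
    D, h, w = low_res_feats.shape
    x = low_res_feats.reshape(D, -1).T                  # (h*w, D)
    centroids = x[torch.randperm(len(x))[:k]]           # k-means init
    for _ in range(iters):
        assign = torch.cdist(x, centroids).argmin(-1)   # nearest centroid
        for j in range(k):                              # update centroids
            if (assign == j).any():
                centroids[j] = x[assign == j].mean(0)
    # Upsample features to pixel resolution and re-assign per pixel.
    up = F.interpolate(low_res_feats[None], size=out_size,
                       mode="bilinear", align_corners=False)[0]
    pix = up.reshape(D, -1).T                           # (H*W, D)
    return torch.cdist(pix, centroids).argmin(-1).reshape(out_size)

mask = segment(torch.randn(256, 16, 16), out_size=(128, 128))
print(mask.shape, mask.unique())  # torch.Size([128, 128]) cluster ids
```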
arXiv Detail & Related papers (2024-01-22T07:34:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.