HVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment
- URL: http://arxiv.org/abs/2506.13925v2
- Date: Wed, 13 Aug 2025 07:32:09 GMT
- Title: HVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment
- Authors: Numair Nadeem, Saeed Anwar, Muhammad Hamza Asad, Abdul Bais
- Abstract summary: We propose a unified Vision-Language framework that integrates domain-invariant text embeddings as object queries in a transformer-based segmentation network. Our results show that language-guided segmentation bridges the label efficiency gap and enables new levels of fine-grained generalization.
- Score: 16.926158907882012
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we address Semi-supervised Semantic Segmentation (SSS) under domain shift by leveraging domain-invariant semantic knowledge from text embeddings of Vision-Language Models (VLMs). We propose a unified Hierarchical Vision-Language framework (HVL) that integrates domain-invariant text embeddings as object queries in a transformer-based segmentation network to improve generalization and reduce misclassification under limited supervision. These textual queries are used to group pixels with shared semantics under SSS. HVL is designed to (1) generate textual queries that maximally encode domain-invariant semantics from the VLM while capturing intra-class variations; and (2) align these queries with spatial visual features to enhance their segmentation ability and improve the semantic clarity of visual features. We also introduce targeted regularization losses that maintain vision-language alignment throughout training to reinforce semantic understanding. HVL establishes a new state-of-the-art, achieving a +9.3% improvement in mean Intersection over Union (mIoU) on COCO with 232 labelled images, +3.1% on Pascal VOC with 92 labels, +4.8% on ADE20K with 316 labels, and +3.4% on Cityscapes with 100 labels, demonstrating superior performance with less than 1% supervision on four benchmark datasets. Our results show that language-guided segmentation bridges the label-efficiency gap and enables new levels of fine-grained generalization.
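The core mechanism the abstract describes, text embeddings acting as object queries that group pixels, can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed dimensions and module choices; the actual HVL architecture, query-generation procedure, and regularization losses are not specified here, and all names are illustrative.

```python
import torch
import torch.nn as nn

# Sketch: class-name text embeddings (e.g., from a frozen VLM text encoder)
# serve as object queries in a transformer decoder; each refined query then
# groups pixels via a dot product with per-pixel visual features.
class TextQuerySegHead(nn.Module):
    def __init__(self, dim: int = 256, text_dim: int = 512):
        super().__init__()
        self.query_proj = nn.Linear(text_dim, dim)  # map VLM text space -> decoder space
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.pixel_proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, dim, H, W) visual features; text_emb: (K, text_dim), one row per class
        B, C, H, W = feats.shape
        queries = self.query_proj(text_emb).unsqueeze(0).expand(B, -1, -1)  # (B, K, dim)
        memory = feats.flatten(2).transpose(1, 2)                           # (B, H*W, dim)
        refined = self.decoder(queries, memory)      # text-spatial query alignment
        pixels = self.pixel_proj(feats).flatten(2)                          # (B, dim, H*W)
        logits = torch.einsum("bkc,bcn->bkn", refined, pixels)              # per-class masks
        return logits.view(B, -1, H, W)

head = TextQuerySegHead()
masks = head(torch.randn(2, 256, 32, 32), torch.randn(21, 512))
print(masks.shape)  # torch.Size([2, 21, 32, 32])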
Related papers
- Vision-Language Semantic Grounding for Multi-Domain Crop-Weed Segmentation [6.56750055693609]
Fine-grained crop-weed segmentation is essential for enabling targeted herbicide application in precision agriculture. Existing deep learning models struggle to generalize due to reliance on dataset-specific visual features. We propose Vision-Language Weed (VL-WS), a novel framework that addresses this limitation by grounding pixel-level segmentation in semantically aligned, domain-invariant representations.
arXiv Detail & Related papers (2026-02-27T04:53:18Z) - LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation [12.192429756057132]
Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories. LoGoSeg integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context.
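A hypothetical sketch of the "object existence prior" idea: global image-text similarity gates the per-pixel class logits so that categories unlikely to appear in the image are suppressed. The tensor shapes and the gating form (sigmoid of scaled cosine similarity) are assumptions, not LoGoSeg's exact design.

```python
import torch
import torch.nn.functional as F

def gate_logits(pixel_logits, img_emb, text_embs, tau=0.07):
    # pixel_logits: (B, K, H, W); img_emb: (B, D); text_embs: (K, D)
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(text_embs, dim=-1)
    sim = img @ txt.t() / tau              # (B, K) global image-text similarity
    prior = torch.sigmoid(sim)             # per-class existence weight in (0, 1)
    return pixel_logits * prior[:, :, None, None]

out = gate_logits(torch.randn(2, 20, 16, 16), torch.randn(2, 512), torch.randn(20, 512))
print(out.shape)  # torch.Size([2, 20, 16, 16])
```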
arXiv Detail & Related papers (2026-02-05T12:03:11Z) - Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning [82.39668822222386]
Vision token pruning has proven to be an effective acceleration technique for efficient Vision Language Models (VLMs). We propose Nüwa, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. Experiments demonstrate that Nüwa achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).
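For context, a generic sketch of vision token pruning: rank patch tokens by their attention to the [CLS] token and keep the top-k. This only illustrates the technique the abstract names; Nüwa's two-stage, spatially-aware scheme is more involved and is not reproduced here.

```python
import torch

def prune_tokens(tokens, cls_attn, keep_ratio=0.5):
    # tokens: (B, N, D) patch tokens; cls_attn: (B, N) attention from [CLS]
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices                      # (B, k) kept positions
    return tokens.gather(1, idx.unsqueeze(-1).expand(B, k, D)) # (B, k, D)

kept = prune_tokens(torch.randn(2, 196, 768), torch.rand(2, 196), keep_ratio=0.25)
print(kept.shape)  # torch.Size([2, 49, 768])
```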
arXiv Detail & Related papers (2026-02-03T00:51:03Z) - FedSaaS: Class-Consistency Federated Semantic Segmentation via Global Prototype Supervision and Local Adversarial Harmonization [14.90727017126931]
Federated semantic segmentation enables pixel-level classification in images through collaborative learning. We propose a novel framework, termed FedSaaS, that enforces class consistency. Our framework significantly improves average segmentation accuracy and effectively addresses the class-consistency representation problem.
arXiv Detail & Related papers (2025-05-14T13:38:30Z) - LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation [9.759008308251127]
Zero-shot referring image segmentation aims to locate and segment the target region based on a referring expression. Previous works address this challenge by utilizing Vision-Language Models and mask proposal networks for region-text matching. We present LGD (Leveraging Generative Descriptions), a framework that utilizes the advanced language generation capabilities of Multi-Modal Large Language Models.
arXiv Detail & Related papers (2025-04-20T02:51:11Z) - Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation [82.95830628372845]
This paper introduces a collaborative vision-text optimizing mechanism within the Open-Vocabulary Segmentation (OVS) field; to the best of our knowledge, it is the first such mechanism in OVS. In open-vocabulary semantic segmentation, our method outperforms the previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4 and +1.1 mIoU, respectively.
arXiv Detail & Related papers (2024-08-01T17:48:08Z) - 3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting [12.770499009990864]
We propose a visual Semantic-Spatial Self-Highlighting Network (termed 3SHNet) for high-precision, high-efficiency and high-generalization image-sentence retrieval.
3SHNet highlights the salient identification of prominent objects and their spatial locations within the visual modality.
Experiments conducted on MS-COCO and Flickr30K benchmarks substantiate the superior performances, inference efficiency and generalization of the proposed 3SHNet.
arXiv Detail & Related papers (2024-04-26T09:25:18Z) - Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation [114.72734384299476]
We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.
We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings.
Our approach significantly boosts the capacity of segmentation models for unseen classes.
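An illustrative sketch of using class (text) embeddings as fixed anchors: per-pixel features are classified by cosine similarity to the anchors, so standard cross-entropy training steers vision features toward their class embedding. The temperature and shapes are assumptions, not LDVC's exact recipe.

```python
import torch
import torch.nn.functional as F

def anchor_logits(pix_feats, class_embs, tau=0.1):
    # pix_feats: (B, D, H, W); class_embs: (K, D) frozen text anchors
    p = F.normalize(pix_feats, dim=1)
    c = F.normalize(class_embs, dim=1)
    return torch.einsum("bdhw,kd->bkhw", p, c) / tau  # (B, K, H, W) logits

logits = anchor_logits(torch.randn(2, 512, 16, 16), torch.randn(21, 512))
loss = F.cross_entropy(logits, torch.randint(0, 21, (2, 16, 16)))
print(loss.item())
```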
arXiv Detail & Related papers (2024-03-13T11:23:55Z) - A Lightweight Clustering Framework for Unsupervised Semantic Segmentation [28.907274978550493]
Unsupervised semantic segmentation aims to categorize each pixel in an image into a corresponding class without the use of annotated data.
We propose a lightweight clustering framework for unsupervised semantic segmentation.
Our framework achieves state-of-the-art results on PASCAL VOC and MS COCO datasets.
arXiv Detail & Related papers (2023-11-30T15:33:42Z) - SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance [97.00445262074595]
In SemiVL, we propose to integrate rich priors from vision-language models into semi-supervised semantic segmentation.
We design a language-guided decoder to jointly reason over vision and language.
We evaluate SemiVL on 4 semantic segmentation datasets, where it significantly outperforms previous semi-supervised methods.
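For background, a sketch of the standard teacher-student pseudo-labeling recipe that semi-supervised methods like SemiVL build on: confident teacher predictions on unlabeled images supervise the student. SemiVL's language-guided decoder itself is not reproduced here, and the confidence threshold is an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(student_logits, teacher_logits, conf_thresh=0.95):
    probs = teacher_logits.softmax(dim=1)     # (B, K, H, W)
    conf, labels = probs.max(dim=1)           # (B, H, W) confidence and pseudo-labels
    loss = F.cross_entropy(student_logits, labels, reduction="none")
    mask = (conf >= conf_thresh).float()      # keep only confident pixels
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)

l = pseudo_label_loss(torch.randn(2, 21, 16, 16), torch.randn(2, 21, 16, 16))
print(l.item())
```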
arXiv Detail & Related papers (2023-11-27T19:00:06Z) - Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z) - Exploring Part-Informed Visual-Language Learning for Person Re-Identification [52.92511980835272]
We propose Part-Informed Visual-language Learning (π-VL) to enhance fine-grained visual features with part-informed language supervision for ReID tasks. π-VL introduces a human parsing-guided prompt tuning strategy and a hierarchical visual-language alignment paradigm to ensure within-part feature semantic consistency. As a plug-and-play and inference-free solution, π-VL achieves performance comparable to or better than state-of-the-art methods on four commonly used ReID benchmarks.
arXiv Detail & Related papers (2023-08-04T23:13:49Z) - Removing supervision in semantic segmentation with local-global matching
and area balancing [0.0]
We design a novel end-to-end model leveraging local-global patch matching to predict the category, localization, area, and shape of objects for semantic segmentation.
Our model attains state-of-the-art in Weakly Supervised Semantic Segmentation using only image-level labels, with 75% mIoU on the PascalVOC2012 val set and 46% on the MS-COCO2014 val set.
We also attain state-of-the-art in Unsupervised Semantic Segmentation with 43.6% mIoU on the PascalVOC2012 val set and 19.4% on the MS-COCO2014 val set.
arXiv Detail & Related papers (2023-03-30T14:27:42Z) - Deep Hierarchical Semantic Segmentation [76.40565872257709]
Hierarchical semantic segmentation (HSS) aims at a structured, pixel-wise description of visual observations in terms of a class hierarchy.
HSSN casts HSS as a pixel-wise multi-label classification task, only bringing minimal architecture change to current segmentation models.
With hierarchy-induced margin constraints, HSSN reshapes the pixel embedding space, so as to generate well-structured pixel representations.
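A minimal sketch of hierarchy-aware training in this spirit: each pixel is a positive for its leaf class and all ancestors (multi-label BCE), and a margin term encourages a parent's score to stay above its child's. The margin form here is an assumption; HSSN's exact constraints differ in detail.

```python
import torch
import torch.nn.functional as F

def hss_loss(logits, targets, parent, margin=0.1):
    # logits/targets: (B, K, H, W); targets already include ancestor labels.
    # parent[k] = index of class k's parent, or -1 for root classes.
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    probs = logits.sigmoid()
    reg = 0.0
    for k, pk in enumerate(parent):
        if pk >= 0:  # parent probability should exceed the child's by a margin
            reg = reg + F.relu(probs[:, k] - probs[:, pk] + margin).mean()
    return bce + reg

t = torch.zeros(2, 3, 8, 8); t[:, 0] = 1; t[:, 1] = 1  # leaf class 1 under root 0
print(hss_loss(torch.randn(2, 3, 8, 8), t, parent=[-1, 0, 0]).item())
```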
arXiv Detail & Related papers (2022-03-27T15:47:44Z) - Scaling up Multi-domain Semantic Segmentation with Sentence Embeddings [81.09026586111811]
We propose an approach to semantic segmentation that achieves state-of-the-art supervised performance when applied in a zero-shot setting.
This is achieved by replacing each class label with a vector-valued embedding of a short paragraph that describes the class.
The resulting merged semantic segmentation dataset of over 2 million images enables training a model that achieves performance equal to that of state-of-the-art supervised methods on 7 benchmark datasets.
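A sketch of what replacing discrete class labels with description embeddings enables: a pixel is assigned the class whose description embedding is nearest in the shared space, which is what lets differently-labelled datasets be merged. The encoder choice and dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def classify_pixels(pix_feats, desc_embs):
    # pix_feats: (B, D, H, W); desc_embs: (K, D) embeddings of class descriptions
    sims = torch.einsum("bdhw,kd->bkhw",
                        F.normalize(pix_feats, dim=1),
                        F.normalize(desc_embs, dim=1))
    return sims.argmax(dim=1)          # (B, H, W) predicted class indices

pred = classify_pixels(torch.randn(2, 512, 16, 16), torch.randn(7, 512))
print(pred.shape)  # torch.Size([2, 16, 16])
```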
arXiv Detail & Related papers (2022-02-04T07:19:09Z) - Hierarchical Modular Network for Video Captioning [162.70349114104107]
We propose a hierarchical modular network to bridge video representations and linguistic semantics from three levels before generating captions.
The proposed method performs favorably against state-of-the-art models on two widely used benchmarks, with CIDEr scores of 104.0% on MSVD and 51.5% on MSR-VTT.
arXiv Detail & Related papers (2021-11-24T13:07:05Z) - HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning [74.76431541169342]
Zero-shot learning (ZSL) tackles the unseen class recognition problem, transferring semantic knowledge from seen classes to unseen ones.
We propose a novel hierarchical semantic-visual adaptation (HSVA) framework to align semantic and visual domains.
Experiments on four benchmark datasets demonstrate HSVA achieves superior performance on both conventional and generalized ZSL.
arXiv Detail & Related papers (2021-09-30T14:27:50Z) - Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation [25.070027668717422]
Generalized zero-shot semantic segmentation (GZS3) predicts pixel-wise semantic labels for seen and unseen classes.
Most GZS3 methods adopt a generative approach that synthesizes visual features of unseen classes from corresponding semantic ones.
We propose a discriminative approach that addresses these limitations in a unified framework.
arXiv Detail & Related papers (2021-08-14T13:33:58Z)