Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation
- URL: http://arxiv.org/abs/2501.17642v1
- Date: Wed, 29 Jan 2025 13:24:53 GMT
- Title: Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation
- Authors: Lin Chen, Qi Yang, Kun Ding, Zhihao Li, Gang Shen, Fei Li, Qiyuan Cao, Shiming Xiang
- Abstract summary: Open-vocabulary semantic segmentation (OVSS) is an open-world task that aims to assign each pixel within an image to a specific class defined by arbitrary text descriptions. Recent advancements in large-scale vision-language models have demonstrated their open-vocabulary understanding capabilities. This study introduces ERR-Seg, a novel framework that effectively reduces redundancy to balance accuracy and efficiency.
- Score: 36.46163240168576
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-vocabulary semantic segmentation (OVSS) is an open-world task that aims to assign each pixel within an image to a specific class defined by arbitrary text descriptions. Recent advancements in large-scale vision-language models have demonstrated their open-vocabulary understanding capabilities, significantly facilitating the development of OVSS. However, most existing methods suffer from either suboptimal performance or long latency. This study introduces ERR-Seg, a novel framework that effectively reduces redundancy to balance accuracy and efficiency. ERR-Seg incorporates a training-free Channel Reduction Module (CRM) that leverages prior knowledge from vision-language models like CLIP to identify the most relevant classes while discarding others. Moreover, it incorporates Efficient Semantic Context Fusion (ESCF) with spatial-level and class-level sequence reduction strategies. CRM and ESCF result in substantial memory and computational savings without compromising accuracy. Additionally, recognizing the significance of hierarchical semantics extracted from middle-layer features for closed-set semantic segmentation, ERR-Seg introduces the Hierarchical Semantic Module (HSM) to exploit hierarchical semantics in the context of OVSS. Compared to previous state-of-the-art methods under the ADE20K-847 setting, ERR-Seg achieves +$5.6\%$ mIoU improvement and reduces latency by $67.3\%$.
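The class-selection idea behind the Channel Reduction Module can be illustrated with a small sketch. The abstract does not say how classes are scored, so the rule used here (rank each class by its maximum cosine similarity to any pixel embedding, then keep the top k) is an assumption, and the names `cosine` and `select_top_k_classes` are hypothetical rather than taken from ERR-Seg.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_top_k_classes(pixel_embeddings, class_embeddings, k):
    """Return the sorted indices of the k classes judged most relevant to the image.

    Assumption (not from the paper): a class's relevance is its maximum
    cosine similarity to any pixel embedding. No training is involved;
    the function only ranks and truncates the class list.
    """
    scored = []
    for class_index, text_vec in enumerate(class_embeddings):
        best = max(cosine(pixel, text_vec) for pixel in pixel_embeddings)
        scored.append((best, class_index))
    # Highest score first; ties are broken by the lower class index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return sorted(class_index for _, class_index in scored[:k])


# Toy data: two pixel embeddings and four candidate class embeddings.
pixels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
classes = [
    [1.0, 0.0, 0.0],  # matches the first pixel exactly
    [0.0, 0.0, 1.0],  # orthogonal to both pixels
    [0.0, 1.0, 0.0],  # matches the second pixel exactly
    [1.0, 1.0, 0.0],  # partially matches both pixels
]
kept = select_top_k_classes(pixels, classes, k=2)
```

In this toy case the two exactly matching classes are kept and the rest are discarded, so any later per-class computation would run over 2 channels instead of 4.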
Related papers
- CSE-SFP: Enabling Unsupervised Sentence Representation Learning via a Single Forward Pass [3.0566617373924325]
Recent advances in pre-trained language models (PLMs) have driven remarkable progress in this field.
We propose CSE-SFP, an innovative method that exploits the structural characteristics of generative models.
We show that CSE-SFP not only produces higher-quality embeddings but also significantly reduces both training time and memory consumption.
arXiv Detail & Related papers (2025-05-01T08:27:14Z) - Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder.
Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder.
Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z) - LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding [4.759109475818876]
Implicit Neural Representations (INRs) are proving to be a powerful paradigm in unifying task modeling across diverse data domains.
We introduce LIFT, a novel, high-performance framework that captures multiscale information through meta-learning.
We also introduce ReLIFT, an enhanced variant of LIFT that incorporates residual connections and expressive frequency encodings.
arXiv Detail & Related papers (2025-03-19T17:00:58Z) - Interpreting CLIP with Hierarchical Sparse Autoencoders [8.692675181549117]
Matryoshka SAE (MSAE) learns hierarchical representations at multiple granularities simultaneously.
MSAE establishes a new state-of-the-art frontier between reconstruction quality and sparsity for CLIP.
arXiv Detail & Related papers (2025-02-27T22:39:13Z) - Uncertainty-Participation Context Consistency Learning for Semi-supervised Semantic Segmentation [9.546065701435532]
Semi-supervised semantic segmentation has attracted considerable attention for its ability to mitigate the reliance on extensive labeled data. This paper proposes the Uncertainty-participation Context Consistency Learning (UCCL) method to explore richer supervisory signals.
arXiv Detail & Related papers (2024-12-23T06:49:59Z) - ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties.
We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block.
The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
arXiv Detail & Related papers (2024-11-24T14:14:14Z) - IRS-Enhanced Secure Semantic Communication Networks: Cross-Layer and Context-Awared Resource Allocation [30.000606717755833]
The challenge of eavesdropping poses a formidable threat to semantic privacy due to the open nature of wireless communications.
In this paper, intelligent reflective surface (IRS)-enhanced secure semantic communication (IRS-SSC) is proposed to guarantee the physical layer security from a task-oriented semantic perspective.
We propose a novel semantic-aware state space (SCA-SS) to fuse the high-dimensional semantic space and the observable system state space.
arXiv Detail & Related papers (2024-11-04T05:40:30Z) - IncSAR: A Dual Fusion Incremental Learning Framework for SAR Target Recognition [13.783950035836593]
IncSAR is an incremental learning framework designed to tackle catastrophic forgetting in target recognition. To mitigate the speckle noise inherent in SAR images, we employ a denoising module based on a neural network approximation. Experiments on the MSTAR, SAR-AIRcraft-1.0, and OpenSARShip benchmark datasets demonstrate that IncSAR significantly outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2024-10-08T08:49:47Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens [57.37893387775829]
We introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC).
SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner.
We propose a versatile vision backbone, SECViT, to serve as a vision language connector.
arXiv Detail & Related papers (2024-05-22T04:49:00Z) - Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S²RM to achieve high-quality cross-modality fusion.
It follows a working strategy of trilogy: distributing language feature, spatial semantic recurrent coparsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segmentation segments an image from a language expression.
We develop an algorithm that shifts from being localization-centric to segmentation-centric.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z) - SLLEN: Semantic-aware Low-light Image Enhancement Network [92.80325772199876]
We develop a semantic-aware LLE network (SLLEN) composed of an LLE main-network (LLEmN) and an SS auxiliary-network (SSaN).
Unlike currently available approaches, the proposed SLLEN is able to fully leverage the semantic information, e.g., IEF, HSF, and SS dataset, to assist LLE.
Comparisons between the proposed SLLEN and other state-of-the-art techniques demonstrate the superiority of SLLEN with respect to LLE quality.
arXiv Detail & Related papers (2022-11-21T15:29:38Z)