Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation
- URL: http://arxiv.org/abs/2501.17642v1
- Date: Wed, 29 Jan 2025 13:24:53 GMT
- Title: Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation
- Authors: Lin Chen, Qi Yang, Kun Ding, Zhihao Li, Gang Shen, Fei Li, Qiyuan Cao, Shiming Xiang,
- Abstract summary: Open-vocabulary semantic segmentation (OVSS) is an open-world task that aims to assign each pixel within an image to a specific class defined by arbitrary text descriptions.
Recent advancements in large-scale vision-language models have demonstrated their open-vocabulary understanding capabilities.
This study introduces ERR-Seg, a novel framework that effectively reduces redundancy to balance accuracy and efficiency.
- Score: 36.46163240168576
- License:
- Abstract: Open-vocabulary semantic segmentation (OVSS) is an open-world task that aims to assign each pixel within an image to a specific class defined by arbitrary text descriptions. Recent advancements in large-scale vision-language models have demonstrated their open-vocabulary understanding capabilities, significantly facilitating the development of OVSS. However, most existing methods suffer from either suboptimal performance or long latency. This study introduces ERR-Seg, a novel framework that effectively reduces redundancy to balance accuracy and efficiency. ERR-Seg incorporates a training-free Channel Reduction Module (CRM) that leverages prior knowledge from vision-language models like CLIP to identify the most relevant classes while discarding others. Moreover, it incorporates Efficient Semantic Context Fusion (ESCF) with spatial-level and class-level sequence reduction strategies. CRM and ESCF result in substantial memory and computational savings without compromising accuracy. Additionally, recognizing the significance of hierarchical semantics extracted from middle-layer features for closed-set semantic segmentation, ERR-Seg introduces the Hierarchical Semantic Module (HSM) to exploit hierarchical semantics in the context of OVSS. Compared to previous state-of-the-art methods under the ADE20K-847 setting, ERR-Seg achieves +$5.6\%$ mIoU improvement and reduces latency by $67.3\%$.
Related papers
- Uncertainty-Participation Context Consistency Learning for Semi-supervised Semantic Segmentation [9.546065701435532]
Semi-supervised semantic segmentation has attracted considerable attention for its ability to mitigate the reliance on extensive labeled data.
This paper proposes the Uncertainty-participation Context Consistency Learning (UCCL) method to explore richer supervisory signals.
arXiv Detail & Related papers (2024-12-23T06:49:59Z) - ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties.
We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block.
The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
arXiv Detail & Related papers (2024-11-24T14:14:14Z) - IRS-Enhanced Secure Semantic Communication Networks: Cross-Layer and Context-Awared Resource Allocation [30.000606717755833]
The challenge of eavesdropping poses a formidable threat to semantic privacy due to the open nature of wireless communications.
In this paper, intelligent reflective surface (IRS)-enhanced secure semantic communication (IRS-SSC) is proposed to guarantee the physical layer security from a task-oriented semantic perspective.
We propose a novel semantic awared state space (SCA-SS) to fusion the high-dimensional semantic space and the observable system state space.
arXiv Detail & Related papers (2024-11-04T05:40:30Z) - IncSAR: A Dual Fusion Incremental Learning Framework for SAR Target Recognition [13.783950035836593]
IncSAR is an incremental learning framework designed to tackle catastrophic forgetting in target recognition.
To mitigate the speckle noise inherent in SAR images, we employ a denoising module based on a neural network approximation.
Experiments on the MSTAR, SAR-AIRcraft-1.0, and OpenSARShip benchmark datasets demonstrate that IncSAR significantly outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2024-10-08T08:49:47Z) - Generalization Boosted Adapter for Open-Vocabulary Segmentation [15.91026999425076]
Generalization Boosted Adapter (GBA) is a novel adapter strategy that enhances the generalization and robustness of vision-language models.
As a simple, efficient, and plug-and-play component, GBA can be flexibly integrated into various CLIP-based methods.
arXiv Detail & Related papers (2024-09-13T01:49:12Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens [57.37893387775829]
We introduce a fast and balanced clustering method, named textbfSemantic textbfEquitable textbfClustering (SEC)
SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner.
We propose a versatile vision backbone, SECViT, to serve as a vision language connector.
arXiv Detail & Related papers (2024-05-22T04:49:00Z) - Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose Stextsuperscript2RM to achieve high-quality cross-modality fusion.
It follows a working strategy of trilogy: distributing language feature, spatial semantic recurrent coparsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Semantics-Aware Dynamic Localization and Refinement for Referring Image
Segmentation [102.25240608024063]
Referring image segments an image from a language expression.
We develop an algorithm that shifts from being localization-centric to segmentation-language.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.