Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network
- URL: http://arxiv.org/abs/2304.01198v2
- Date: Mon, 7 Aug 2023 06:24:13 GMT
- Title: Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network
- Authors: Cong Han, Yujie Zhong, Dengjie Li, Kai Han, Lin Ma
- Abstract summary: We propose a network that only needs a single pass through the visual-language model for each input image.
We first propose a novel network adaptation approach, termed patch severance, to restrict the harmful interference between the patch embeddings in the pre-trained visual encoder.
We then propose classification anchor learning to encourage the network to spatially focus on more discriminative features for classification.
- Score: 26.97153244517095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the open-vocabulary semantic segmentation problem has attracted
increasing attention and the best performing methods are based on two-stream
networks: one stream for proposal mask generation and the other for segment
classification using a pretrained visual-language model. However, existing
two-stream methods require passing a great number of (up to a hundred) image
crops into the visual-language model, which is highly inefficient. To address
the problem, we propose a network that only needs a single pass through the
visual-language model for each input image. Specifically, we first propose a
novel network adaptation approach, termed patch severance, to restrict the
harmful interference between the patch embeddings in the pre-trained visual
encoder. We then propose classification anchor learning to encourage the
network to spatially focus on more discriminative features for classification.
Extensive experiments demonstrate that the proposed method achieves outstanding
performance, surpassing state-of-the-art methods while being 4 to 7 times
faster at inference. Code: https://github.com/CongHan0808/DeOP.git
Related papers
- Side Adapter Network for Open-Vocabulary Semantic Segmentation [69.18441687386733]
This paper presents a new framework for open-vocabulary semantic segmentation with the pre-trained vision-language model, named Side Adapter Network (SAN)
A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias.
Our approach significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed.
arXiv Detail & Related papers (2023-02-23T18:58:28Z) - Dynamic Prototype Mask for Occluded Person Re-Identification [88.7782299372656]
Existing methods mainly address this issue by employing body clues provided by an extra network to distinguish the visible part.
We propose a novel Dynamic Prototype Mask (DPM) based on two self-evident prior knowledge.
Under this condition, the occluded representation could be well aligned in a selected subspace spontaneously.
arXiv Detail & Related papers (2022-07-19T03:31:13Z) - What You See is What You Classify: Black Box Attributions [61.998683569022006]
We train a deep network, the Explainer, to predict attributions for a pre-trained black-box classifier, the Explanandum.
Unlike most existing approaches, ours is capable of directly generating very distinct class-specific masks.
We show that our attributions are superior to established methods both visually and quantitatively.
arXiv Detail & Related papers (2022-05-23T12:30:04Z) - Compare learning: bi-attention network for few-shot learning [6.559037166322981]
One of the Few-shot learning methods called metric learning addresses this challenge by first learning a deep distance metric to determine whether a pair of images belong to the same category.
In this paper, we propose a novel approach named Bi-attention network to compare the instances, which can measure the similarity between embeddings of instances precisely, globally and efficiently.
arXiv Detail & Related papers (2022-03-25T07:39:10Z) - Joint Inductive and Transductive Learning for Video Object Segmentation [107.32760625159301]
Semi-supervised object segmentation is a task of segmenting the target object in a video sequence given only a mask in the first frame.
Most previous best-performing methods adopt matching-based transductive reasoning or online inductive learning.
We propose to integrate transductive and inductive learning into a unified framework to exploit complement between them for accurate and robust video object segmentation.
arXiv Detail & Related papers (2021-08-08T16:25:48Z) - Distribution Alignment: A Unified Framework for Long-tail Visual
Recognition [52.36728157779307]
We propose a unified distribution alignment strategy for long-tail visual recognition.
We then introduce a generalized re-weight method in the two-stage learning to balance the class prior.
Our approach achieves the state-of-the-art results across all four recognition tasks with a simple and unified framework.
arXiv Detail & Related papers (2021-03-30T14:09:53Z) - Train a One-Million-Way Instance Classifier for Unsupervised Visual
Representation Learning [45.510042484456854]
This paper presents a simple unsupervised visual representation learning method with a pretext task of discriminating all images in a dataset using a parametric, instance-level computation.
The overall framework is a replica of a supervised classification model, where semantic classes (e.g., dog, bird, and ship) are replaced by instance IDs.
scaling up the classification task from thousands of semantic labels to millions of instance labels brings specific challenges including 1) the large-scale softmax classifier; 2) the slow convergence due to the infrequent visiting of instance samples; and 3) the massive number of negative classes that can be noisy.
arXiv Detail & Related papers (2021-02-09T14:44:18Z) - Find it if You Can: End-to-End Adversarial Erasing for Weakly-Supervised
Semantic Segmentation [6.326017213490535]
We propose a novel formulation of adversarial erasing of the attention maps.
The proposed solution does not require saliency masks, instead it uses a regularization loss to prevent the attention maps from spreading to less discriminative object regions.
Our experiments on the Pascal VOC dataset demonstrate that our adversarial approach increases segmentation performance by 2.1 mIoU compared to our baseline and by 1.0 mIoU compared to previous adversarial erasing approaches.
arXiv Detail & Related papers (2020-11-09T18:35:35Z) - CRNet: Cross-Reference Networks for Few-Shot Segmentation [59.85183776573642]
Few-shot segmentation aims to learn a segmentation model that can be generalized to novel classes with only a few training images.
With a cross-reference mechanism, our network can better find the co-occurrent objects in the two images.
Experiments on the PASCAL VOC 2012 dataset show that our network achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-03-24T04:55:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.