ConSept: Continual Semantic Segmentation via Adapter-based Vision
Transformer
- URL: http://arxiv.org/abs/2402.16674v1
- Date: Mon, 26 Feb 2024 15:51:45 GMT
- Title: ConSept: Continual Semantic Segmentation via Adapter-based Vision
Transformer
- Authors: Bowen Dong, Guanglei Yang, Wangmeng Zuo, Lei Zhang
- Abstract summary: We propose Continual semantic Segmentation via Adapter-based ViT, namely ConSept.
ConSept integrates lightweight attention-based adapters into vanilla ViTs.
We propose two key strategies: distillation with a deterministic old-classes boundary for improved anti-catastrophic forgetting, and dual dice losses to regularize segmentation maps.
- Score: 65.32312196621938
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we delve into the realm of vision transformers for continual
semantic segmentation, a problem that has not been sufficiently explored in
previous literature. Empirical investigations on the adaptation of existing
frameworks to vanilla ViT reveal that incorporating visual adapters into ViTs
or fine-tuning ViTs with distillation terms is advantageous for enhancing the
segmentation capability of novel classes. These findings motivate us to propose
Continual semantic Segmentation via Adapter-based ViT, namely ConSept. Within
the simplified architecture of ViT with linear segmentation head, ConSept
integrates lightweight attention-based adapters into vanilla ViTs. Capitalizing
on the feature adaptation abilities of these adapters, ConSept not only retains
superior segmentation ability for old classes, but also attains promising
segmentation quality for novel classes. To further harness the intrinsic
anti-catastrophic forgetting ability of ConSept and concurrently enhance the
segmentation capabilities for both old and new classes, we propose two key
strategies: distillation with a deterministic old-classes boundary for improved
anti-catastrophic forgetting, and dual dice losses to regularize segmentation
maps, thereby improving overall segmentation performance. Extensive experiments
show the effectiveness of ConSept on multiple continual semantic segmentation
benchmarks under overlapped or disjoint settings. Code will be publicly
available at https://github.com/DongSky/ConSept.
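To make the two ingredients named in the abstract more concrete, below is a minimal PyTorch sketch of (i) a lightweight attention-based adapter added residually to ViT tokens and (ii) a soft dice loss of the kind that can regularize segmentation maps. The class names, bottleneck width, and loss details are illustrative assumptions, not the authors' implementation, which is to be released at the GitHub link above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionAdapter(nn.Module):
    """Bottleneck adapter with single-head self-attention inside,
    added residually so the original ViT path is preserved.
    (Illustrative design; dimensions are assumptions.)"""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.attn = nn.MultiheadAttention(bottleneck, num_heads=1, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) patch tokens from a ViT block.
        h = self.down(tokens)
        h, _ = self.attn(h, h, h)
        return tokens + self.up(h)


def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft dice loss between per-class probability maps and integer labels.

    logits: (B, C, H, W); target: (B, H, W) with values in [0, C).
    """
    probs = logits.softmax(dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()


# Tiny smoke test with random tensors.
if __name__ == "__main__":
    adapter = AttentionAdapter(dim=384)
    tokens = torch.randn(2, 196, 384)        # 14x14 patch grid, ViT-S width
    print(adapter(tokens).shape)             # torch.Size([2, 196, 384])
    logits = torch.randn(2, 21, 32, 32)      # e.g. 21 classes (Pascal VOC)
    labels = torch.randint(0, 21, (2, 32, 32))
    print(dice_loss(logits, labels).item())
```

In a continual step, the frozen ViT plus adapters would be trained on the new classes while the dice and distillation terms constrain the maps for old classes; that training wiring, including the deterministic old-classes boundary, is not shown in this sketch.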
Related papers
- Upsampling DINOv2 features for unsupervised vision tasks and weakly supervised materials segmentation [0.0]
Self-supervised vision transformers (ViTs) contain strong semantic and positional information relevant to downstream tasks like object localization and segmentation.
Recent works combine these features with traditional methods like clustering, graph partitioning or region correlations to achieve impressive baselines without finetuning or training additional networks.
arXiv Detail & Related papers (2024-10-20T13:01:53Z) - ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning [54.68180752416519]
Panoptic segmentation is a cutting-edge computer vision task.
We introduce a novel and efficient method for continual panoptic segmentation based on Visual Prompt Tuning, dubbed ECLIPSE.
Our approach involves freezing the base model parameters and fine-tuning only a small set of prompt embeddings, addressing both catastrophic forgetting and plasticity.
arXiv Detail & Related papers (2024-03-29T11:31:12Z) - Semantic Segmentation using Vision Transformers: A survey [0.0]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) are the dominant architectures for semantic segmentation.
Although ViTs have proven successful in image classification, they cannot be directly applied to dense prediction tasks such as image segmentation and object detection.
This survey aims to review and compare the performances of ViT architectures designed for semantic segmentation using benchmarking datasets.
arXiv Detail & Related papers (2023-05-05T04:11:00Z) - Betrayed by Captions: Joint Caption Grounding and Generation for Open
Vocabulary Instance Segmentation [80.48979302400868]
We focus on open vocabulary instance segmentation to expand a segmentation model to classify and segment instance-level novel categories.
Previous approaches have relied on massive caption datasets and complex pipelines to establish one-to-one mappings between image regions and captions in nouns.
We devise a joint Caption Grounding and Generation (CGG) framework, which incorporates a novel grounding loss that focuses only on matching objects to improve learning efficiency.
arXiv Detail & Related papers (2023-01-02T18:52:12Z) - Representation Separation for Semantic Segmentation with Vision
Transformers [11.431694321563322]
Vision transformers (ViTs) encoding an image as a sequence of patches bring new paradigms for semantic segmentation.
We present an efficient framework of representation separation in local-patch level and global-region level for semantic segmentation with ViTs.
arXiv Detail & Related papers (2022-12-28T09:54:52Z) - SegViT: Semantic Segmentation with Plain Vision Transformers [91.50075506561598]
We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation.
We propose the Attention-to-Mask (ATM) module, in which similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks.
Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone (a minimal sketch of the token-to-mask idea appears after this list).
arXiv Detail & Related papers (2022-10-12T00:30:26Z) - Smoothing Matters: Momentum Transformer for Domain Adaptive Semantic
Segmentation [48.7190017311309]
We find that straightforwardly applying local ViTs in domain adaptive semantic segmentation does not bring the expected improvement.
High-frequency components in the target-domain features and pseudo labels make the training of local ViTs very unsmooth and hurt their transferability.
In this paper, we introduce a low-pass filtering mechanism, momentum network, to smooth the learning dynamics of target domain features and pseudo labels.
arXiv Detail & Related papers (2022-03-15T15:20:30Z) - Segmenter: Transformer for Semantic Segmentation [79.9887988699159]
We introduce Segmenter, a transformer model for semantic segmentation.
We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation.
It outperforms the state of the art on the challenging ADE20K dataset and performs on-par on Pascal Context and Cityscapes.
arXiv Detail & Related papers (2021-05-12T13:01:44Z)
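For the SegViT entry above, the following sketch illustrates how similarity maps between learnable class tokens and ViT patch features can be read out directly as per-class segmentation masks. It is a rough illustration based only on the abstract summary quoted here; the class names, shapes, and scaling are hypothetical, not the SegViT code.

```python
import torch
import torch.nn as nn


class TokenToMaskHead(nn.Module):
    """Reads per-class masks off class-token / patch-feature similarities."""

    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        self.class_tokens = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.scale = dim ** -0.5

    def forward(self, patch_feats: torch.Tensor, hw: tuple) -> torch.Tensor:
        # patch_feats: (B, N, dim) ViT patch features with N = H * W.
        B, N, _ = patch_feats.shape
        H, W = hw
        # Scaled dot-product similarity between every class token and location.
        sim = torch.einsum("cd,bnd->bcn", self.class_tokens, patch_feats) * self.scale
        # The similarity map itself becomes the (soft) segmentation mask.
        return sim.sigmoid().reshape(B, -1, H, W)


if __name__ == "__main__":
    head = TokenToMaskHead(num_classes=150, dim=384)
    feats = torch.randn(2, 196, 384)        # 14x14 patch grid
    print(head(feats, (14, 14)).shape)      # torch.Size([2, 150, 14, 14])
```

Only the mask read-out is sketched; classification of the class tokens and multi-layer aggregation used by the full method are omitted.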