OVS Meets Continual Learning: Towards Sustainable Open-Vocabulary Segmentation
- URL: http://arxiv.org/abs/2410.11536v2
- Date: Mon, 13 Oct 2025 11:59:16 GMT
- Title: OVS Meets Continual Learning: Towards Sustainable Open-Vocabulary Segmentation
- Authors: Dongjun Hwang, Yejin Kim, Minyoung Lee, Seong Joon Oh, Junsuk Choe,
- Abstract summary: Open-Vocabulary (OVS) aims to segment classes that are not present in the training dataset.<n>We propose ConOVS, a novel continual learning method based on a Mixture-of-Experts framework.<n>We show that ConOVS consistently outperforms existing methods across pre-training, incremental, and zero-shot test datasets.
- Score: 26.018655577919617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-Vocabulary Segmentation (OVS) aims to segment classes that are not present in the training dataset. However, most existing studies assume that the training data is fixed in advance, overlooking more practical scenarios where new datasets are continuously collected over time. To address this, we first analyze how existing OVS models perform under such conditions. In this context, we explore several approaches such as retraining, fine-tuning, and continual learning but find that each of them has clear limitations. To address these issues, we propose ConOVS, a novel continual learning method based on a Mixture-of-Experts framework. ConOVS dynamically combines expert decoders based on the probability that an input sample belongs to the distribution of each incremental dataset. Through extensive experiments, we show that ConOVS consistently outperforms existing methods across pre-training, incremental, and zero-shot test datasets, effectively expanding the recognition capabilities of OVS models when data is collected sequentially.
Related papers
- Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting [70.83781268763215]
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training.<n>VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion.<n>This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems.
arXiv Detail & Related papers (2025-08-06T09:03:10Z) - ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation [18.666044903856363]
Training-free open-vocabulary semantic segmentation (OVS) aims to segment images given a set of arbitrary textual categories without costly model fine-tuning.<n>We introduce a data-quality-oriented framework, comprising a data pipeline to construct a reference set with well-paired segment-text embeddings and a simple similarity-based retrieval.<n>Our method outperforms all existing training-free OVS approaches, highlighting the importance of data-centric design for advancing OVS without training.
arXiv Detail & Related papers (2025-06-26T13:22:03Z) - Dual Consolidation for Pre-Trained Model-Based Domain-Incremental Learning [64.1745161657794]
Domain-Incremental Learning (DIL) involves the progressive adaptation of a model to new concepts across different domains.
Recent advances in pre-trained models provide a solid foundation for DIL.
However, learning new concepts often results in the catastrophic forgetting of pre-trained knowledge.
We propose DUal ConsolidaTion (Duct) to unify and consolidate historical knowledge.
arXiv Detail & Related papers (2024-10-01T17:58:06Z) - Embedding And Clustering Your Data Can Improve Contrastive Pretraining [0.0]
We explore extending training data stratification beyond source granularity by leveraging a pretrained text embedding model and the classic k-means clustering algorithm.
Experimentally, we observe a notable increase in NDCG@10 when pretraining a BERT-based text embedding model on query-passage pairs from the MSMARCO passage retrieval dataset.
arXiv Detail & Related papers (2024-07-26T17:36:40Z) - OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation [54.98688607911399]
We propose the task of open-vocabulary domain adaptation to infuse domain-specific knowledge into Vision-Language Models (VLMs)
Existing VLM adaptation methods improve performance on base (training) queries, but fail to preserve the open-set capabilities of VLMs on novel queries.
Our approach is the only parameter-efficient method that consistently surpasses the original VLM on novel classes.
arXiv Detail & Related papers (2024-05-30T15:16:06Z) - Contrastive Continual Multi-view Clustering with Filtered Structural
Fusion [57.193645780552565]
Multi-view clustering thrives in applications where views are collected in advance.
It overlooks scenarios where data views are collected sequentially, i.e., real-time data.
Some methods are proposed to handle it but are trapped in a stability-plasticity dilemma.
We propose Contrastive Continual Multi-view Clustering with Filtered Structural Fusion.
arXiv Detail & Related papers (2023-09-26T14:18:29Z) - Multivariate Prototype Representation for Domain-Generalized Incremental
Learning [35.83706574551515]
We design a DGCIL approach that remembers old classes, adapts to new classes, and can classify reliably objects from unseen domains.
Our loss formulation maintains classification boundaries and suppresses the domain-specific information of each class.
arXiv Detail & Related papers (2023-09-24T06:42:04Z) - CTP: Towards Vision-Language Continual Pretraining via Compatible
Momentum Contrast and Topology Preservation [128.00940554196976]
Vision-Language Continual Pretraining (VLCP) has shown impressive results on diverse downstream tasks by offline training on large-scale datasets.
To support the study of Vision-Language Continual Pretraining (VLCP), we first contribute a comprehensive and unified benchmark dataset P9D.
The data from each industry as an independent task supports continual learning and conforms to the real-world long-tail nature to simulate pretraining on web data.
arXiv Detail & Related papers (2023-08-14T13:53:18Z) - Multi-View Class Incremental Learning [57.14644913531313]
Multi-view learning (MVL) has gained great success in integrating information from multiple perspectives of a dataset to improve downstream task performance.
This paper investigates a novel paradigm called multi-view class incremental learning (MVCIL), where a single model incrementally classifies new classes from a continual stream of views.
arXiv Detail & Related papers (2023-06-16T08:13:41Z) - Continual Vision-Language Representation Learning with Off-Diagonal
Information [112.39419069447902]
Multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training.
This paper discusses the feasibility of continual CLIP training using streaming data.
arXiv Detail & Related papers (2023-05-11T08:04:46Z) - A Distinct Unsupervised Reference Model From The Environment Helps
Continual Learning [5.332329421663282]
Open-Set Semi-Supervised Continual Learning (OSSCL) is a more realistic semi-supervised continual learning setting.
We present a model with two distinct parts: (i) the reference network captures general-purpose and task-agnostic knowledge in the environment by using a broad spectrum of unlabeled samples, and (ii) the learner network is designed to learn task-specific representations by exploiting supervised samples.
arXiv Detail & Related papers (2023-01-11T15:05:36Z) - Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based
Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z) - Forget Less, Count Better: A Domain-Incremental Self-Distillation
Learning Benchmark for Lifelong Crowd Counting [51.44987756859706]
Off-the-shelf methods have some drawbacks to handle multiple domains.
Lifelong Crowd Counting aims at alleviating the catastrophic forgetting and improving the generalization ability.
arXiv Detail & Related papers (2022-05-06T15:37:56Z) - Unified Instance and Knowledge Alignment Pretraining for Aspect-based
Sentiment Analysis [96.53859361560505]
Aspect-based Sentiment Analysis (ABSA) aims to determine the sentiment polarity towards an aspect.
There always exists severe domain shift between the pretraining and downstream ABSA datasets.
We introduce a unified alignment pretraining framework into the vanilla pretrain-finetune pipeline.
arXiv Detail & Related papers (2021-10-26T04:03:45Z) - Multiband VAE: Latent Space Partitioning for Knowledge Consolidation in
Continual Learning [14.226973149346883]
Acquiring knowledge about new data samples without forgetting previous ones is a critical problem of continual learning.
We propose a new method for unsupervised continual knowledge consolidation in generative models that relies on the partitioning of Variational Autoencoder's latent space.
On top of the standard continual learning evaluation benchmarks, we evaluate our method on a new knowledge consolidation scenario and show that the proposed approach outperforms state-of-the-art by up to twofold.
arXiv Detail & Related papers (2021-06-23T06:58:40Z) - Revisiting Contrastive Methods for Unsupervised Learning of Visual
Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection.
In this paper, we first study how biases in the dataset affect existing methods.
We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z) - A Procedural World Generation Framework for Systematic Evaluation of
Continual Learning [2.599882743586164]
We introduce a computer graphics simulation framework that repeatedly renders only upcoming urban scene fragments.
At its core lies a modular parametric generative model with adaptable generative factors.
arXiv Detail & Related papers (2021-06-04T16:31:43Z) - Universal Representation Learning from Multiple Domains for Few-shot
Classification [41.821234589075445]
We propose to learn a single set of universal deep representations by distilling knowledge of multiple separately trained networks.
We show that the universal representations can be further refined for previously unseen domains by an efficient adaptation step.
arXiv Detail & Related papers (2021-03-25T13:49:12Z) - A Batch Normalization Classifier for Domain Adaptation [0.0]
Adapting a model to perform well on unforeseen data outside its training set is a common problem that continues to motivate new approaches.
We demonstrate that application of batch normalization in the output layer, prior to softmax activation, results in improved generalization across visual data domains in a refined ResNet model.
arXiv Detail & Related papers (2021-03-22T08:03:44Z) - $n$-Reference Transfer Learning for Saliency Prediction [73.17061116358036]
We propose a few-shot transfer learning paradigm for saliency prediction.
The proposed framework is gradient-based and model-agnostic.
The results show that the proposed framework achieves a significant performance improvement.
arXiv Detail & Related papers (2020-07-09T23:20:44Z) - Unsupervised Intra-domain Adaptation for Semantic Segmentation through
Self-Supervision [73.76277367528657]
Convolutional neural network-based approaches have achieved remarkable progress in semantic segmentation.
To cope with this limitation, automatically annotated data generated from graphic engines are used to train segmentation models.
We propose a two-step self-supervised domain adaptation approach to minimize the inter-domain and intra-domain gap together.
arXiv Detail & Related papers (2020-04-16T15:24:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.