Related papers: Towards Robust Visual Continual Learning with Multi-Prototype Supervision

Towards Robust Visual Continual Learning with Multi-Prototype Supervision

URL: http://arxiv.org/abs/2509.16011v1
Date: Fri, 19 Sep 2025 14:24:48 GMT
Title: Towards Robust Visual Continual Learning with Multi-Prototype Supervision
Authors: Xiwei Liu, Yulong Li, Yichen Li, Xinlin Zhuang, Haolin Yang, Huifa Li, Imran Razzak,
Abstract summary: MuproCL is a novel framework that replaces the single target with multiple, context-aware prototypes.<n>A LogSumExp aggregation mechanism allows the vision model to adaptively align with the most relevant prototype for a given image.
Score: 24.987400887222762
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Language-guided supervision, which utilizes a frozen semantic target from a Pretrained Language Model (PLM), has emerged as a promising paradigm for visual Continual Learning (CL). However, relying on a single target introduces two critical limitations: 1) semantic ambiguity, where a polysemous category name results in conflicting visual representations, and 2) intra-class visual diversity, where a single prototype fails to capture the rich variety of visual appearances within a class. To this end, we propose MuproCL, a novel framework that replaces the single target with multiple, context-aware prototypes. Specifically, we employ a lightweight LLM agent to perform category disambiguation and visual-modal expansion to generate a robust set of semantic prototypes. A LogSumExp aggregation mechanism allows the vision model to adaptively align with the most relevant prototype for a given image. Extensive experiments across various CL baselines demonstrate that MuproCL consistently enhances performance and robustness, establishing a more effective path for language-guided continual learning.

Related papers

CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks [76.00315860962885]
We propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework for unsupervised pre-training in human-centric visual tasks.<n> CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels.<n>MoE dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability.
arXiv Detail & Related papers (2026-01-19T15:19:28Z)
From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion [91.35078719566472]
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection.<n>We introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities.
arXiv Detail & Related papers (2026-01-15T18:59:10Z)
Learning Yourself: Class-Incremental Semantic Segmentation with Language-Inspired Bootstrapped Disentanglement [10.912635927529218]
Class-Incremental Semantic (CISS) requires continuous learning of newly introduced classes while retaining knowledge of past classes.<n>This phenomenon involves Prototype-Feature Entanglement caused by semantic misalignment during the incremental process, and Background-Increment Entanglement due to dynamic data evolution.<n>We introduce a Language-inspired Bootstrapped Disentanglement framework (LBD)<n>We achieve state-of-the-art performance on both Pascal VOC and ADE20k, particularly in multi-step scenarios.
arXiv Detail & Related papers (2025-08-30T15:18:58Z)
Multi-modal Mutual-Guidance Conditional Prompt Learning for Vision-Language Models [21.20658517302458]
MuGCP (Multi-modal Mutual-Guidance Conditional Prompt Learning) is a novel paradigm designed for conditional prompt generation.<n> AMG module generates Visual Conditional Prompts (VCP), enhancing the model's performance in multi-modal tasks.<n>MPF mechanism integrates SCP andVCP with contextual prompts, ensuring seamless coordination.
arXiv Detail & Related papers (2025-07-11T08:45:27Z)
Dynamic Multimodal Prototype Learning in Vision-Language Models [44.84161970425967]
We introduce textbfProtoMM, a training-free framework that constructs multimodal prototypes to adapt vision-language models during the test time.<n>By viewing the prototype as a discrete distribution over the textual descriptions and visual particles, ProtoMM has the ability to combine the multimodal features for comprehensive prototype learning.
arXiv Detail & Related papers (2025-07-04T15:31:47Z)
Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks. Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval. This paper proposes a unified approach that integrates the strengths of both paradigms.
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
Towards Generative Class Prompt Learning for Fine-grained Visual Recognition [5.633314115420456]
Generative Class Prompt Learning and Contrastive Multi-class Prompt Learning are presented. Generative Class Prompt Learning improves visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts. CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation.
arXiv Detail & Related papers (2024-09-03T12:34:21Z)
Envisioning Class Entity Reasoning by Large Language Models for Few-shot Learning [13.68867780184022]
Few-shot learning aims to recognize new concepts using a limited number of visual samples. Our framework incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs) For the challenging one-shot setting, our approach, utilizing the ResNet-12 backbone, achieves an average improvement of 1.95% over the second-best competitor.
arXiv Detail & Related papers (2024-08-22T15:10:20Z)
Delving into Multimodal Prompting for Fine-grained Visual Classification [57.12570556836394]
Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks. We propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image subcategory (CLIP) model.
arXiv Detail & Related papers (2023-09-16T07:30:52Z)
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions [126.3136109870403]
We introduce a generic and lightweight Visual Prompt Generator Complete module (VPG-C) VPG-C infers and completes the missing details essential for comprehending demonstrative instructions. We build DEMON, a comprehensive benchmark for demonstrative instruction understanding.
arXiv Detail & Related papers (2023-08-08T09:32:43Z)
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks. In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters. Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z)
Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection. Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning. We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
Dual Prototypical Contrastive Learning for Few-shot Semantic Segmentation [55.339405417090084]
We propose a dual prototypical contrastive learning approach tailored to the few-shot semantic segmentation (FSS) task. The main idea is to encourage the prototypes more discriminative by increasing inter-class distance while reducing intra-class distance in prototype feature space. We demonstrate that the proposed dual contrastive learning approach outperforms state-of-the-art FSS methods on PASCAL-5i and COCO-20i datasets.
arXiv Detail & Related papers (2021-11-09T08:14:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.