Hierarchical Representation Matching for CLIP-based Class-Incremental Learning
- URL: http://arxiv.org/abs/2509.22645v1
- Date: Fri, 26 Sep 2025 17:59:51 GMT
- Title: Hierarchical Representation Matching for CLIP-based Class-Incremental Learning
- Authors: Zhen-Hao Wen, Yan Wang, Ji Feng, Han-Jia Ye, De-Chuan Zhan, Da-Wei Zhou,
- Abstract summary: Class-Incremental Learning (CIL) aims to endow models with the ability to continuously adapt to evolving data streams.<n>Recent advances in pre-trained vision-language models (e.g., CLIP) provide a powerful foundation for this task.<n>We introduce HiErarchical Representation MAtchiNg (HERMAN) for CLIP-based CIL.
- Score: 80.2317078787969
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Class-Incremental Learning (CIL) aims to endow models with the ability to continuously adapt to evolving data streams. Recent advances in pre-trained vision-language models (e.g., CLIP) provide a powerful foundation for this task. However, existing approaches often rely on simplistic templates, such as "a photo of a [CLASS]", which overlook the hierarchical nature of visual concepts. For example, recognizing "cat" versus "car" depends on coarse-grained cues, while distinguishing "cat" from "lion" requires fine-grained details. Similarly, the current feature mapping in CLIP relies solely on the representation from the last layer, neglecting the hierarchical information contained in earlier layers. In this work, we introduce HiErarchical Representation MAtchiNg (HERMAN) for CLIP-based CIL. Our approach leverages LLMs to recursively generate discriminative textual descriptors, thereby augmenting the semantic space with explicit hierarchical cues. These descriptors are matched to different levels of the semantic hierarchy and adaptively routed based on task-specific requirements, enabling precise discrimination while alleviating catastrophic forgetting in incremental tasks. Extensive experiments on multiple benchmarks demonstrate that our method consistently achieves state-of-the-art performance.
Related papers
- Hierarchical Semantic Tree Anchoring for CLIP-Based Class-Incremental Learning [11.82771798674077]
Class-Incremental Learning (CIL) enables models to learn new classes continually while preserving past knowledge.<n>But real-world visual and linguistic concepts are inherently hierarchical.<n>We propose HASTEN that anchors hierarchical information into CIL to reduce catastrophic forgetting.
arXiv Detail & Related papers (2025-11-19T17:14:47Z) - Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels [19.740929527669483]
Multi-label recognition with partial labels (MLR-PL) is a practical task in computer vision.<n>We introduce a semantic decoupling module and a category-specific prompt optimization method in CLIP-based framework.<n>Our method effectively separates information from different categories and achieves better performance compared to CLIP-based baseline method.
arXiv Detail & Related papers (2024-12-14T14:31:36Z) - RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z) - Online Continual Learning on Hierarchical Label Expansion [28.171890301966616]
We propose a novel multi-level hierarchical class incremental task configuration with an online learning constraint, called hierarchical label expansion (HLE)
Our configuration allows a network to first learn coarse-grained classes, with data labels continually expanding to more fine-grained classes in various hierarchy depths.
Our experiments demonstrate that our proposed method can effectively use hierarchy on our HLE setup to improve classification accuracy across all levels of hierarchies.
arXiv Detail & Related papers (2023-08-28T07:42:26Z) - Towards Realistic Zero-Shot Classification via Self Structural Semantic
Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z) - Waffling around for Performance: Visual Classification with Random Words
and Broad Concepts [121.60918966567657]
WaffleCLIP is a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors.
We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors.
arXiv Detail & Related papers (2023-06-12T17:59:48Z) - AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning [53.32576252950481]
Continual learning aims to enable a model to incrementally learn knowledge from sequentially arrived data.
In this paper, we propose a non-incremental learner, named AttriCLIP, to incrementally extract knowledge of new classes or tasks.
arXiv Detail & Related papers (2023-05-19T07:39:17Z) - CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets [24.868024094095983]
Open vocabulary models (e.g. CLIP) have shown strong performance on zero-shot classification.
We propose Classification with Hierarchical Label Sets (or CHiLS) for datasets with implicit semantic hierarchies.
CHiLS is simple to implement within existing zero-shot pipelines and requires no additional training cost.
arXiv Detail & Related papers (2023-02-06T03:59:15Z) - OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal
Regression [94.28253749970534]
We propose to learn the rank concepts from the rich semantic CLIP latent space.
OrdinalCLIP consists of learnable context tokens and learnable rank embeddings.
Experimental results show that our paradigm achieves competitive performance in general ordinal regression tasks.
arXiv Detail & Related papers (2022-06-06T03:54:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.