Fugu-MT Paper Translation (Abstract): Hierarchical Representation Matching for CLIP-based Class-Incremental Learning

Paper overview: Hierarchical Representation Matching for CLIP-based Class-Incremental Learning

arxiv url: http://arxiv.org/abs/2509.22645v1
Date: Fri, 26 Sep 2025 17:59:51 GMT
Status: Translation complete
Last updated in system: 2025-09-29 20:57:54.640607
Title: Hierarchical Representation Matching for CLIP-based Class-Incremental Learning
Authors: Zhen-Hao Wen, Yan Wang, Ji Feng, Han-Jia Ye, De-Chuan Zhan, Da-Wei Zhou
Abstract summary: Class-Incremental Learning (CIL) aims to give models the ability to adapt continuously to evolving data streams. Recent advances in pre-trained vision-language models (e.g., CLIP) provide a strong foundation for this task. This paper introduces HiErarchical Representation MAtchiNg (HERMAN) for CLIP-based CIL.
Reference score (independently computed attention score): 80.2317078787969
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Class-Incremental Learning (CIL) aims to endow models with the ability to continuously adapt to evolving data streams. Recent advances in pre-trained vision-language models (e.g., CLIP) provide a powerful foundation for this task. However, existing approaches often rely on simplistic templates, such as "a photo of a [CLASS]", which overlook the hierarchical nature of visual concepts. For example, recognizing "cat" versus "car" depends on coarse-grained cues, while distinguishing "cat" from "lion" requires fine-grained details. Similarly, the current feature mapping in CLIP relies solely on the representation from the last layer, neglecting the hierarchical information contained in earlier layers. In this work, we introduce HiErarchical Representation MAtchiNg (HERMAN) for CLIP-based CIL. Our approach leverages LLMs to recursively generate discriminative textual descriptors, thereby augmenting the semantic space with explicit hierarchical cues. These descriptors are matched to different levels of the semantic hierarchy and adaptively routed based on task-specific requirements, enabling precise discrimination while alleviating catastrophic forgetting in incremental tasks. Extensive experiments on multiple benchmarks demonstrate that our method consistently achieves state-of-the-art performance.
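
The contrast the abstract draws, a single template versus descriptors matched at several levels of a hierarchy, can be sketched with toy vectors. This is a minimal illustration and not the authors' implementation: the function names (`template_prompt`, `hierarchical_score`), the two levels `coarse` and `fine`, the fixed `weights` standing in for adaptive routing, and the 2-D features are all assumptions made for this example. No CLIP model or LLM is called.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def template_prompt(class_name):
    """The single-template baseline quoted in the abstract."""
    return f"a photo of a {class_name}"


def hierarchical_score(image_feats, descriptor_feats, level_weights):
    """Weighted sum over levels of the mean image-descriptor similarity.

    image_feats:      {level: vector}
    descriptor_feats: {level: [vector, ...]}
    level_weights:    {level: float}, a hypothetical stand-in for routing
    """
    total = 0.0
    for level, weight in level_weights.items():
        sims = [cosine(image_feats[level], d) for d in descriptor_feats[level]]
        total += weight * sum(sims) / len(sims)
    return total


# Toy 2-D features, made up for illustration (not CLIP outputs).
image = {"coarse": [1.0, 0.0], "fine": [0.0, 1.0]}
cat = {"coarse": [[1.0, 0.0]], "fine": [[0.0, 1.0]]}
lion = {"coarse": [[1.0, 0.0]], "fine": [[1.0, 0.0]]}
weights = {"coarse": 0.5, "fine": 0.5}

scores = {
    "cat": hierarchical_score(image, cat, weights),
    "lion": hierarchical_score(image, lion, weights),
}
prediction = max(scores, key=scores.get)
```

In this toy setup the two classes tie at the coarse level and are separated only by the fine-level descriptor, which mirrors the abstract's "cat" versus "lion" example.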

Related papers