Fugu-MT Paper Translation (Abstract): CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection

Paper overview: CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection

arxiv url: http://arxiv.org/abs/2606.06978v1
Date: Fri, 05 Jun 2026 07:09:52 GMT
Status: Translation complete
Last updated in system: 2026-06-08 14:33:29.607039
Title: CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection
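Title (reference translation): CL-CLIP: A CLIP-Based Continual Learning Framework
Authors: Zihan Liu, Yuguang Yang, Shengjie Su, Jianing Pang, Linlin Yang, Chunyu Xie, Nikolai Yu. Zolotykh, Baochang Zhang
Abstract summary: Continual Object Detection (COD) requires a detector to acquire new categories over time while preserving those it has already learned. Recent CLIP-based open-vocabulary detectors have shown strong zero-shot generalization. We propose CL-CLIP, a CLIP-based COD framework.
Reference score (independently computed attention score): 25.86078021755795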
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Continual Object Detection (COD) requires a detector to acquire new categories over time while preserving previously learned ones. This goal is closely related to open-vocabulary detection, since both settings require reasoning over categories that are not fully covered by the annotations available at the current training stage. Recent CLIP-based open-vocabulary detectors have shown strong zero-shot generalization, and frameworks such as F-ViT demonstrate that vision-language pretraining can provide powerful zero-shot detection ability for unseen categories. However, real-world deployments cannot remain purely zero-shot: once these detectors are continually updated on newly introduced categories, they suffer severe catastrophic forgetting and quickly lose their previously calibrated detection ability. We therefore propose CL-CLIP, a CLIP-based COD framework that equips open-vocabulary detectors with better continual learning ability through cost-volume-guided category decoupling. Specifically, following CAT-Seg, we compute a CLIP image-text similarity cost volume, defined as dense category-wise response maps between visual tokens and class text embeddings. This zero-shot spatial prior decomposes shared region features into class-specific pathways, which are then processed by a Multi-Expert RoI head. Extensive experiments on PASCAL VOC and MS-COCO show that CL-CLIP substantially improves the F-ViT baseline under continual fine-tuning and achieves competitive performance with existing continual object detectors, especially in adapting to newly introduced categories while preserving competitive base-class performance.
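Abstract (reference translation): Continual Object Detection (COD) requires a detector to acquire new categories over time while preserving those it has already learned. This goal is closely related to open-vocabulary detection, because both settings require reasoning over categories that are not fully covered by the annotations available at the current training stage. Recent CLIP-based open-vocabulary detectors have shown strong zero-shot generalization, and frameworks such as F-ViT have shown that vision-language pretraining can provide strong zero-shot detection ability for unseen categories. However, real-world deployments cannot remain purely zero-shot: once these detectors are continually updated on newly introduced categories, they suffer severe catastrophic forgetting and rapidly lose their previously calibrated detection ability. We therefore propose CL-CLIP, a CLIP-based COD framework. Specifically, following CAT-Seg, we compute a CLIP image-text similarity cost volume, defined as dense category-wise response maps between visual tokens and class text embeddings. This zero-shot spatial prior decomposes shared region features into class-specific pathways, which are then processed by a Multi-Expert RoI head. Extensive experiments on PASCAL VOC and MS-COCO show that CL-CLIP substantially improves the F-ViT baseline under continual fine-tuning and is competitive with existing continual object detectors, particularly in adapting to newly introduced categories while maintaining competitive base-class performance.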
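
The abstract defines the cost volume as dense category-wise response maps between visual tokens and class text embeddings. The following is a minimal sketch of that idea only, not the paper's implementation: it assumes cosine similarity as the similarity measure, uses plain Python lists in place of real CLIP features, and the names `l2_normalize` and `cost_volume` as well as the toy inputs are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cost_volume(visual_tokens, text_embeddings):
    """Build one response map per class (assumed: cosine similarity).

    visual_tokens:   H x W grid of D-dimensional vectors (nested lists).
    text_embeddings: C vectors of dimension D, one per class name.
    Returns a C x H x W nested list of similarities.
    """
    texts = [l2_normalize(t) for t in text_embeddings]
    grid = [[l2_normalize(tok) for tok in row] for row in visual_tokens]
    return [
        [[sum(a * b for a, b in zip(tok, t)) for tok in row] for row in grid]
        for t in texts
    ]


# Toy stand-ins for CLIP outputs: a 2 x 2 token grid and 2 class embeddings.
tokens = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[3.0, 3.0], [0.0, 0.0]],
]
class_embeddings = [[1.0, 0.0], [0.0, 1.0]]

volume = cost_volume(tokens, class_embeddings)
for class_index, response_map in enumerate(volume):
    print(class_index, response_map)
```

Each entry `volume[c][i][j]` is the response of class `c` at grid position `(i, j)`. In practice the same computation would typically be a single batched matrix product over normalized feature tensors; how CL-CLIP then uses these maps to route region features to the Multi-Expert RoI head is not specified in the abstract and is not sketched here.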
