Fugu-MT Paper Translation (Abstract): The Detector Teaches Itself: Lightweight Self-Supervised Adaptation for Open-Vocabulary Object Detection

Paper overview: The Detector Teaches Itself: Lightweight Self-Supervised Adaptation for Open-Vocabulary Object Detection

arxiv url: http://arxiv.org/abs/2605.03642v1
Date: Tue, 05 May 2026 11:14:30 GMT
Status: Translation complete
System update date: 2026-05-06 19:35:43.913974
Title: The Detector Teaches Itself: Lightweight Self-Supervised Adaptation for Open-Vocabulary Object Detection
Title (reference translation): The Detector Teaches Itself: Lightweight Self-Supervised Adaptation for Open-Vocabulary Object Detection
Authors: Yazhe Wan, Changjae Oh
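Abstract summary: Open-vocabulary object detection aims to recognize objects from an open set of categories by leveraging vision-language models (VLMs) pre-trained on large-scale image-text data. This paper proposes Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning approach that improves VLMs for cooperative model-based object detection. Experiments on the COCO and LVIS datasets show that DAT consistently improves detection performance on both novel and known categories.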
Reference score (independently computed attention score): 8.847667302225156
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Open-vocabulary object detection aims to recognize objects from an open set of categories by leveraging vision-language models (VLMs) pre-trained on large-scale image-text data. The cooperative paradigm combines an object detector with a VLM to achieve zero-shot recognition of novel objects. However, VLMs pre-trained on full images often struggle to capture local object details, limiting their effectiveness when applied to region-level detection. We present Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning approach to improve VLMs for cooperative model-based object detection. Given a cooperative model consisting of a closed-set detector and a VLM, we first construct a region-aware pseudo-labeled dataset using a pre-trained closed-set object detector, in which regions corresponding to novel objects may be present but remain unlabeled or mislabeled. We then fine-tune the visual backbone of the VLM in a decoupled manner, which enhances local feature alignment while preserving global semantic knowledge via weight interpolation. DAT is a plug-and-play module that adds no inference overhead and fine-tunes fewer than 0.8M parameters. Experiments on the COCO and LVIS datasets show that DAT consistently improves detection performance on both novel and known categories, establishing a new state of the art in cooperative open-vocabulary detection.
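Abstract (reference translation): Open-vocabulary object detection aims to recognize objects from an open set of categories, drawing on vision-language models (VLMs) pre-trained on large-scale image-text data. The cooperative paradigm pairs an object detector with a VLM to recognize novel objects zero-shot. VLMs pre-trained on full images, however, often fail to capture local object details, which limits their effectiveness for region-level detection. This paper proposes Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning approach that improves VLMs for cooperative model-based object detection. Given a cooperative model made up of a closed-set detector and a VLM, the authors first build a region-aware pseudo-labeled dataset with a pre-trained closed-set object detector; regions corresponding to novel objects may be present in it but remain unlabeled or mislabeled. They then fine-tune the visual backbone of the VLM in a decoupled manner, strengthening local feature alignment while preserving global semantic knowledge through weight interpolation. DAT is a plug-and-play module that adds no inference overhead and fine-tunes fewer than 0.8M parameters. Experiments on the COCO and LVIS datasets show that DAT consistently improves detection performance on both novel and known categories, setting a new state of the art in cooperative open-vocabulary detection.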
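
Note on weight interpolation: the abstract says global semantic knowledge is preserved "via weight interpolation" but gives no formula. A common form of this idea is a linear blend of the pre-trained and fine-tuned parameters, theta = (1 - alpha) * theta_pre + alpha * theta_ft. The sketch below illustrates that generic blend only; the function name `interpolate_weights`, the coefficient `alpha`, and the parameter name `visual.proj` are hypothetical and are not taken from the paper, and DAT may interpolate differently.

```python
from typing import Dict, List

# A checkpoint is modelled as a mapping from parameter name to a flat list of
# floats. This is a stand-in for real tensors, chosen to keep the sketch
# dependency-free.
Weights = Dict[str, List[float]]


def interpolate_weights(pretrained: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * pretrained + alpha * finetuned, parameter by parameter.

    alpha = 0 keeps the pre-trained weights; alpha = 1 keeps the fine-tuned ones.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("checkpoints must share the same parameter names")

    blended: Weights = {}
    for name, w_pre in pretrained.items():
        w_ft = finetuned[name]
        if len(w_pre) != len(w_ft):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        blended[name] = [(1.0 - alpha) * p + alpha * f for p, f in zip(w_pre, w_ft)]
    return blended


# Toy checkpoints with a single hypothetical parameter.
pretrained = {"visual.proj": [1.0, 2.0, 3.0]}
finetuned = {"visual.proj": [3.0, 2.0, 1.0]}

blended = interpolate_weights(pretrained, finetuned, alpha=0.5)
unchanged = interpolate_weights(pretrained, finetuned, alpha=0.0)
```

Because the blend happens once, offline, the resulting model has the same architecture and parameter count as the original VLM, which is consistent with the abstract's claim of no inference overhead.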

Related papers