Fugu-MT Paper Translation (Abstract): DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts

Paper overview: DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts

arxiv url: http://arxiv.org/abs/2604.14684v1
Date: Thu, 16 Apr 2026 06:40:44 GMT
Status: Translation complete
Last updated in system: 2026-04-17 21:29:31.762768
Title: DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts
Title (reference translation): DETR-ViP: A Detection Transformer Using Robust Discriminative Visual Prompts
Authors: Bo Qian, Dahu Shi, Xing Wei
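Abstract summary: We propose DETR-ViP, a robust object detection framework that generates class-distinguishable visual prompts. In addition to basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation. Experiments on COCO, LVIS, ODinW, and Roboflow100 show that DETR-ViP achieves substantially higher performance in visual prompt detection.
Reference score (independently computed attention score): 11.577330098443696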
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual prompted object detection enables interactive and flexible definition of target categories, thereby facilitating open-vocabulary detection. Since visual prompts are derived directly from image features, they often outperform text prompts in recognizing rare categories. Nevertheless, research on visual prompted detection has been largely overlooked, and it is typically treated as a byproduct of training text prompted detectors, which hinders its development. To fully unlock the potential of visual-prompted detection, we investigate the reasons why its performance is suboptimal and reveal that the underlying issue lies in the absence of global discriminability in visual prompts. Motivated by these observations, we propose DETR-ViP, a robust object detection framework that yields class-distinguishable visual prompts. On top of basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations. In addition, DETR-ViP employs a selective fusion strategy that ensures stable and robust detection. Extensive experiments on COCO, LVIS, ODinW, and Roboflow100 demonstrate that DETR-ViP achieves substantially higher performance in visual prompt detection compared to other state-of-the-art counterparts. A series of ablation studies and analyses further validate the effectiveness of the proposed improvements and shed light on the underlying reasons for the enhanced detection capability of visual prompts.
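Abstract (reference translation): Visually prompted object detection enables interactive and flexible definition of target categories, which facilitates open-vocabulary detection. Because visual prompts are derived directly from image features, they often outperform text prompts at recognizing rare categories. Nevertheless, research on visually prompted detection has been largely overlooked; it is usually treated as a byproduct of training text-prompted detectors, which has hindered its development. To fully unlock the potential of visual prompt detection, we investigate why its performance is suboptimal and show that the underlying problem is the lack of global discriminability in visual prompts. Based on these observations, we propose DETR-ViP, a robust object detection framework that generates class-distinguishable visual prompts. In addition to basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations. DETR-ViP also adopts a selective fusion strategy that ensures stable and robust detection. Extensive experiments on COCO, LVIS, ODinW, and Roboflow100 show that DETR-ViP achieves substantially higher performance in visual prompt detection than other state-of-the-art methods. A series of ablation studies and analyses further validates the effectiveness of the proposed improvements and sheds light on the reasons behind the improved detection capability of visual prompts.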

Related papers