Fugu-MT Paper Translation (Abstract): DetPO: In-Context Learning with Multi-Modal LLMs for Few-Shot Object Detection

Paper Overview: DetPO: In-Context Learning with Multi-Modal LLMs for Few-Shot Object Detection

arxiv url: http://arxiv.org/abs/2603.23455v1
Date: Tue, 24 Mar 2026 17:26:55 GMT
Status: Translation complete
Last updated in system: 2026-03-25 19:53:37.610777
Title: DetPO: In-Context Learning with Multi-Modal LLMs for Few-Shot Object Detection
Title (reference translation): DetPO: In-Context Learning Using Multi-Modal LLMs for Few-Shot Object Detection
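Authors: Gautam Rajendrakumar Gare, Neehar Peri, Matvei Popov, Shruti Jain, John Galeotti, Deva Ramanan
Abstract summary: We propose Detection Prompt Optimization (DetPO), a gradient-free test-time optimization method. The proposed method yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS.
Reference score (independently computed attention score): 39.153744982595036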
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-Modal LLMs (MLLMs) demonstrate strong visual grounding capabilities on popular object detection benchmarks like OdinW-13 and RefCOCO. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. While in-context prompting is a common strategy to improve performance across diverse tasks, we find that it often yields lower detection accuracy than prompting with class names alone. This suggests that current MLLMs cannot yet effectively leverage few-shot visual examples and rich textual descriptions for object detection. Since frontier MLLMs are typically only accessible via APIs, and state-of-the-art open-weights models are prohibitively expensive to fine-tune on consumer-grade hardware, we instead explore black-box prompt optimization for few-shot object detection. To this end, we propose Detection Prompt Optimization (DetPO), a gradient-free test-time optimization approach that refines text-only prompts by maximizing detection accuracy on few-shot visual training examples while calibrating prediction confidence. Our proposed approach yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS, outperforming prior black-box approaches by up to 9.7%. Our code is available at https://github.com/ggare-cmu/DetPO
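Abstract (reference translation): Multi-Modal LLMs (MLLMs) show strong visual grounding capabilities on popular object detection benchmarks such as OdinW-13 and RefCOCO. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks, and imaging modalities that rarely appear in their pre-training data. In-context prompting is a common strategy for improving performance across diverse tasks, yet it often yields lower detection accuracy than prompting with class names alone. This suggests that current MLLMs cannot yet make effective use of few-shot visual examples and rich textual descriptions for object detection. Because frontier MLLMs are typically accessible only via APIs, and state-of-the-art open-weights models are prohibitively expensive to fine-tune on consumer-grade hardware, the authors explore black-box prompt optimization for few-shot object detection. They propose Detection Prompt Optimization (DetPO), a gradient-free test-time optimization approach that refines text-only prompts by maximizing detection accuracy on few-shot visual training examples while calibrating prediction confidence. The approach yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS, outperforming prior black-box approaches by up to 9.7%. The code is available at https://github.com/ggare-cmu/DetPO.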
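
The abstract describes refining a text prompt by scoring it against a few labeled images, without gradients. The sketch below illustrates only that general idea; it is not the authors' algorithm. It picks the best prompt from a fixed candidate list by mean F1 at an IoU threshold, and it omits both the prompt refinement step and the confidence calibration the abstract mentions. All names here (`iou`, `f1_at_iou`, `select_prompt`, `toy_detect`) are hypothetical, and `toy_detect` is a stand-in for a real MLLM API call.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds: List[Box], gts: List[Box], thr: float = 0.5) -> float:
    """Greedy one-to-one matching of predictions to ground truth, then F1."""
    matched = set()
    tp = 0
    for p in preds:
        best_j, best_v = -1, thr
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v >= best_v:
                best_j, best_v = j, v
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / len(preds), tp / len(gts)
    return 2 * precision * recall / (precision + recall)


def select_prompt(
    candidates: Sequence[str],
    detect: Callable[[str, str], List[Box]],
    examples: Sequence[Tuple[str, List[Box]]],
) -> str:
    """Return the candidate with the highest mean F1 on the few-shot examples.

    `detect` is treated as a black box: only its outputs are used.
    """
    def score(prompt: str) -> float:
        return sum(f1_at_iou(detect(prompt, img), gts) for img, gts in examples) / len(examples)

    return max(candidates, key=score)


def toy_detect(prompt: str, image_id: str) -> List[Box]:
    """Hypothetical stand-in for an MLLM detector; ignores the image."""
    if "red" in prompt:
        return [(0.0, 0.0, 10.0, 10.0)]
    return [(20.0, 20.0, 30.0, 30.0)]


few_shot = [("img0", [(0.0, 0.0, 10.0, 10.0)])]
best = select_prompt(["a valve", "a red valve handle"], toy_detect, few_shot)
print(best)
```

In a real setting, `detect` would wrap a call to the model being prompted, and the candidate list would be regenerated between rounds instead of being fixed up front.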
