Fugu-MT Paper Translation (Abstract): Gold Points Sniper: Self-guided Visual Reasoning in VLM for Fine-grained Action Understanding

Paper overview: Gold Points Sniper: Self-guided Visual Reasoning in VLM for Fine-grained Action Understanding

arxiv url: http://arxiv.org/abs/2606.22409v1
Date: Sun, 21 Jun 2026 09:54:39 GMT
Status: Translation complete
Last updated in system: 2026-06-25 18:24:17.124459
Title: Gold Points Sniper: Self-guided Visual Reasoning in VLM for Fine-grained Action Understanding
Title (reference translation): Gold Points Sniper: Self-guided Visual Reasoning for Fine-grained Action Understanding
Authors: Haodi Liu, Xinhang Yang, Kunda Yan, Sen Cui, Zeyu Zhang, Changshui Zhang
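Abstract summary: Gold Points Sniper (GPS) is a novel framework that equips lightweight vision-language models with self-guided multimodal reasoning capabilities. Our work establishes a reliable foundation for fine-grained action understanding in domestic robotics, so that robots can safely interpret human behavior.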
Reference score (independently calculated attention score): 30.463645590107035
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Robots operating in everyday environments must understand fine-grained human actions, intentions, and contextual cues from broad views where people occupy only small regions, a capability unmet by current systems. While open-vocabulary action recognition methods remain limited to assigning predefined labels, and vision-language models (VLMs) face an inherent trade-off between informational richness and factual fidelity in their outputs, neither approach achieves the deep semantic interpretation required for reliable human-robot interaction. We propose Gold Points Sniper (GPS), a novel framework that empowers lightweight VLMs with self-guided multimodal reasoning capabilities for fine-grained human action understanding. Our approach comprises three key modules: Gold Points Extractor trains VLMs to identify critical action-relevant details, Selective Socratic Questioner validates and refines these details through selective self-questioning, and Semantic Entailment Evaluator quantitatively assesses factual consistency using semantic entailment classification. Extensive experiments on our curated instruction-tuning dataset based on the CAP benchmark demonstrate that GPS-enhanced lightweight VLMs achieve substantial performance improvements, with some models reaching performance comparable to proprietary GPT-4o while maintaining superior factual accuracy. Our work establishes a reliable foundation for fine-grained action understanding in domestic robotics, enabling robots to safely interpret human behavior through information-dense yet factually grounded descriptions. Source code, training configurations, annotation prompts, and dataset details are released at https://github.com/Haodi-Liu/GPS-Gold-Point-Sniper.
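Abstract (reference translation): Robots operating in everyday environments must understand human actions, intentions, and context in detail from wide views in which people occupy only small regions. Open-vocabulary action recognition methods are limited to assigning predefined labels, and vision-language models (VLMs) face an inherent trade-off between informational richness and factual fidelity in their outputs; neither approach achieves the deep semantic interpretation required for reliable human-robot interaction. Gold Points Sniper (GPS) is a novel framework that equips lightweight VLMs with self-guided multimodal reasoning capabilities for human action understanding. The proposed method consists of three key modules: the Gold Points Extractor, which trains VLMs to identify critical action-relevant details; the Selective Socratic Questioner, which validates these details through selective self-questioning; and the Semantic Entailment Evaluator, which quantitatively assesses factual consistency using semantic entailment classification. Experiments based on the CAP benchmark show that GPS-enhanced lightweight VLMs achieve substantial performance improvements, reaching performance comparable to the proprietary GPT-4o while maintaining superior factual accuracy. Our work establishes a reliable foundation for fine-grained action understanding in domestic robotics, enabling robots to safely interpret human behavior through information-dense yet factually grounded descriptions. Source code, training configurations, annotation prompts, and dataset details are released at https://github.com/Haodi-Liu/GPS-Gold-Point-Sniper.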
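
The abstract describes a flow of extracting candidate details, selectively re-checking them, and scoring factual consistency by entailment. The following is a minimal sketch of the last two steps only, not the authors' implementation. Every name here (`GoldPoint`, `selective_questioning`, `entailment_score`, `toy_classify`), the confidence threshold, and the scoring formula are assumptions made for illustration. In particular, `toy_classify` is a word-overlap stand-in for a real entailment classifier, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldPoint:
    """One action-relevant detail (hypothetical structure, not from the paper)."""
    text: str
    confidence: float  # assumed self-reported confidence in [0, 1]


def selective_questioning(
    points: List[GoldPoint],
    verify: Callable[[GoldPoint], bool],
    threshold: float = 0.8,  # assumed cut-off; the abstract gives no value
) -> List[GoldPoint]:
    """Keep confident points as-is; re-check only uncertain ones and drop failures."""
    return [p for p in points if p.confidence >= threshold or verify(p)]


def entailment_score(
    points: List[GoldPoint],
    reference: str,
    classify: Callable[[str, str], str],
) -> float:
    """Fraction of points that the classifier labels 'entailment' given the reference."""
    if not points:
        return 0.0
    entailed = sum(1 for p in points if classify(reference, p.text) == "entailment")
    return entailed / len(points)


def toy_classify(premise: str, hypothesis: str) -> str:
    """Toy stand-in for an entailment model: 'entailment' only if every
    hypothesis word also occurs in the premise, otherwise 'neutral'."""
    premise_words = set(premise.lower().split())
    hypothesis_words = hypothesis.lower().split()
    if all(word in premise_words for word in hypothesis_words):
        return "entailment"
    return "neutral"


if __name__ == "__main__":
    reference = "a man pours water into a cup near the sink"
    candidates = [
        GoldPoint("man pours water", 0.95),
        GoldPoint("a cup near the sink", 0.90),
        GoldPoint("man holds a knife", 0.40),
        GoldPoint("woman pours water", 0.85),
    ]
    # This demo verifier rejects every uncertain point it is asked about.
    kept = selective_questioning(candidates, verify=lambda p: False)
    print([p.text for p in kept])
    print(entailment_score(kept, reference, toy_classify))
```

In this sketch the verifier is called only for points below the threshold, which is one plausible reading of "selective" self-questioning. Swapping `toy_classify` for a trained entailment model would leave the rest of the code unchanged.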
