Fugu-MT Paper Translation (Abstract): Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

Paper overview: Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

arxiv url: http://arxiv.org/abs/2605.19859v2
Date: Thu, 21 May 2026 18:05:34 GMT
Status: Translation complete
Last updated in system: 2026-05-25 14:44:53.69653
Title: Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models
Authors: Hengfei Wang, Anshul Gupta, Pierre Vuillecard, Jean-Marc Odobez
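Abstract summary: We present EyeVLM, a systematic evaluation framework for gaze understanding in vision-language models (VLMs). To assess gaze understanding capabilities, we focus on two core tasks. The first, gaze following, predicts the 2D location where a person is looking. The second, social gaze prediction, requires social and relational reasoning over multi-person interactions.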
Reference score (independently computed attention score): 20.954224027029625
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language models (VLMs) have rapidly evolved into general-purpose multimodal reasoners with strong zero-shot generalization. In this context, VLMs could greatly benefit the analysis of human gaze and attention, a central task in human behavior understanding that requires reasoning about the physical scene as well as the activity, interactions, and social context. However, the extent to which VLMs can reliably understand human gaze and related attentional behaviors remains largely unexplored. In this work, we present EyeVLM, a systematic evaluation framework for gaze understanding in VLMs across two complementary dimensions: tasks and models. To assess gaze understanding capabilities, we focus on two core tasks. The first, gaze following, i.e., predicting the 2D location where a person is looking, has a geometric and visual processing focus, requiring a precise understanding of the human face, attention direction, 3D scene structure, and spatial grounding of attended targets. The second, social gaze prediction, requires social and relational reasoning over multi-person interactions (e.g., mutual gaze and shared attention), and may benefit more from the LLM semantic reasoning capabilities within VLMs. Regarding models, EyeVLM evaluates these tasks in two ways: a zero-shot setting with a diverse set of state-of-the-art open- and closed-source VLMs, exploring different prompting strategies; and a fine-tuning approach based on task-specific QA pairs, studying the impact of model scale and data scale. As benchmarks, we rely on existing gaze understanding datasets and perform a systematic comparison with state-of-the-art purely visual models. Overall, our results show that current VLMs lack precise gaze understanding capabilities. While standard training helps reduce the gap with visual models, significant improvements are still needed.
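Illustrative sketch: the abstract describes gaze following as predicting a 2D gaze location and mentions fine-tuning on task-specific QA pairs, but it does not give the prompt format, the coordinate convention, or the metric. The Python below is therefore only one plausible way such a QA pair could be built and scored. `GazeSample`, `build_qa_pair`, `parse_point`, and `l2_distance` are hypothetical names, and the normalized [0, 1] coordinates, the prompt wording, and the L2 distance metric are assumptions, not details taken from the paper.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class GazeSample:
    """Hypothetical annotation; all coordinates are assumed normalized to [0, 1]."""
    head_box: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    gaze_xy: tuple[float, float]                 # ground-truth gaze point (x, y)


def build_qa_pair(sample: GazeSample) -> dict[str, str]:
    """Turn one annotation into a text question and a text answer (assumed format)."""
    x0, y0, x1, y1 = sample.head_box
    question = (
        f"The person whose head is inside the box "
        f"[{x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f}] is looking at a point in the image. "
        "Answer with normalized coordinates in the form (x, y)."
    )
    gx, gy = sample.gaze_xy
    return {"question": question, "answer": f"({gx:.2f}, {gy:.2f})"}


def parse_point(text: str):
    """Extract the first '(x, y)' pair from a model reply, or None if there is none."""
    number = r"([0-9]*\.?[0-9]+)"
    match = re.search(rf"\(\s*{number}\s*,\s*{number}\s*\)", text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def l2_distance(pred, target) -> float:
    """Euclidean distance between two normalized points."""
    return math.hypot(pred[0] - target[0], pred[1] - target[1])
```

A reply such as "The target is at (0.72, 0.46)." would be parsed with `parse_point` and compared against `sample.gaze_xy` with `l2_distance`; replies with no parsable coordinates return `None` and need a separate policy (for example, counting them as failures), which is likewise an assumption here.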

Related papers