Fugu-MT Paper Translation (Abstract): Comparing Human Gaze and Vision-Language Model Attention in Safety-Relevant Environments

Paper overview: Comparing Human Gaze and Vision-Language Model Attention in Safety-Relevant Environments

arxiv url: http://arxiv.org/abs/2606.15202v1
Date: Sat, 13 Jun 2026 08:55:48 GMT
Status: Translation complete
Last updated in system: 2026-06-16 16:21:33.052925
Title: Comparing Human Gaze and Vision-Language Model Attention in Safety-Relevant Environments
Title (reference translation): Comparison of human gaze and vision-language model attention in safety-relevant environments
Authors: Marta Vallejo, Siwen Wang
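Abstract summary: Human visual attention plays an important role in how people perceive and respond to environments containing potential risks. This study investigates whether large vision-language models can identify the same regions of a scene that attract human attention in safety-relevant environments.
Reference score (independently computed attention score): 2.770280158448976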
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human visual attention plays an important role in how people perceive and respond to environments containing potential risks. This study investigates whether large vision-language models can identify the same regions of a scene that attract human attention in safety-relevant environments. Eye-tracking data were collected from ten participants viewing 33 scene images representing environments with varying levels of potential risk using Pupil Invisible wearable glasses. Gaze coordinates were mapped onto stimulus images to generate population-averaged human gaze heatmaps. In parallel, GPT-4o was prompted through the OpenAI Vision Application Programming Interface (API) to generate spatial predictions of visual attention, which were converted into saliency maps for comparison with human gaze patterns. Spatial alignment between human gaze heatmaps and model-generated saliency maps was evaluated using four complementary metrics: Pearson correlation (r = 0.515 +- 0.117), Normalised Scanpath Saliency (NSS = 0.988 +- 0.323), Kullback-Leibler divergence (KL = 1.766 +- 0.844), and Area Under the Receiver Operating Characteristic Curve using the Judd formulation (AUC-Judd = 0.806 +- 0.076). A cross-model comparison with Gemini Pro, Gemini Flash, and Claude showed that all models exceeded the AUC-Judd chance baseline of 0.5 and achieved positive NSS scores. Gemini Pro demonstrated the strongest spatial localisation according to three of the four metrics, whereas GPT-4o produced the closest distributional match to human attention as measured by KL divergence. These findings suggest that large vision-language models can identify regions that broadly correspond to where humans direct visual attention in safety-relevant scenes without requiring eye-tracking training data. The results highlight the potential of vision-language models as a scalable tool for approximating human attentional patterns.
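The abstract describes mapping gaze coordinates onto the stimulus images and averaging across participants to obtain gaze heatmaps. A minimal sketch of that step follows, in pure Python for portability (NumPy or SciPy would be the usual choice in practice). The Gaussian smoothing, the `sigma` value, and the function names `gaze_heatmap` and `average_heatmaps` are illustrative assumptions; the abstract does not state how the authors smoothed or normalised their maps.

```python
import math


def gaze_heatmap(gaze_points, width, height, sigma=2.0):
    """Build one participant's heatmap from (x, y) gaze points in image pixels.

    Assumption: each gaze point contributes an isotropic Gaussian of width
    `sigma` (in pixels); the result is normalised to sum to 1.
    """
    heat = [[0.0] * width for _ in range(height)]
    for gx, gy in gaze_points:
        if not (0 <= gx < width and 0 <= gy < height):
            continue  # ignore samples that fall outside the stimulus image
        for y in range(height):
            for x in range(width):
                d2 = (x - gx) ** 2 + (y - gy) ** 2
                heat[y][x] += math.exp(-d2 / (2.0 * sigma * sigma))
    total = sum(sum(row) for row in heat)
    if total == 0.0:
        return heat  # no valid gaze points: return the all-zero map
    return [[v / total for v in row] for row in heat]


def average_heatmaps(maps):
    """Element-wise mean of equally sized per-participant heatmaps."""
    n = len(maps)
    height, width = len(maps[0]), len(maps[0][0])
    return [
        [sum(m[y][x] for m in maps) / n for x in range(width)]
        for y in range(height)
    ]


# Two hypothetical participants looking at a tiny 5x4 "image".
p1 = gaze_heatmap([(1, 1), (2, 1)], width=5, height=4, sigma=1.0)
p2 = gaze_heatmap([(2, 2)], width=5, height=4, sigma=1.0)
population = average_heatmaps([p1, p2])
```

Normalising each participant's map before averaging gives every participant equal weight regardless of how many gaze samples they contributed; pooling all samples first would be an equally plausible design.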
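Abstract (reference translation): Human visual attention plays an important role in how people perceive and respond to environments containing potential risks. This study investigates whether large vision-language models can identify the same regions of a scene that attract human attention in safety-relevant environments. Eye-tracking data were collected with Pupil Invisible wearable glasses from ten participants who viewed 33 scene images representing environments with varying levels of potential risk. Gaze coordinates were mapped onto the stimulus images to generate population-averaged human gaze heatmaps. In parallel, GPT-4o was prompted through the OpenAI Vision Application Programming Interface (API) to generate spatial predictions of visual attention, which were converted into saliency maps for comparison with human gaze patterns. Spatial alignment was evaluated with four metrics: Pearson correlation (r = 0.515 +- 0.117), Normalised Scanpath Saliency (NSS = 0.988 +- 0.323), Kullback-Leibler divergence (KL = 1.766 +- 0.844), and Area Under the Receiver Operating Characteristic Curve using the Judd formulation (AUC-Judd = 0.806 +- 0.076). In a cross-model comparison with Gemini Pro, Gemini Flash, and Claude, all models exceeded the AUC-Judd chance baseline of 0.5 and achieved positive NSS scores. Gemini Pro showed the strongest spatial localisation on three of the four metrics, whereas GPT-4o produced the closest distributional match to human attention as measured by KL divergence. These results suggest that large vision-language models can identify regions that broadly correspond to where humans direct visual attention in safety-relevant scenes, without requiring eye-tracking training data. The results highlight the potential of vision-language models as a scalable tool for approximating human attentional patterns.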
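The four metrics named in the abstract can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the authors' evaluation code: maps are 2-D lists of non-negative floats, fixations are distinct `(x, y)` pixel pairs, KL is taken as KL(gaze || saliency) with a small `eps` for numerical stability, and AUC-Judd uses thresholds at the saliency values of fixated pixels without the tie-breaking jitter that some implementations add. All function names are hypothetical.

```python
import math


def _flat(grid):
    """Flatten a 2-D list into a 1-D list in row-major order."""
    return [v for row in grid for v in row]


def pearson_cc(saliency, gaze):
    """Pearson correlation between two maps of equal shape."""
    s, g = _flat(saliency), _flat(gaze)
    ms, mg = sum(s) / len(s), sum(g) / len(g)
    cov = sum((a - ms) * (b - mg) for a, b in zip(s, g))
    var_s = sum((a - ms) ** 2 for a in s)
    var_g = sum((b - mg) ** 2 for b in g)
    return cov / math.sqrt(var_s * var_g)


def nss(saliency, fixations):
    """Mean z-scored saliency value at the fixated pixels."""
    s = _flat(saliency)
    mean = sum(s) / len(s)
    std = math.sqrt(sum((v - mean) ** 2 for v in s) / len(s))
    return sum((saliency[y][x] - mean) / std for x, y in fixations) / len(fixations)


def kl_divergence(saliency, gaze, eps=1e-12):
    """KL(gaze || saliency) after normalising both maps to sum to 1."""
    s, g = _flat(saliency), _flat(gaze)
    ts, tg = sum(s), sum(g)
    return sum(
        (q / tg) * math.log(eps + (q / tg) / (p / ts + eps))
        for p, q in zip(s, g)
    )


def auc_judd(saliency, fixations):
    """Area under the ROC curve with thresholds at fixated saliency values."""
    fix = set(fixations)
    s = _flat(saliency)
    thresholds = sorted((saliency[y][x] for x, y in fix), reverse=True)
    tp, fp = [0.0], [0.0]
    for i, t in enumerate(thresholds, start=1):
        above = sum(1 for v in s if v >= t)           # pixels at or above threshold
        tp.append(i / len(fix))                        # fixated pixels recovered
        fp.append((above - i) / (len(s) - len(fix)))   # non-fixated pixels included
    tp.append(1.0)
    fp.append(1.0)
    # Trapezoidal integration of the true-positive rate over the false-positive rate.
    return sum(
        (fp[k + 1] - fp[k]) * (tp[k + 1] + tp[k]) / 2.0
        for k in range(len(fp) - 1)
    )


# Toy 2x2 example: the model puts most mass where the human gaze is.
gaze_map = [[0.7, 0.1], [0.1, 0.1]]
model_map = [[0.6, 0.2], [0.1, 0.1]]
scores = {
    "CC": pearson_cc(model_map, gaze_map),
    "NSS": nss(model_map, [(0, 0)]),
    "KL": kl_divergence(model_map, gaze_map),
    "AUC-Judd": auc_judd(model_map, [(0, 0)]),
}
```

Higher is better for CC, NSS, and AUC-Judd, while lower is better for KL, which is why a model can lead on three metrics and still trail on the fourth, as the abstract reports.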

Related papers

OmniGaze: Reward-inspired Generalizable Gaze Estimation In The Wild [104.57404324262556]
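Current 3D gaze estimation methods struggle to generalize across diverse data domains. OmniGaze is a semi-supervised framework for 3D gaze estimation. OmniGaze achieves state-of-the-art performance on five datasets.
Paper reference translation (metadata) (2025-10-15T15:19:52Z)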
Foraging with the Eyes: Dynamics in Human Visual Gaze and Deep Predictive Modeling [0.0]
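Animals often forage via Lévy walks, whose heavy-tailed step lengths are optimized for resource-sparse environments. The paper shows that human visual gaze follows similar dynamics when scanning images. It presents new evidence that human visual exploration obeys the statistical laws of natural foraging, opening the way to gaze modeling with generative and predictive frameworks.
Paper reference translation (metadata) (2025-10-10T11:45:51Z)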
Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
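VisFactor is a benchmark that digitizes 20 vision-centric subtests from well-established cognitive psychology assessments. It evaluates 20 frontier multimodal large language models (MLLMs) from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model scores only 25.19 out of 100 and consistently fails on tasks such as mental rotation, spatial relation inference, and figure discrimination.
Paper reference translation (metadata) (2025-02-23T04:21:32Z)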
Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
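ConCue is a new approach for improving visual feature extraction in HOI detection. The authors develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both the instance and interaction detectors.
Paper reference translation (metadata) (2023-11-26T09:11:32Z)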
CUEING: a lightweight model to Capture hUman attEntion In driviNG [6.310770791023399]
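This paper proposes an adaptive cleansing method that removes noise from existing gaze datasets, together with a robust, lightweight self-attention-based gaze prediction model. The approach not only improves model generalizability and performance by up to 12.13% but also reduces model complexity by up to 98.2% compared with the state of the art.
Paper reference translation (metadata) (2023-05-25T04:44:50Z)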
Active Gaze Control for Foveal Scene Exploration [124.11737060344052]
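This work proposes a method that emulates how humans and robots equipped with foveal cameras explore a scene. The method increases the F1 score by 2 to 3 points for the same number of gaze shifts.
Paper reference translation (metadata) (2022-08-24T14:59:28Z)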
Improving saliency models' predictions of the next fixation with humans' intrinsic cost of gaze shifts [6.315366433343492]
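The authors develop a principled framework for predicting the next gaze target and for empirically measuring the human cost of gaze shifts. They provide an implementation of human gaze preferences that can be used to improve any saliency model's predictions of humans' next gaze target.
Paper reference translation (metadata) (2022-07-09T11:21:13Z)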
TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
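TRiPOD is a new method for forecasting body dynamics based on graph attentional networks. To incorporate real-world challenges, it learns an indicator of whether the body joints estimated in each frame are visible or invisible. Evaluation results show that TRiPOD outperforms prior methods designed specifically for each of the trajectory and pose forecasting tasks.
Paper reference translation (metadata) (2021-04-08T20:01:00Z)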
360-Degree Gaze Estimation in the Wild Using Multiple Zoom Scales [26.36068336169795]
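The authors develop a model that mimics the human ability to estimate gaze from focused looks at the face. The model does not need to extract clear eye patches. They extend the model to address the challenge of 360-degree gaze estimation.
Paper reference translation (metadata) (2020-09-15T08:45:12Z)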

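The related papers list is generated automatically from the titles and abstracts of the papers on this site.

This page provides information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from the use of this site (including all information and translations).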