Fugu-MT paper translation (abstract): From Scene to Object: Text-Guided Dual-Gaze Prediction

Paper overview: From Scene to Object: Text-Guided Dual-Gaze Prediction

arxiv url: http://arxiv.org/abs/2604.20191v2
Date: Tue, 28 Apr 2026 03:54:42 GMT
Status: Translation complete
Last updated in system: 2026-04-29 14:06:43.786217
Title: From Scene to Object: Text-Guided Dual-Gaze Prediction
Authors: Zehong Ke, Yanbo Jiang, Jinhao Li, Zhiyuan Liu, Yiqian Tu, Qingwen Meng, Heye Huang, Jianqiang Wang
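Abstract summary: Interpretable driver attention prediction is crucial for human-like autonomous driving. Existing datasets provide only scene-level global gaze rather than fine-grained object-level annotations. This paper proposes a novel dual-branch gaze prediction framework that establishes a complete paradigm from data construction to model architecture.
Reference score (independently computed attention score): 17.32439183328327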
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Interpretable driver attention prediction is crucial for human-like autonomous driving. However, existing datasets provide only scene-level global gaze rather than fine-grained object-level annotations, inherently failing to support text-grounded cognitive modeling. Consequently, while Vision-Language Models (VLMs) hold great potential for semantic reasoning, this critical data limitation leads to severe text-vision decoupling and visual-bias hallucinations. To break this bottleneck and achieve precise object-level attention prediction, this paper proposes a novel dual-branch gaze prediction framework, establishing a complete paradigm from data construction to model architecture. First, we construct G-W3DA, an object-level driver attention dataset. By integrating a multimodal large language model with the Segment Anything Model 3 (SAM3), we decouple macroscopic heatmaps into object-level masks under rigorous cross-validation, fundamentally eliminating annotation hallucinations. Building upon this high-quality data foundation, we propose the DualGaze-VLM architecture. This architecture extracts the hidden states of semantic queries and dynamically modulates visual features via a Condition-Aware SE-Gate, achieving intent-driven precise spatial anchoring. Extensive experiments on the W3DA benchmark demonstrate that DualGaze-VLM consistently surpasses existing state-of-the-art (SOTA) models in spatial alignment metrics, notably achieving up to a 17.8% improvement in Similarity (SIM) under safety-critical scenarios. Furthermore, a visual Turing test reveals that the attention heatmaps generated by DualGaze-VLM are perceived as authentic by 88.22% of human evaluators, proving its capability to generate rational cognitive priors.
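The Similarity (SIM) metric mentioned in the abstract is commonly defined in saliency evaluation as the histogram intersection of two heatmaps, each normalized to sum to 1. The sketch below illustrates that standard definition only; the abstract does not describe the paper's exact evaluation protocol, so the function names (`normalize`, `similarity`) and the toy heatmaps are assumptions made for illustration, and the code is kept dependency-free on purpose.

```python
from typing import Sequence

Heatmap = Sequence[Sequence[float]]


def normalize(heatmap: Heatmap) -> list[list[float]]:
    """Scale a non-negative heatmap so that its values sum to 1."""
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        raise ValueError("heatmap must contain positive mass")
    return [[value / total for value in row] for row in heatmap]


def similarity(pred: Heatmap, target: Heatmap) -> float:
    """Histogram-intersection SIM: sum of element-wise minima of the normalized maps."""
    p = normalize(pred)
    q = normalize(target)
    if len(p) != len(q) or any(len(a) != len(b) for a, b in zip(p, q)):
        raise ValueError("heatmaps must have the same shape")
    return sum(
        min(a, b)
        for row_p, row_q in zip(p, q)
        for a, b in zip(row_p, row_q)
    )


# Toy 2x2 heatmaps (hypothetical values, not data from the paper).
predicted = [[0.0, 2.0], [1.0, 1.0]]
ground_truth = [[0.0, 1.0], [1.0, 0.0]]
print(similarity(predicted, ground_truth))
```

SIM ranges from 0 (no overlap between the two distributions) to 1 (identical distributions), so a higher score means the predicted attention map is spatially closer to the human gaze map.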

Related papers list