Fugu-MT paper translation (abstract): iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning

Paper overview: iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning

arxiv url: http://arxiv.org/abs/2509.19552v2
Date: Wed, 01 Oct 2025 06:54:44 GMT
Status: Translation complete
Last updated in system: 2025-10-02 17:16:29.748822
Title: iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning
Title (reference translation): iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning
Authors: Manyi Yao, Bingbing Zhuang, Sparsh Garg, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker, Abhishek Aich
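Abstract summary: iFinder is a semantic grounding framework that converts dash-cam videos into a hierarchical, interpretable data structure for large language models. It operates as a training-free pipeline that uses pretrained vision models to extract critical cues, and it significantly outperforms end-to-end V-VLMs on four zero-shot driving benchmarks.
Reference score (attention score computed by this site): 51.15353027471834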
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing video-based vision-language models (V-VLMs) struggle with spatial reasoning, causal inference, and explainability of events in the input video. To this end, we introduce iFinder, a structured semantic grounding framework that decouples perception from reasoning by translating dash-cam videos into a hierarchical, interpretable data structure for LLMs. iFinder operates as a modular, training-free pipeline that employs pretrained vision models to extract critical cues -- object pose, lane positions, and object trajectories -- which are hierarchically organized into frame- and video-level structures. Combined with a three-block prompting strategy, it enables step-wise, grounded reasoning for the LLM to refine a peer V-VLM's outputs and provide accurate reasoning. Evaluations on four public dash-cam video benchmarks show that iFinder's proposed grounding with domain-specific cues, especially object orientation and global context, significantly outperforms end-to-end V-VLMs on four zero-shot driving benchmarks, with up to 39% gains in accident reasoning accuracy. By grounding LLMs with driving domain-specific representations, iFinder offers a zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for post-hoc driving video understanding.
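Abstract (reference translation): Grounding large language models (LLMs) in domain-specific tasks such as post-hoc analysis of dash-cam driving video is difficult because of their general-purpose training and lack of structured inductive biases. Because vision is often the only modality available for such analysis (no LiDAR, GPS, etc.), existing video-based vision-language models (V-VLMs) struggle with spatial reasoning, causal inference, and explainability of events in the input video. This work introduces iFinder, a structured semantic grounding framework that separates perception from reasoning by converting dash-cam videos into a hierarchical, interpretable data structure for LLMs. iFinder operates as a modular, training-free pipeline that uses pretrained vision models to extract critical cues (object pose, lane positions, and object trajectories), which are organized hierarchically into frame-level and video-level structures. Combined with a three-block prompting strategy, it enables step-wise, grounded reasoning in which the LLM refines a peer V-VLM's outputs and provides accurate reasoning. Evaluations on four public dash-cam video benchmarks show that iFinder's grounding with domain-specific cues, especially object orientation and global context, significantly outperforms end-to-end V-VLMs in the zero-shot setting, with gains of up to 39% in accident reasoning accuracy. By grounding LLMs with driving-specific representations, iFinder offers a zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for post-hoc driving video understanding.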
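The abstract describes cues that are organized into frame-level and video-level structures before being handed to an LLM. The following is a minimal Python sketch of what such a hierarchy could look like, not the authors' implementation. Every class name, field name, and the JSON layout is hypothetical; the abstract specifies only the cue types (object pose, lane positions, object trajectories) and the two levels.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectCue:
    """Cues for one object in one frame (hypothetical field names)."""
    track_id: int
    category: str
    center_xy: Tuple[float, float]  # image-plane position in pixels
    heading_deg: float              # orientation estimate for the object
    lane: str                       # assumed labels such as "ego" or "left"


@dataclass
class FrameRecord:
    """Frame-level structure: all object cues seen in a single frame."""
    frame_index: int
    objects: List[ObjectCue] = field(default_factory=list)


@dataclass
class VideoRecord:
    """Video-level structure built on top of the frame records."""
    frames: List[FrameRecord] = field(default_factory=list)

    def trajectories(self) -> Dict[int, List[Tuple[float, float]]]:
        """Group per-frame positions by track id, in frame order."""
        tracks: Dict[int, List[Tuple[float, float]]] = {}
        for frame in sorted(self.frames, key=lambda f: f.frame_index):
            for obj in frame.objects:
                tracks.setdefault(obj.track_id, []).append(obj.center_xy)
        return tracks

    def to_prompt_json(self) -> str:
        """Serialize both levels as JSON text that could be pasted into a prompt."""
        payload = {
            "video_level": {
                "trajectories": {str(k): v for k, v in self.trajectories().items()}
            },
            "frame_level": [asdict(f) for f in self.frames],
        }
        return json.dumps(payload, indent=2)


if __name__ == "__main__":
    demo = VideoRecord(frames=[
        FrameRecord(0, [ObjectCue(7, "car", (100.0, 50.0), 90.0, "left")]),
        FrameRecord(1, [ObjectCue(7, "car", (110.0, 52.0), 88.0, "left")]),
    ])
    print(demo.to_prompt_json())
```

Keeping the perception output as plain structured text is one way to let an LLM reason over it without any additional training, which matches the training-free framing in the abstract. The three-block prompting strategy is not detailed in the abstract, so it is left out of this sketch.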

List of related papers