Fugu-MT Paper Translation (Summary): Bridging Vision Language Models and Symbolic Grounding for Video Question Answering

Paper summary: Bridging Vision Language Models and Symbolic Grounding for Video Question Answering

arxiv url: http://arxiv.org/abs/2509.11862v1
Date: Mon, 15 Sep 2025 12:35:56 GMT
Status: Translation complete
Last updated in system: 2025-09-16 17:26:23.280831
Title: Bridging Vision Language Models and Symbolic Grounding for Video Question Answering
Authors: Haodi Ma, Vyom Pathak, Daisy Zhe Wang
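Abstract summary: Video Question Answering (VQA) requires models to reason over spatial, temporal, and causal cues in videos. Recent vision language models (VLMs) achieve strong results but often rely on shallow correlations, leading to weak temporal grounding and limited interpretability. We study symbolic scene graphs (SGs) as intermediate grounding signals for VQA. We introduce SG-VLM, a modular framework that integrates frozen VLMs with scene graph grounding via prompting and visual localization.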
Reference score (independently computed attention measure): 4.215692222461999
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Question Answering (VQA) requires models to reason over spatial, temporal, and causal cues in videos. Recent vision language models (VLMs) achieve strong results but often rely on shallow correlations, leading to weak temporal grounding and limited interpretability. We study symbolic scene graphs (SGs) as intermediate grounding signals for VQA. SGs provide structured object-relation representations that complement VLMs' holistic reasoning. We introduce SG-VLM, a modular framework that integrates frozen VLMs with scene graph grounding via prompting and visual localization. Across three benchmarks (NExT-QA, iVQA, ActivityNet-QA) and multiple VLMs (QwenVL, InternVL), SG-VLM improves causal and temporal reasoning and outperforms prior baselines, though gains over strong VLMs are limited. These findings highlight both the promise and current limitations of symbolic grounding, and offer guidance for future hybrid VLM-symbolic approaches in video understanding.
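Illustrative sketch (not from the paper): the abstract states that scene graphs supply structured object-relation representations and that they reach a frozen VLM via prompting, but it does not give the prompt format. The Python below shows one plausible way to serialize relation triples into text ahead of a question. The `Relation` type, the `[frame N] subject -predicate-> object` layout, and the function names are all hypothetical, and the visual localization step is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One subject-predicate-object edge observed in a given frame (hypothetical type)."""
    frame: int
    subject: str
    predicate: str
    obj: str


def scene_graph_to_text(relations):
    """Serialize relations in temporal order, one line per edge."""
    ordered = sorted(relations, key=lambda r: r.frame)
    return "\n".join(
        f"[frame {r.frame}] {r.subject} -{r.predicate}-> {r.obj}" for r in ordered
    )


def build_prompt(question, relations):
    """Prepend the serialized scene graph to the question (assumed prompt layout)."""
    return (
        "Scene graph:\n"
        f"{scene_graph_to_text(relations)}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy relations, invented for illustration only.
relations = [
    Relation(12, "child", "holding", "ball"),
    Relation(3, "child", "standing_on", "grass"),
    Relation(20, "child", "throwing", "ball"),
]
prompt = build_prompt("What does the child do after holding the ball?", relations)
print(prompt)
```

Sorting by frame index keeps the temporal order explicit in the text, which is the kind of cue the abstract says VLMs tend to ground weakly. The resulting string would then be passed, together with the video frames, to whichever frozen VLM is in use.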


Related papers