Fugu-MT Paper Translation (Abstract): NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning

Paper overview: NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning

arxiv url: http://arxiv.org/abs/2509.18041v1
Date: Mon, 22 Sep 2025 17:15:13 GMT
Status: Translation complete
Last updated in system: 2025-09-23 18:58:16.53186
Title: NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning
Title (reference translation): NeuS-QA: Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning
Authors: Sahil Shah, S P Sharan, Harsh Goel, Minkyu Choi, Mustafa Munir, Manvik Pasula, Radu Marculescu, Sandeep Chinchali
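Abstract summary: Long-Form Video Question Answering (LVQA) poses challenges beyond traditional visual question answering (VQA). Vanilla approaches sample frames uniformly and feed them to a VLM together with the question, which incurs significant token overhead. NeuS-QA translates a natural language question into a formal temporal logic expression and constructs a video automaton from frame-level semantic propositions.
Reference score (independently computed attention measure): 25.109179044490844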
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long-Form Video Question Answering (LVQA) poses challenges beyond traditional visual question answering (VQA), which is often limited to static images or short video clips. While current vision-language models (VLMs) perform well in those settings, they struggle with complex queries in LVQA over long videos involving multi-step temporal reasoning and causality. Vanilla approaches, which sample frames uniformly and feed them to a VLM with the question, incur significant token overhead, forcing severe downsampling. As a result, the model often misses fine-grained visual structure, subtle event transitions, or key temporal cues, ultimately leading to incorrect answers. To address these limitations, recent works have explored query-adaptive frame sampling, hierarchical keyframe selection, and agent-based iterative querying. However, these methods remain fundamentally heuristic: they lack explicit temporal representations and cannot enforce or verify logical event relationships. As a result, there are no formal guarantees that the sampled context actually encodes the compositional or causal logic demanded by the question. To address these foundational gaps, we introduce NeuS-QA, a training-free, plug-and-play neuro-symbolic pipeline for LVQA. NeuS-QA translates a natural language question into a formal temporal logic expression, constructs a video automaton from frame-level semantic propositions, and applies model checking to rigorously identify video segments satisfying the question's logical requirements. Only these logic-verified segments are submitted to the VLM, thus improving interpretability, reducing hallucinations, and enabling compositional reasoning without modifying or fine-tuning the model. Experiments on LongVideoBench and CinePile show NeuS-QA improves performance by over 10%, especially on questions involving event ordering, causality, and multi-step compositional reasoning.
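Abstract (reference translation): Long-Form Video Question Answering (LVQA) raises challenges that go beyond traditional visual question answering (VQA), which is limited to static images or short video clips. Current vision-language models (VLMs) work well in those settings, but they struggle with complex LVQA queries over long videos that involve multi-step temporal reasoning and causality. Vanilla approaches, which sample frames uniformly and feed them to a VLM, incur significant token overhead and force severe downsampling. As a result, the model often misses fine-grained visual structure, subtle event transitions, or key temporal cues, which ultimately leads to incorrect answers. To address these limitations, recent studies have examined query-adaptive frame sampling, hierarchical keyframe selection, and agent-based iterative querying. However, these methods are fundamentally heuristic: they have no explicit temporal representation and cannot enforce or verify logical relationships between events. Consequently, there is no formal guarantee that the sampled context actually encodes the compositional or causal logic required by the question. To address these foundational gaps, the authors introduce NeuS-QA, a training-free, plug-and-play neuro-symbolic pipeline for LVQA. NeuS-QA translates a natural language question into a formal temporal logic expression, constructs a video automaton from frame-level semantic propositions, and applies model checking to rigorously identify the video segments that satisfy the question's logical requirements. Only these logic-verified segments are sent to the VLM, which improves interpretability, reduces hallucinations, and enables compositional reasoning without modifying or fine-tuning the model. In experiments on LongVideoBench and CinePile, NeuS-QA improves performance by over 10%, especially on questions involving event ordering, causality, and multi-step compositional reasoning.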
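Illustrative sketch: the abstract describes selecting only the video segments that satisfy a temporal logic specification built from frame-level propositions. The Python below is a toy illustration of that idea, not the authors' implementation. The proposition labels, the function names, and the single hard-coded pattern "a holds, and b holds at the same or a later frame" (written F(a ∧ F b) in temporal logic) are all assumptions made for this example, and a direct scan over frames stands in for real model checking over a video automaton.

```python
from typing import List, Optional, Set, Tuple

Frame = Set[str]  # propositions detected in one sampled frame (hypothetical labels)


def merge_overlapping(segments: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge frame intervals that overlap or share an endpoint."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def segments_a_then_b(frames: List[Frame], a: str, b: str) -> List[Tuple[int, int]]:
    """Return merged intervals (i, j) such that `a` holds at frame i and
    j is the nearest frame at or after i where `b` holds.

    Each interval is a witness for the pattern F(a & F b).
    """
    # Backward pass: nearest_b[i] is the closest index >= i where b holds, if any.
    nearest_b: List[Optional[int]] = [None] * len(frames)
    next_b: Optional[int] = None
    for idx in range(len(frames) - 1, -1, -1):
        if b in frames[idx]:
            next_b = idx
        nearest_b[idx] = next_b

    # Forward pass: every frame where a holds and b follows yields one interval.
    segments = [
        (i, nearest_b[i])
        for i, props in enumerate(frames)
        if a in props and nearest_b[i] is not None
    ]
    return merge_overlapping(segments)


# Six sampled frames; only the first "door_open" is followed by "person_enters".
frames = [{"door_open"}, set(), {"person_enters"}, set(), {"door_open"}, set()]
print(segments_a_then_b(frames, "door_open", "person_enters"))  # [(0, 2)]
```

In this toy setting, only frames 0 through 2 would be passed on to the VLM; the second "door_open" at frame 4 is discarded because no "person_enters" follows it.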


Related papers