Fugu-MT Paper Translation (Abstract): UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks

Paper overview: UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks

arxiv url: http://arxiv.org/abs/2604.23145v1
Date: Sat, 25 Apr 2026 05:07:43 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.180168
Title: UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks
Authors: Jason Nguyen, Ameet Rao, Alexander Chang, Ishaan Kumar, Erin Tan
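Abstract summary: Video Question Answering (VideoQA) demands models that jointly reason over spatial, temporal, and linguistic cues. Large reasoning models (LRMs) explicitly generate intermediate logical steps that enhance interpretability. The paper proposes UpstreamQA, a modular framework that disentangles and evaluates core video reasoning components through explicit upstream reasoning modules.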
Reference score (attention score computed by this site): 37.724232080494424
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Question Answering (VideoQA) demands models that jointly reason over spatial, temporal, and linguistic cues. However, the task's inherent complexity often requires multi-step reasoning that current large multimodal models (LMMs) perform implicitly, leaving their internal decision process opaque. In contrast, large reasoning models (LRMs) explicitly generate intermediate logical steps that enhance interpretability and can improve multi-hop reasoning accuracy. Yet, these models are not designed for native video understanding, as they typically rely on static frame sampling. We propose UpstreamQA, a modular framework that disentangles and evaluates core video reasoning components through explicit upstream reasoning modules. Specifically, we employ multimodal LRMs to perform object identification and scene context generation before passing enriched reasoning traces to downstream LMMs for VideoQA. We evaluate UpstreamQA on the OpenEQA and NExTQA datasets using two LRMs (o4-mini, Gemini 2.5 Pro) and two LMMs (GPT-4o, Gemini 2.5 Flash). Our results demonstrate that introducing explicit reasoning can significantly boost performance and interpretability of downstream VideoQA, but can also lead to performance degradation when baseline performance is sufficiently high. Overall, UpstreamQA offers a principled framework for combining explicit reasoning and multimodal understanding, advancing both performance and diagnostic transparency in VideoQA in several scenarios.
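The upstream/downstream split described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` callable type, the prompt wording, the `UpstreamTrace` container, and every function name are hypothetical, and the abstract does not say whether the downstream LMM also receives the sampled frames (here it is assumed that it does).

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: a model is any callable that takes a text prompt
# and a list of sampled frame references and returns text. Real LRM/LMM
# API clients would need to be wrapped to fit this shape.
Model = Callable[[str, List[str]], str]


@dataclass
class UpstreamTrace:
    """Text produced by the two upstream reasoning modules (names assumed)."""
    objects: str        # output of the object-identification step
    scene_context: str  # output of the scene-context step


def run_upstream(lrm: Model, frames: List[str]) -> UpstreamTrace:
    # Two separate LRM calls, one per upstream module. Prompts are illustrative.
    objects = lrm("List the objects visible in these frames.", frames)
    scene = lrm("Describe the scene and how it changes over time.", frames)
    return UpstreamTrace(objects=objects, scene_context=scene)


def build_downstream_prompt(question: str, trace: UpstreamTrace) -> str:
    # Prepend the upstream reasoning trace to the question for the LMM.
    return (
        f"Objects identified upstream:\n{trace.objects}\n\n"
        f"Scene context from upstream:\n{trace.scene_context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(lrm: Model, lmm: Model, frames: List[str], question: str,
           use_upstream: bool = True) -> str:
    # use_upstream=False gives a plain LMM baseline for comparison.
    if not use_upstream:
        return lmm(f"Question: {question}\nAnswer:", frames)
    trace = run_upstream(lrm, frames)
    return lmm(build_downstream_prompt(question, trace), frames)
```

Keeping the baseline path in the same function makes it straightforward to compare answers with and without the upstream trace, which matters given the abstract's note that explicit reasoning can degrade performance when the baseline is already strong.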

Related papers