Fugu-MT Paper Translation (Abstract): Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning

Paper overview: Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning

arxiv url: http://arxiv.org/abs/2603.15008v1
Date: Mon, 16 Mar 2026 09:15:12 GMT
Status: Translation complete
System update date: 2026-03-17 18:28:57.905739
Title: Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning
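Title (reference translation): Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning
Authors: Kaixin Zhang, Xiaohe Li, Jiahao Li, Haohua Wu, Xinyu Zhao, Zide Fan, Lei Wang
Abstract summary: This work bridges the perception-to-generation gap in MLLM video understanding and provides an interpretable, faithful reasoning paradigm for VideoQA applications. Inspired by hierarchical human visual cognition, it proposes ClueNet.
Reference score (independently computed attention level): 14.945921705882725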
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, yet Video Question Answering (VideoQA) remains challenging due to its demand for temporal causal reasoning and evidence-grounded answer generation. Prevailing end-to-end MLLM frameworks lack explicit structured reasoning between visual perception and answer derivation, causing severe hallucinations and poor interpretability. Existing methods also fail to address three core gaps: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment. Inspired by hierarchical human visual cognition, we propose ClueNet, a clue-aware video reasoning framework with a two-stage supervised fine-tuning paradigm without extensive base model modifications. Decoupled supervision aligns clue extraction and chain-based reasoning, while inference supervision with an adaptive clue filter refines high-order reasoning, alongside lightweight modules for efficient inference. Experiments on NExT-QA, STAR, and MVBench show that ClueNet outperforms state-of-the-art methods by $\ge$ 1.1%, with superior generalization, hallucination mitigation, inference efficiency, and cross-backbone compatibility. This work bridges the perception-to-generation gap in MLLM video understanding, providing an interpretable, faithful reasoning paradigm for high-stakes VideoQA applications.
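Abstract (reference translation): Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, but Video Question Answering (VideoQA) remains difficult because it demands temporal causal reasoning and evidence-grounded answer generation. Prevailing end-to-end MLLM frameworks lack explicit structured reasoning between visual perception and answer derivation, which causes severe hallucinations and poor interpretability. Existing methods also leave three core gaps unaddressed: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment. Inspired by hierarchical human visual cognition, we propose ClueNet, a clue-aware video reasoning framework that uses a two-stage supervised fine-tuning paradigm and requires no extensive modification of the base model. Decoupled supervision aligns clue extraction with chain-based reasoning, while inference supervision with an adaptive clue filter refines high-order reasoning, alongside lightweight modules for efficient inference. Experiments on NExT-QA, STAR, and MVBench show that ClueNet outperforms state-of-the-art methods by $\ge$ 1.1%, with superior generalization, hallucination mitigation, inference efficiency, and cross-backbone compatibility. This work bridges the perception-to-generation gap in MLLM video understanding and provides an interpretable, faithful reasoning paradigm for high-stakes VideoQA applications.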
