Fugu-MT Paper Translation (Abstract): Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Paper overview: Watch, Remember, Reason: Human-View Video Understanding with MLLMs

arxiv url: http://arxiv.org/abs/2606.07433v1
Date: Fri, 05 Jun 2026 16:29:13 GMT
Status: Translation complete
System update date: 2026-06-08 14:33:29.84822
Title: Watch, Remember, Reason: Human-View Video Understanding with MLLMs
Title (reference translation): Watch, Remember, Reason: Understanding Video from a Human Viewpoint with MLLMs
Authors: Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li, Jason Li, Lingdong Kong, Haochen Wang, Qianyu Zhou, Jiangning Zhang, Guangliang Cheng, Yunhai Tong, Lu Qi, Minghsuan Yang
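Abstract summary: Video understanding is being rapidly transformed by multimodal large language models (MLLMs). This work presents a human-view perspective on LLM-based video understanding. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence.
Reference score (attention score computed independently by the site): 115.44608894992399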
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.
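Abstract (reference translation): As research moves from short clips to long, multimodal, and knowledge-intensive video scenarios, video understanding is being rapidly transformed by multimodal large language models (MLLMs). These scenarios call for models that can handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under a limited computational budget. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Instead of treating video tasks as isolated benchmarks, this view offers a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. The paper introduces a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, it identifies challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by the roles they play in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. The paper further examines application domains such as egocentric, sports, instructional, medical, and narrative videos, and covers training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, it outlines open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.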

Related papers