Fugu-MT Paper Translation (Abstract): Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Paper overview: Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

arxiv url: http://arxiv.org/abs/2510.05034v1
Date: Mon, 06 Oct 2025 17:10:44 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:53:00.008867
Title: Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Title (reference translation): Video-LMM Post-Training: Deepening Video Reasoning with Large Multimodal Models
Authors: Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu
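Abstract summary: Video understanding represents the most challenging frontier in computer vision. Recently, Video-Large Multimodal Models (Video-LMMs) have demonstrated remarkable capabilities in video understanding tasks. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities.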
Reference score (independently computed attention score): 79.10678768386752
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
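Abstract (reference translation): Video understanding is the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. Recently, Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, have emerged and demonstrated remarkable capabilities in video understanding tasks. However, post-training, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, remains fragmented across the literature. This survey comprehensively examines post-training methodologies for Video-LMMs, covering three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize design principles, insights, and evaluation protocols, and identify critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training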


Related papers