Fugu-MT Paper Translation (Abstract): Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

Paper overview: Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2603.17541v1
Date: Wed, 18 Mar 2026 09:46:44 GMT
Status: Translation complete
Last updated in system: 2026-03-19 18:32:57.618545
Title: Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models
Authors: Linghao Zhang, Jungang Li, Yonghua Hei, Sicheng Tao, Song Dai, Yibo Yan, Zihao Dongfang, Weiting Liu, Chenxi Qin, Hanqian Li, Xin Zou, Jiahao Zhang, Shuhang Xun, Haiyun Jiang, Xuming Hu
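Abstract summary: We systematically study how Video-SFT affects the visual capabilities of MLLMs. Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off.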
Reference score (independently computed attention score): 36.34630132055548
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and preserving spatial understanding remains a central challenge in joint image-video training.
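The abstract does not say how the Hybrid-Frame strategy decides frame counts, so the following is only a rough illustration of what "instruction-aware" frame allocation could look like, not the authors' method. The keyword list, the budgets of 8 and 32 frames, and the function names `choose_frame_count` and `uniform_frame_indices` are all assumptions made for this sketch.

```python
import re
from typing import List

# Hypothetical cue words; the paper's actual allocation criterion is not given in the abstract.
TEMPORAL_CUES = {"before", "after", "then", "while", "during", "sequence", "order"}


def choose_frame_count(instruction: str, low: int = 8, high: int = 32) -> int:
    """Return a larger frame budget when the instruction contains a temporal cue word."""
    words = set(re.findall(r"[a-z]+", instruction.lower()))
    return high if words & TEMPORAL_CUES else low


def uniform_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Return evenly spaced frame indices, one from the centre of each equal segment."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    question = "What does the person do after opening the door?"
    budget = choose_frame_count(question)
    print(budget, uniform_frame_indices(300, budget)[:4])
```

A spatial question such as "What color is the car?" would fall back to the smaller budget under this heuristic, which mirrors the abstract's observation that more frames help video tasks but do not reliably help static image tasks.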
