Fugu-MT Paper Translation (Summary): TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models

Paper summary: TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models

arxiv url: http://arxiv.org/abs/2602.18884v1
Date: Sat, 21 Feb 2026 16:10:52 GMT
Status: Translation complete
Last updated in system: 2026-02-24 17:42:02.375223
Title: TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models
Title (reference translation): TPRU: Promoting Temporal and Procedural Understanding in Large-Scale Multimodal Models
Authors: Zhenkun Gao, Xuhong Wang, Xin Tan, Yuan Xie
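Abstract summary: This paper introduces TPRU, a large-scale dataset sourced from diverse embodied scenarios. TPRU is systematically designed to cultivate temporal reasoning through three complementary tasks. We use TPRU with a reinforcement learning (RL) fine-tuning methodology aimed at improving resource-efficient models.
Reference score (independently computed attention score): 16.203071396170284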
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs), particularly smaller, deployable variants, exhibit a critical deficiency in understanding temporal and procedural visual data, a bottleneck hindering their application in real-world embodied AI. This gap is largely caused by a systemic failure in training paradigms, which lack large-scale, procedurally coherent data. To address this problem, we introduce TPRU, a large-scale dataset sourced from diverse embodied scenarios such as robotic manipulation and GUI navigation. TPRU is systematically designed to cultivate temporal reasoning through three complementary tasks: Temporal Reordering, Next-Frame Prediction, and Previous-Frame Review. A key feature is the inclusion of challenging negative samples, compelling models to transition from passive observation to active, cross-modal validation. We leverage TPRU with a reinforcement learning (RL) fine-tuning methodology, specifically targeting the enhancement of resource-efficient models. Experiments show our approach yields dramatic gains: on our manually curated TPRU-Test, the accuracy of TPRU-7B soars from 50.33% to 75.70%, a state-of-the-art result that significantly outperforms vastly larger baselines, including GPT-4o. Crucially, these capabilities generalize effectively, demonstrating substantial improvements on established benchmarks. The codebase is available at https://github.com/Stephen-gzk/TPRU/ .
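Abstract (reference translation): Multimodal Large Language Models (MLLMs), particularly smaller, deployable variants, show a critical deficiency in understanding temporal and procedural visual data, a bottleneck that hinders their application in real-world embodied AI. This gap is caused mainly by a systemic failure of training paradigms, which lack large-scale, procedurally coherent data. To address this problem, we introduce TPRU, a large-scale dataset built from diverse embodied scenarios such as robotic manipulation and GUI navigation. TPRU is systematically designed to cultivate temporal reasoning through three complementary tasks: Temporal Reordering, Next-Frame Prediction, and Previous-Frame Review. A key feature is the inclusion of challenging negative samples, which push models to move from passive observation to active cross-modal validation. We use TPRU with a reinforcement learning (RL) fine-tuning methodology aimed at improving resource-efficient models. On our manually curated TPRU-Test, the accuracy of TPRU-7B rises from 50.33% to 75.70%, significantly outperforming much larger baselines, including GPT-4o. Importantly, these capabilities generalize effectively, yielding substantial improvements on established benchmarks. The codebase is available at https://github.com/Stephen-gzk/TPRU/ .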

Related papers