Fugu-MT Paper Translation (Abstract): ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding

Paper overview: ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding

arxiv url: http://arxiv.org/abs/2602.16412v1
Date: Wed, 18 Feb 2026 12:37:35 GMT
Status: Translation complete
Last updated in system: 2026-02-19 15:58:30.59137
Title: ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding
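Title (reference translation): ReMoRa: A Multimodal Large Language Model Based on Refined Motion Representation for Long-Video Understanding
Authors: Daichi Yashima, Shuhei Kurita, Yusuke Oda, Komei Sugiura
Abstract summary: This study focuses on video understanding by multimodal large language models (MLLMs). We propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations. We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks.
Reference score (independently computed attention score): 12.236081012244533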
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge. In this study, we focus on video understanding by MLLMs. This task is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention has quadratic complexity in sequence length. In this paper, we propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To address the noise and low fidelity of block-based motions, we introduce a module that denoises them and generates a fine-grained motion representation. Furthermore, our model compresses these features in a way that scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks. ReMoRa outperformed baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.
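Abstract (reference translation): Multimodal large language models (MLLMs) have achieved remarkable success across a variety of tasks, but long-video understanding remains a major challenge. This study focuses on video understanding by MLLMs. The task is difficult because processing a full stream of RGB frames is computationally intractable and redundant, since self-attention has quadratic complexity in sequence length. We propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, making sequential RGB frames unnecessary. These motion representations serve as a compact proxy for optical flow and capture temporal dynamics without full frame decoding. To address the noise and low fidelity of block-based motion, we introduce a module that denoises it and generates a fine-grained motion representation. In addition, the model compresses these features in a way that scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks. ReMoRa outperformed baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.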
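The cost argument in the abstract can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from the paper: the function names, the keyframe interval, and every token count are hypothetical values chosen only to show why replacing most RGB frames with a compact motion representation shrinks the number of query-key pairs that full self-attention must score.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention over a sequence (quadratic)."""
    return num_tokens * num_tokens


def dense_rgb_tokens(num_frames: int, tokens_per_frame: int) -> int:
    """Sequence length when every frame is encoded as RGB tokens."""
    return num_frames * tokens_per_frame


def keyframe_motion_tokens(
    num_frames: int,
    keyframe_interval: int,
    tokens_per_keyframe: int,
    tokens_per_motion: int,
) -> int:
    """Sequence length when only sparse keyframes keep RGB tokens.

    All remaining frames are assumed to contribute a small, fixed number of
    motion tokens. This layout is an illustrative assumption, not the
    tokenization described in the paper.
    """
    num_keyframes = -(-num_frames // keyframe_interval)  # ceiling division
    num_motion_frames = num_frames - num_keyframes
    return (num_keyframes * tokens_per_keyframe
            + num_motion_frames * tokens_per_motion)


if __name__ == "__main__":
    # Hypothetical settings, chosen only for illustration.
    frames = 1000
    dense = dense_rgb_tokens(frames, tokens_per_frame=100)
    sparse = keyframe_motion_tokens(
        frames,
        keyframe_interval=10,
        tokens_per_keyframe=100,
        tokens_per_motion=4,
    )
    print("dense RGB tokens:", dense, "pairs:", attention_pairs(dense))
    print("keyframe + motion tokens:", sparse, "pairs:", attention_pairs(sparse))
    print("pair reduction factor:", attention_pairs(dense) / attention_pairs(sparse))
```

Because the pair count grows with the square of the sequence length, any reduction in tokens yields a squared reduction in attention cost. The actual savings of ReMoRa depend on its real token budget, which the abstract does not state.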


Related papers