Fugu-MT Paper Translation (Abstract): MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding

Paper abstract: MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding

arxiv url: http://arxiv.org/abs/2510.07915v1
Date: Thu, 09 Oct 2025 08:07:19 GMT
Status: translation complete
System update date: 2025-10-10 17:54:14.947882
Title: MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
Title (reference translation): MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
Authors: Peiran Wu, Zhuorui Yu, Yunze Liu, Chi-Hao Wu, Enmin Zhou, Junxiao Shen
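Abstract summary: We propose MARC, which integrates structured retrieval and RL-based distillation. MARC achieves near-baseline accuracy using only one frame's tokens. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings.
Reference score (independently computed attention score): 13.02027465520324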
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs) still face heavy computational costs when extended from images to videos due to high frame rates and long durations. Token compression is a promising solution, yet most existing training-free methods cause information loss and performance degradation. To overcome this, we propose Memory-Augmented Reinforcement Learning-based Token Compression (MARC), which integrates structured retrieval and RL-based distillation. MARC adopts a retrieve-then-compress strategy using a Visual Memory Retriever (VMR) to select key clips and a Compression Group Relative Policy Optimization (C-GRPO) framework to distil reasoning ability from a teacher to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame's tokens, reducing visual tokens by 95%, GPU memory by 72%, and latency by 23.9%. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.
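Abstract (reference translation): The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs) still incur heavy computational costs when extended from images to videos, because of high frame rates and long durations. Token compression is a promising solution, but most existing training-free methods cause information loss and performance degradation. To address this, we propose Memory-Augmented Reinforcement Learning-based Token Compression (MARC), which integrates structured retrieval and RL-based distillation. MARC adopts a retrieve-then-compress strategy, using a Visual Memory Retriever (VMR) to select key clips and a Compression Group Relative Policy Optimization (C-GRPO) framework to distill reasoning ability from a teacher model to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame's tokens. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.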
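
The abstract's "one frame's tokens" and "95%" figures can be related with simple arithmetic. The sketch below is illustrative only: it assumes every frame contributes the same number of visual tokens, and the function names are hypothetical. The abstract does not state the baseline frame count, so the value of roughly 20 frames is an inference from this assumption, not a reported number.

```python
def visual_token_reduction(baseline_frames: int, kept_frames: int = 1) -> float:
    """Fraction of visual tokens removed when only `kept_frames` frames' worth
    of tokens is kept out of `baseline_frames`.

    Assumption (not from the paper): each frame yields the same token count.
    """
    if baseline_frames <= 0 or not 0 <= kept_frames <= baseline_frames:
        raise ValueError("need 0 <= kept_frames <= baseline_frames and baseline_frames > 0")
    return 1.0 - kept_frames / baseline_frames


def implied_baseline_frames(reduction: float, kept_frames: int = 1) -> float:
    """Baseline frame count implied by a given token reduction fraction,
    under the same equal-tokens-per-frame assumption."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return kept_frames / (1.0 - reduction)


if __name__ == "__main__":
    # A 95% reduction down to one frame's tokens implies about 20 baseline frames.
    print(round(implied_baseline_frames(0.95), 2))
    print(round(visual_token_reduction(20), 2))
```

Under this assumption, keeping one frame's tokens out of about 20 frames matches the reported 95% reduction. The GPU memory and latency savings are smaller than the token reduction, and this sketch does not model them.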

Related papers