Fugu-MT Paper Translation (Abstract): M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

Paper abstract: M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

arxiv url: http://arxiv.org/abs/2606.05008v1
Date: Wed, 03 Jun 2026 15:28:57 GMT
Status: Translation complete
System update date: 2026-06-04 20:44:18.857924
Title: M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks
Title (reference translation): M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively Grounded Video Tasks
Authors: Jie Huang, Ruixun Liu, Sirui Sun, Xinyi Yang, Yin Li, Yixin Zhu, Yiwu Zhong
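Abstract summary: M$^3$Eval is the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Experiments across representative multi-modal models reveal consistent weaknesses and distinctive behaviors. The work highlights memory as a fundamental yet underexplored capability and offers insights for designing more effective memory mechanisms in multi-modal models.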
Reference score (independently computed attention score): 19.25978075323521
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M$^3$Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.
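Abstract (reference translation): As multi-modal models advance toward long-form video understanding, memory emerges as a critical capability. Although substantial effort has gone into developing video datasets and benchmarks, existing work focuses primarily on perception and reasoning and does not systematically evaluate memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M$^3$Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns that differ substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than in the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.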


Related papers