Fugu-MT Paper Translation (Summary): GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

Paper summary: GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

arxiv url: http://arxiv.org/abs/2605.14498v2
Date: Sat, 16 May 2026 21:14:35 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:46.000124
Title: GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
Title (reference translation): GroupMemBench: A Benchmark of LLM Agent Memory in Multi-Party Conversations
Authors: Jingbo Yang, Kwei-Herng Lai, Xiaowen Wang, Shiyu Chang, Yaar Harari, Evgeniy Gabrilovich
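Abstract summary: Large language model (LLM) agents increasingly serve as personal assistants and workplace collaborators. Existing memory systems and benchmarks are built around the dyadic, single-user setup. We introduce GroupMemBench, a benchmark that exposes three properties of group memory.
Reference score (independently computed attention score): 25.703133924514884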
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Model (LLM) agents increasingly serve as personal assistants and workplace collaborators, where their utility depends on memory systems that extract, retrieve, and apply information across long-running conversations. However, both existing memory systems and benchmarks are built around the dyadic, single-user setup, even though real deployments routinely span groups and channels with multiple users interacting with the agent and with each other. This mismatch leaves three properties of group memory unmeasured: (i) group dynamics that go beyond concatenated one-on-one chats, (ii) speaker-grounded belief tracking, where per-user memory modeling is needed, and (iii) audience-adapted language, where Theory-of-Mind shifts produce role-specific vocabulary. We introduce GroupMemBench, a benchmark that exposes all three. A graph-grounded synthesis pipeline produces multi-party conversations with controllable reply structure and conditions each message on per-user personas and target audiences. An adversarial query pipeline then binds every question to a specific asker across six categories, spanning multi-hop reasoning, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, and abstention, and iteratively searches for challenging, realistic queries that reflect comprehensive memory capability. Benchmarking leading memory systems exposes a sharp collapse: the strongest one reaches only 46.0% average accuracy, with knowledge update at 27.1% and term ambiguity at 37.7%, while a simple BM25 baseline matches or exceeds most agent memory systems. This indicates that current memory ingestion erases the structural and lexical features group memory depends on, leaving multi-user memory far from solved.
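Abstract (reference translation): Large language model (LLM) agents increasingly serve as personal assistants and workplace collaborators, and their utility depends on memory systems that extract, retrieve, and apply information across long-running conversations. However, existing memory systems and benchmarks are both built around the dyadic, single-user setup, even though real deployments routinely span groups and channels in which multiple users interact with the agent and with each other. This mismatch leaves three properties of group memory unmeasured: (i) group dynamics that go beyond concatenated one-on-one chats; (ii) speaker-grounded belief tracking, which requires per-user memory modeling; and (iii) audience-adapted language, where Theory-of-Mind shifts produce role-specific vocabulary. We introduce GroupMemBench, a benchmark that exposes all three. A graph-grounded synthesis pipeline generates multi-party conversations with controllable reply structure and conditions each message on per-user personas and target audiences. An adversarial query pipeline binds every question to a specific asker across six categories (multi-hop reasoning, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, and abstention) and iteratively searches for challenging, realistic queries that reflect comprehensive memory capability. The strongest memory system reaches only 46.0% average accuracy, with knowledge update at 27.1% and term ambiguity at 37.7%, while a simple BM25 baseline matches or exceeds most agent memory systems. This indicates that current memory ingestion erases the structural and lexical features that group memory depends on, leaving multi-user memory far from solved.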

Related papers