Fugu-MT Paper Translation (Summary): Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions

Paper summary: Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions

arxiv url: http://arxiv.org/abs/2605.26256v1
Date: Mon, 25 May 2026 18:27:27 GMT
Status: Translation complete
Last updated in system: 2026-05-27 17:51:41.362937
Title: Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions
Title (reference translation): Personalizing Multimodal Large Language Model Agents through Long-term User Interactions
Authors: Jeongeun Lee, Chanyoung Park, Dongha Lee
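Abstract summary: POLAR is a memory-augmented framework for personalized embodied agents over long-term user interactions. To execute embodied tasks, POLAR retrieves relevant memories to interpret the current request and guide task execution. Results show that the proposed memory mechanism consistently improves performance by making more effective use of information accumulated over prior interactions.
Reference score (independently computed attention score): 17.9008221917999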
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language model (MLLM)-based embodied agents have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requires more than following generic instructions or recognizing object categories. In real-world scenarios, the intended target is often specified only implicitly through prior interactions, requiring agents to leverage personalized context accumulated over time. In this work, we propose POLAR, a multimodal memory-augmented framework for personalized embodied agents over long-term user interactions. POLAR organizes prior interactions into a multimodal knowledge graph that captures semantic memory for personalized context and visual concepts, and episodic memory for embodied experiences such as agent trajectories. To execute embodied tasks, POLAR retrieves relevant memories to interpret the current request and guide task execution. We evaluate POLAR across multiple MLLM backbones and diverse evaluation scenarios to study the role of memory in long-term personalization. Results show that the proposed memory mechanism consistently improves performance by enabling more effective use of information accumulated over prior interactions. The gains are especially pronounced when the agents are required to reason across multiple interactions, perform multi-hop inference, or track updates in user-specific context over time.
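Abstract (reference translation): Embodied agents based on multimodal large language models (MLLMs) have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requires more than following generic instructions or recognizing object categories. In real-world scenarios, the intended target is often specified only implicitly through prior interactions, so agents must make use of personalized context accumulated over time. This work proposes POLAR, a multimodal memory-augmented framework for personalized embodied agents over long-term user interactions. POLAR organizes prior interactions into a multimodal knowledge graph that captures semantic memory for personalized context and visual concepts, together with episodic memory for embodied experiences such as agent trajectories. To execute embodied tasks, POLAR retrieves relevant memories to interpret the current request and guide task execution. POLAR is evaluated across multiple MLLM backbones and diverse evaluation scenarios to study the role of memory in long-term personalization. The results show that the proposed memory mechanism consistently improves performance by making more effective use of information accumulated over prior interactions. The gains are especially pronounced when agents must reason across multiple interactions, perform multi-hop inference, or track updates to user-specific context over time.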

Related papers