Fugu-MT Paper Translation (Summary): MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

Paper summary: MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

arxiv url: http://arxiv.org/abs/2511.02805v1
Date: Tue, 04 Nov 2025 18:27:39 GMT
Status: Translation complete
Last updated in system: 2025-11-05 18:47:06.147568
Title: MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
Authors: Qianhao Yuan, Jie Lou, Zichao Li, Jiawei Chen, Yaojie Lu, Hongyu Lin, Le Sun, Debing Zhang, Xianpei Han
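Abstract summary: This paper proposes MemSearcher, an agent workflow that iteratively maintains a memory and combines it with the current turn. At each turn, MemSearcher fuses the user's question with the memory to generate reasoning traces, perform search actions, and update the memory so that it retains only the information needed to solve the task. The authors introduce multi-context GRPO, an end-to-end RL framework that jointly optimizes the reasoning, search strategies, and memory management of MemSearcher agents.
Reference score (independently computed attention score): 73.27233666920618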
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Typical search agents concatenate the entire interaction history into the LLM context, preserving information integrity but producing long, noisy contexts, resulting in high computation and memory costs. In contrast, using only the current turn avoids this overhead but discards essential information. This trade-off limits the scalability of search agents. To address this challenge, we propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines the current turn with it. At each turn, MemSearcher fuses the user's question with the memory to generate reasoning traces, perform search actions, and update the memory to retain only information essential for solving the task. This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy. To optimize this workflow, we introduce multi-context GRPO, an end-to-end RL framework that jointly optimizes the reasoning, search strategies, and memory management of MemSearcher agents. Specifically, multi-context GRPO samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves significant improvements over strong baselines on seven public benchmarks, with relative average gains of +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct. Notably, the 3B-based MemSearcher even outperforms 7B-based baselines, demonstrating that striking a balance between information integrity and efficiency yields both higher accuracy and lower computational overhead. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher.
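The per-turn workflow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Step`, `run_agent`, `act`, `search`, `update_memory`, and `MEMORY_BUDGET` are hypothetical, the toy functions stand in for an LLM and a search engine, and the exact order in which reasoning, search, and memory update happen is an assumption. The point it shows is that the context contains only the question plus a bounded memory, so its length stays stable across turns.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One turn of agent output (hypothetical structure)."""
    reasoning: str
    query: Optional[str] = None   # set when the agent wants to search
    answer: Optional[str] = None  # set when the agent is ready to answer


def run_agent(
    question: str,
    act: Callable[[str], Step],
    search: Callable[[str], str],
    update_memory: Callable[[str, Step, str], str],
    max_turns: int = 8,
) -> Tuple[Optional[str], List[int]]:
    """Run a memory-based search loop and record the context length per turn."""
    memory = ""
    context_lengths: List[int] = []
    for _ in range(max_turns):
        # The context holds only the question and the compact memory,
        # not the full interaction history.
        context = f"Question: {question}\nMemory: {memory}"
        context_lengths.append(len(context))
        step = act(context)
        if step.answer is not None:
            return step.answer, context_lengths
        observation = search(step.query or "")
        # Keep only what is needed for the task; discard the raw observation.
        memory = update_memory(memory, step, observation)
    return None, context_lengths


# Toy stand-ins for the LLM and the search engine (assumptions for illustration).
MEMORY_BUDGET = 80  # maximum memory size in characters


def toy_act(context: str) -> Step:
    if "Paris" in context:
        return Step(reasoning="The memory already holds the answer.", answer="Paris")
    return Step(reasoning="I need to look this up.", query="capital of France")


def toy_search(query: str) -> str:
    return "Paris is the capital of France. " + "Unrelated filler text. " * 20


def toy_update_memory(memory: str, step: Step, observation: str) -> str:
    first_sentence = observation.split(". ")[0] + "."
    return (memory + " " + first_sentence).strip()[:MEMORY_BUDGET]


answer, lengths = run_agent(
    "What is the capital of France?", toy_act, toy_search, toy_update_memory
)
print(answer, lengths)
```

In this toy run the long search result never enters the next context; only the truncated memory does, so the context grows by at most `MEMORY_BUDGET` characters however long the observation is.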
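The abstract states that multi-context GRPO propagates trajectory-level advantages across all conversations within a trajectory. The sketch below illustrates that idea under stated assumptions: the advantage is computed with the commonly used group normalization (reward minus group mean, divided by group standard deviation), which the abstract does not spell out, and the function names `group_advantages` and `propagate_to_conversations` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each trajectory's reward against its sampling group.

    Assumption: population standard deviation with a small epsilon for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def propagate_to_conversations(
    rewards: List[float],
    conversations_per_trajectory: List[int],
    eps: float = 1e-6,
) -> List[List[float]]:
    """Copy each trajectory-level advantage to every conversation in that trajectory.

    Each trajectory yields several conversations (one per turn, each with its
    own context); all of them share the trajectory's advantage.
    """
    advantages = group_advantages(rewards, eps)
    return [[a] * n for a, n in zip(advantages, conversations_per_trajectory)]


# A group of four sampled trajectories with outcome rewards and differing turn counts.
rewards = [1.0, 0.0, 0.0, 1.0]
turns = [3, 2, 4, 1]
per_conversation = propagate_to_conversations(rewards, turns)
print(per_conversation)
```

Every conversation in a successful trajectory receives the same positive advantage, and every conversation in a failed one receives the same negative advantage, regardless of how many turns the trajectory took.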

Related papers