MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory
- URL: http://arxiv.org/abs/2601.03192v1
- Date: Tue, 06 Jan 2026 17:14:50 GMT
- Title: MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory
- Authors: Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Weinan Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Yutao Qi, Bo Tang, Muning Wen
- Abstract summary: We propose MemRL, a framework that enables agents to self-evolve via non-parametric reinforcement learning on episodic memory. MemRL employs a Two-Phase Retrieval mechanism that filters candidates by semantic relevance and then selects them based on learned Q-values. Our analysis experiments confirm that MemRL effectively reconciles the stability-plasticity dilemma, enabling continuous runtime improvement without weight updates.
- Score: 46.632646462295234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The hallmark of human intelligence is the ability to master new skills through Constructive Episodic Simulation: retrieving past experiences to synthesize solutions for novel tasks. While Large Language Models possess strong reasoning capabilities, they struggle to emulate this self-evolution: fine-tuning is computationally expensive and prone to catastrophic forgetting, while existing memory-based methods rely on passive semantic matching that often retrieves noise. To address these challenges, we propose MemRL, a framework that enables agents to self-evolve via non-parametric reinforcement learning on episodic memory. MemRL explicitly separates the stable reasoning of a frozen LLM from the plastic, evolving memory. Unlike traditional methods, MemRL employs a Two-Phase Retrieval mechanism that filters candidates by semantic relevance and then selects them based on learned Q-values (utility). These utilities are continuously refined via environmental feedback in a trial-and-error manner, allowing the agent to distinguish high-value strategies from similar noise. Extensive experiments on HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench demonstrate that MemRL significantly outperforms state-of-the-art baselines. Our analysis experiments confirm that MemRL effectively reconciles the stability-plasticity dilemma, enabling continuous runtime improvement without weight updates.
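The Two-Phase Retrieval idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `MemoryItem`, `two_phase_retrieve`, `update_q`, `k_candidates`, `k_select`, and `alpha` are hypothetical, cosine similarity stands in for whatever semantic relevance measure the paper uses, and the update rule is an assumed running-average step toward the observed reward, since the abstract does not state the exact formula.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    """One episodic memory entry (hypothetical structure)."""
    text: str
    embedding: list[float]
    q_value: float = 0.0  # learned utility, refined from feedback


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_phase_retrieve(query_emb, memory, k_candidates=3, k_select=1):
    """Phase 1: keep the k_candidates most similar items.
    Phase 2: among those, return the k_select items with the highest Q-value."""
    candidates = sorted(
        memory, key=lambda m: cosine(query_emb, m.embedding), reverse=True
    )[:k_candidates]
    return sorted(candidates, key=lambda m: m.q_value, reverse=True)[:k_select]


def update_q(item, reward, alpha=0.5):
    """Assumed update: move the stored utility a fraction alpha toward the reward."""
    item.q_value += alpha * (reward - item.q_value)


memory = [
    MemoryItem("strategy A", [1.0, 0.0], q_value=0.2),
    MemoryItem("strategy B", [0.9, 0.1], q_value=0.8),
    MemoryItem("unrelated C", [0.0, 1.0], q_value=5.0),
]
chosen = two_phase_retrieve([1.0, 0.0], memory, k_candidates=2, k_select=1)
update_q(chosen[0], reward=1.0)  # environment feedback after using the memory
```

In this toy run, the unrelated item is discarded in phase 1 despite its high Q-value, and phase 2 prefers the higher-utility item among the semantically close candidates. The frozen LLM is untouched throughout; only the stored utilities change.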
Related papers
- Internalizing Meta-Experience into Memory for Guided Reinforcement Learning in Large Language Models [28.300560850867374]
  Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for enhancing the reasoning capabilities of Large Language Models (LLMs). We propose Meta-Experience Learning (MEL), a novel framework that incorporates self-distilled meta-experience into the model's parametric memory. MEL achieves consistent improvements on benchmarks, yielding 3.92%--4.73% Pass@1 gains across varying model sizes.
  Date: 2026-02-10T19:16:09Z
- Q-learning with Adjoint Matching [58.78551025170267]
  We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm. QAM sidesteps two challenges by leveraging adjoint matching, a recently proposed technique in generative modeling. It consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.
  Date: 2026-01-20T18:45:34Z
- Learn Like Humans: Use Meta-cognitive Reflection for Efficient Self-Improvement [12.323590647528247]
  We propose Metacognitive Agent Reflective Self-improvement (MARS), a framework that achieves efficient self-evolution within a single recurrence cycle. MARS mimics human learning by integrating principle-based reflection and procedural reflection. Experiments on six benchmarks demonstrate that MARS outperforms state-of-the-art self-evolving systems.
  Date: 2026-01-17T09:12:26Z
- Towards Continuous Intelligence Growth: Self-Training, Continual Learning, and Dual-Scale Memory in SuperIntelliAgent [10.571643330948858]
  SuperIntelliAgent is an agentic learning framework that couples a trainable small diffusion model (the learner) with a frozen large language model (the verifier). Unlike conventional supervised fine-tuning, SuperIntelliAgent learns autonomously without annotation. We posit that pairing a trainable learner with a reasoning-capable verifier forms a minimal reliable unit of growing intelligence.
  Date: 2025-11-28T18:32:49Z
- From Experience to Strategy: Empowering LLM Agents with Trainable Graph Memory [48.22750809620306]
  Large Language Model (LLM)-based agents have demonstrated remarkable potential in autonomous task-solving. In this paper, we introduce a novel agent-centric, trainable, multi-layered graph memory framework. We show how context memory enhances the ability of LLMs to utilize information.
  Date: 2025-11-11T03:36:33Z
- Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting [92.57796055887995]
  We introduce ECHO, a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts. We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation.
  Date: 2025-10-11T18:11:09Z
- Meta-Policy Reflexion: Reusable Reflective Memory and Rule Admissibility for Resource-Efficient LLM Agent [6.300669721057781]
  Meta-Policy Reflexion (MPR) is a framework that consolidates LLM-generated reflections into a structured, predicate-like Meta-Policy Memory (MPM). MPR externalizes reusable corrective knowledge without model weight updates, enforces domain constraints to reduce unsafe or invalid actions, and retains the adaptability of language-based reflection. Empirical results reported in the supplied material indicate consistent gains in execution accuracy and robustness when compared to Reflexion baselines; rule admissibility further improves stability.
  Date: 2025-09-04T08:18:39Z
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey [103.32591749156416]
  The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL). This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL.
  Date: 2025-09-02T17:46:26Z
- Post-Training Large Language Models via Reinforcement Learning from Self-Feedback [3.73824942136665]
  Large Language Models (LLMs) often produce plausible but poorly-calibrated answers. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model's own confidence as an intrinsic reward.
  Date: 2025-07-29T15:46:26Z
- ReVeal: Self-Evolving Code Agents via Reliable Self-Verification [11.875519107421312]
  We introduce ReVeal, a reinforcement learning framework that evolves code generation through self-verification and tool-based evaluation. At inference, this strengthened self-verification enables the model to use self-constructed tests and tool feedback to continuously evolve code for 20+ turns on LiveCodeBench despite training on only three. These findings highlight the promise of ReVeal as a scalable paradigm for RL training and test-time scaling, paving the way for more robust and autonomous AI agents.
  Date: 2025-06-13T03:41:04Z
- RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning [125.96848846966087]
  Training large language models (LLMs) as interactive agents presents unique challenges. While reinforcement learning has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO, a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents.
  Date: 2025-04-24T17:57:08Z
This list is automatically generated from the titles and abstracts of the papers on this site.