Related papers: MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory

MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory

URL: http://arxiv.org/abs/2601.03192v1
Date: Tue, 06 Jan 2026 17:14:50 GMT
Title: MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory
Authors: Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Weinan Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Yutao Qi, Bo Tang, Muning Wen,
Abstract summary: We propose MemRL, a framework that enables agents to self-evolve via non-parametric reinforcement learning on episodic memory.<n>MemRL employs a Two-Phase Retrieval mechanism that filters candidates by semantic relevance and then selects them based on learned Q-values.<n>Our analysis experiments confirm that MemRL effectively reconciles the stability-plasticity dilemma, enabling continuous runtime improvement without weight updates.
Score: 46.632646462295234
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The hallmark of human intelligence is the ability to master new skills through Constructive Episodic Simulation-retrieving past experiences to synthesize solutions for novel tasks. While Large Language Models possess strong reasoning capabilities, they struggle to emulate this self-evolution: fine-tuning is computationally expensive and prone to catastrophic forgetting, while existing memory-based methods rely on passive semantic matching that often retrieves noise. To address these challenges, we propose MemRL, a framework that enables agents to self-evolve via non-parametric reinforcement learning on episodic memory. MemRL explicitly separates the stable reasoning of a frozen LLM from the plastic, evolving memory. Unlike traditional methods, MemRL employs a Two-Phase Retrieval mechanism that filters candidates by semantic relevance and then selects them based on learned Q-values (utility). These utilities are continuously refined via environmental feedback in an trial-and-error manner, allowing the agent to distinguish high-value strategies from similar noise. Extensive experiments on HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench demonstrate that MemRL significantly outperforms state-of-the-art baselines. Our analysis experiments confirm that MemRL effectively reconciles the stability-plasticity dilemma, enabling continuous runtime improvement without weight updates.

Related papers

Internalizing Meta-Experience into Memory for Guided Reinforcement Learning in Large Language Models [28.300560850867374]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for enhancing the reasoning capabilities of Large Language Models (LLMs)<n>We propose Meta-Experience Learning (MEL), a novel framework that incorporates self-distilled meta-experience into the model's parametric memory.<n>MEL achieves consistent improvements on benchmarks, yielding 3.92%--4.73% Pass@1 gains across varying model sizes.
arXiv Detail & Related papers (2026-02-10T19:16:09Z)
Q-learning with Adjoint Matching [58.78551025170267]
We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm.<n>QAM sidesteps two challenges by leveraging adjoint matching, a recently proposed technique in generative modeling.<n>It consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.
arXiv Detail & Related papers (2026-01-20T18:45:34Z)
Learn Like Humans: Use Meta-cognitive Reflection for Efficient Self-Improvement [12.323590647528247]
We propose Metacognitive Agent Reflective Self-improvement (MARS), a framework that achieves efficient self-evolution within a single recurrence cycle.<n>MARS mimics human learning by integrating principle-based reflection and procedural reflection.<n>Experiments on six benchmarks demonstrate that MARS outperforms state-of-the-art self-evolving systems.
arXiv Detail & Related papers (2026-01-17T09:12:26Z)
Towards Continuous Intelligence Growth: Self-Training, Continual Learning, and Dual-Scale Memory in SuperIntelliAgent [10.571643330948858]
SuperIntelliAgent is an agentic learning framework that couples a trainable small diffusion model (the learner) with a frozen large language model (the verifier)<n>Unlike conventional supervised fine-tuning, SuperIntelliAgent learns autonomously without annotation.<n>We posit that pairing a trainable learner with a reasoning-capable verifier forms a minimal reliable unit of growing intelligence.
arXiv Detail & Related papers (2025-11-28T18:32:49Z)
From Experience to Strategy: Empowering LLM Agents with Trainable Graph Memory [48.22750809620306]
Large Language Models (LLMs) based agents have demonstrated remarkable potential in autonomous task-solving.<n>In this paper, we introduce a novel agent-centric, trainable, multi-layered graph memory framework.<n>We show how context memory enhances the ability of LLMs to utilize information.
arXiv Detail & Related papers (2025-11-11T03:36:33Z)
Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting [92.57796055887995]
We introduce ECHO, a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents.<n> ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts.<n>We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation.
arXiv Detail & Related papers (2025-10-11T18:11:09Z)
Meta-Policy Reflexion: Reusable Reflective Memory and Rule Admissibility for Resource-Efficient LLM Agent [6.300669721057781]
Meta-Policy Reflexion (MPR) is a framework that consolidates LLM-generated reflections into a structured, predicate-like Meta-Policy Memory (MPM)<n>MPR externalizes reusable corrective knowledge without model weight updates, enforces domain constraints to reduce unsafe or invalid actions, and retains the adaptability of language-based reflection.<n> Empirical results reported in the supplied material indicate consistent gains in execution accuracy and robustness when compared to Reflexion baselines; rule admissibility further improves stability.
arXiv Detail & Related papers (2025-09-04T08:18:39Z)
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey [103.32591749156416]
The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL)<n>This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL.
arXiv Detail & Related papers (2025-09-02T17:46:26Z)
Post-Training Large Language Models via Reinforcement Learning from Self-Feedback [3.73824942136665]
Large Language Models (LLMs) often produce plausible but poorly-calibrated answers.<n>We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model's own confidence as an intrinsic reward.
arXiv Detail & Related papers (2025-07-29T15:46:26Z)
ReVeal: Self-Evolving Code Agents via Reliable Self-Verification [11.875519107421312]
We introduce ReVeal, a reinforcement learning framework that evolves code generation through self-verification and tool-based evaluation.<n>At inference, this strengthened self-verification enables the model to use self-constructed tests and tool feedback to continuously evolve code for 20+ turns on LiveCodeBench despite training on only three.<n>These findings highlight the promise of ReVeal as a scalable paradigm for RL training and test-time scaling, paving the way for more robust and autonomous AI agents.
arXiv Detail & Related papers (2025-06-13T03:41:04Z)
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning [125.96848846966087]
Training large language models (LLMs) as interactive agents presents unique challenges.<n>While reinforcement learning has enabled progress in static tasks, multi-turn agent RL training remains underexplored.<n>We propose StarPO, a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents.
arXiv Detail & Related papers (2025-04-24T17:57:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.