Related papers: Synthetic POMDPs to Challenge Memory-Augmented RL: Memory Demand Structure Modeling

Synthetic POMDPs to Challenge Memory-Augmented RL: Memory Demand Structure Modeling

URL: http://arxiv.org/abs/2508.04282v1
Date: Wed, 06 Aug 2025 10:13:17 GMT
Title: Synthetic POMDPs to Challenge Memory-Augmented RL: Memory Demand Structure Modeling
Authors: Yongyi Wang, Lingfeng Li, Bozhou Chen, Ang Li, Hanyu Liu, Qirui Zheng, Xionghui Yang, Wenxin Li,
Abstract summary: Recent research has developed benchmarks for memory-augmented reinforcement learning (RL) algorithms.<n>POMDP environments where agents depend on past observations to make decisions.<n>Our work clarifies the challenges of memory-augmented RL in solving POMDPs, provides guidelines for analyzing and designing POMDP environments, and offers empirical support for selecting memory models in RL tasks.
Score: 6.279650855031215
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent research has developed benchmarks for memory-augmented reinforcement learning (RL) algorithms, providing Partially Observable Markov Decision Process (POMDP) environments where agents depend on past observations to make decisions. While many benchmarks incorporate sufficiently complex real-world problems, they lack controllability over the degree of challenges posed to memory models. In contrast, synthetic environments enable fine-grained manipulation of dynamics, making them critical for detailed and rigorous evaluation of memory-augmented RL. Our study focuses on POMDP synthesis with three key contributions: 1. A theoretical framework for analyzing POMDPs, grounded in Memory Demand Structure (MDS), transition invariance, and related concepts; 2. A methodology leveraging linear process dynamics, state aggregation, and reward redistribution to construct customized POMDPs with predefined properties; 3. Empirically validated series of POMDP environments with increasing difficulty levels, designed based on our theoretical insights. Our work clarifies the challenges of memory-augmented RL in solving POMDPs, provides guidelines for analyzing and designing POMDP environments, and offers empirical support for selecting memory models in RL tasks.

Related papers

Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL)<n>Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
Towards Cost Sensitive Decision Making [14.279123976398926]
In this work, we consider RL models that may actively acquire features from the environment to improve the decision quality and certainty. We propose the Active-Acquisition POMDP and identify two types of the acquisition process for different application domains. In order to assist the agent in the actively-acquired partially-observed environment and alleviate the exploration-exploitation dilemma, we develop a model-based approach.
arXiv Detail & Related papers (2024-10-04T19:48:23Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms [34.593772931446125]
monograph focuses on the exploration of various model-based and model-free approaches for Constrained within the context of average reward Markov Decision Processes (MDPs) The primal-dual policy gradient-based algorithm is explored as a solution for constrained MDPs.
arXiv Detail & Related papers (2024-06-17T12:46:02Z)
Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
POMDP inference and robust solution via deep reinforcement learning: An application to railway optimal maintenance [0.7046417074932257]
We propose a combined framework for inference and robust solution of POMDPs via deep RL. First, all transition and observation model parameters are jointly inferred via Markov Chain Monte Carlo sampling of a hidden Markov model. The POMDP with uncertain parameters is then solved via deep RL techniques with the parameter distributions incorporated into the solution via domain randomization.
arXiv Detail & Related papers (2023-07-16T15:44:58Z)
GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond [101.5329678997916]
We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making. We propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation. We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR.
arXiv Detail & Related papers (2022-11-03T16:42:40Z)
Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning [52.74071439183113]
We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) solved via reinforcement learning. Two significant computational challenges arise in applying decision-focused learning to MDPs.
arXiv Detail & Related papers (2021-06-06T23:53:31Z)
Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning [83.66080019570461]
We propose two environment-agnostic, algorithm-agnostic quantitative metrics for task difficulty. We show that these metrics have higher correlations with normalized task solvability scores than a variety of alternatives. These metrics can also be used for fast and compute-efficient optimizations of key design parameters.
arXiv Detail & Related papers (2021-03-23T17:49:50Z)
Memory-based Deep Reinforcement Learning for POMDP [7.137228786549488]
Long-Short-Term-Memory-based Twin Delayed Deep Deterministic Policy Gradient (LSTM-TD3) Our results demonstrate the significant advantages of the memory component in addressing Partially Observable MDPs.
arXiv Detail & Related papers (2021-02-24T15:25:13Z)
DeepAveragers: Offline Reinforcement Learning by Solving Derived Non-Parametric MDPs [33.07594285100664]
We study an approach to offline reinforcement learning (RL) based on optimally solving finitely-represented MDPs derived from a static dataset of experience.<n>Our main contribution is to introduce the Deep Averagers with Costs MDP (DAC-MDP) and to investigate its solutions for offline RL.
arXiv Detail & Related papers (2020-10-18T00:11:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.