Reinforcement Learning with Imperfect Transition Predictions: A Bellman-Jensen Approach
- URL: http://arxiv.org/abs/2510.18687v1
- Date: Tue, 21 Oct 2025 14:47:08 GMT
- Title: Reinforcement Learning with Imperfect Transition Predictions: A Bellman-Jensen Approach
- Authors: Chenbei Lu, Zaiwei Chen, Tongxin Li, Chenye Wu, Adam Wierman
- Abstract summary: Traditional reinforcement learning assumes the agents make decisions based on Markov decision processes (MDPs) with one-step transition models. In many real-world applications, such as energy management and stock investment, agents can access multi-step predictions of future states. We introduce BOLA, a two-stage model-based RL algorithm that separates offline Bayesian value learning from lightweight online adaptation to real-time predictions.
- Score: 24.85612231267623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional reinforcement learning (RL) assumes the agents make decisions based on Markov decision processes (MDPs) with one-step transition models. In many real-world applications, such as energy management and stock investment, agents can access multi-step predictions of future states, which provide additional advantages for decision making. However, multi-step predictions are inherently high-dimensional: naively embedding these predictions into an MDP leads to an exponential blow-up in state space and the curse of dimensionality. Moreover, existing RL theory provides few tools to analyze prediction-augmented MDPs, as it typically works on one-step transition kernels and cannot accommodate multi-step predictions with errors or partial action-coverage. We address these challenges with three key innovations: First, we propose the Bayesian value function to characterize the optimal prediction-aware policy tractably. Second, we develop a novel Bellman-Jensen Gap analysis on the Bayesian value function, which enables characterizing the value of imperfect predictions. Third, we introduce BOLA (Bayesian Offline Learning with Online Adaptation), a two-stage model-based RL algorithm that separates offline Bayesian value learning from lightweight online adaptation to real-time predictions. We prove that BOLA remains sample-efficient even under imperfect predictions. We validate our theory and algorithm on synthetic MDPs and a real-world wind energy storage control problem.
Related papers
- Agentic World Modeling for 6G: Near-Real-Time Generative State-Space Reasoning [70.56067503630486]
We argue that sixth-generation (6G) intelligence is not fluent token prediction but the calibrated capacity to imagine and choose. We show that WM-MS3M cuts mean absolute error (MAE) by 1.69% versus MS3M with 32% fewer parameters and similar latency, and achieves 35-80% lower root mean squared error (RMSE) than attention/hybrid baselines with 2.3-4.1x faster inference.
arXiv Detail & Related papers (2025-11-04T17:22:22Z) - TD-JEPA: Latent-predictive Representations for Zero-Shot Reinforcement Learning [63.73629127832652]
We introduce TD-JEPA, which incorporates TD-based latent-predictive representations into unsupervised RL. TD-JEPA trains explicit state and task encoders, a policy-conditioned multi-step predictor, and a set of parameterized policies directly in latent space. Empirically, TD-JEPA matches or outperforms state-of-the-art baselines on locomotion, navigation, and manipulation tasks across 13 datasets.
arXiv Detail & Related papers (2025-10-01T10:21:18Z) - Next-Token Prediction Should be Ambiguity-Sensitive: A Meta-Learning Perspective [12.655285605773932]
We show that Transformers indeed struggle with high-ambiguity predictions across model sizes. Preliminary results show substantial gains in ambiguous contexts through improved capacity allocation and test-time scalable inference.
arXiv Detail & Related papers (2025-06-19T13:05:12Z) - Prediction-Powered Adaptive Shrinkage Estimation [0.22917707112773592]
Prediction-Powered Adaptive Shrinkage (PAS) is a method that bridges PPI with empirical Bayes shrinkage to improve the estimation of multiple means. PAS adapts to the reliability of the ML predictions and outperforms traditional and modern baselines in large-scale applications.
arXiv Detail & Related papers (2025-02-20T00:24:05Z) - Predictive Control and Regret Analysis of Non-Stationary MDP with Look-ahead Information [11.679770353558041]
We propose an algorithm designed to achieve low regret in non-stationary MDPs by incorporating look-ahead predictions.
Our theoretical analysis demonstrates that, under certain assumptions, the regret decreases exponentially as the look-ahead window expands.
We validate our approach through simulations, confirming the efficacy of our algorithm in non-stationary environments.
arXiv Detail & Related papers (2024-09-13T00:01:58Z) - Movement-Prediction-Adjusted Naive Forecast [6.935130578959931]
The movement-prediction-adjusted naive forecast (MPANF) is designed to improve point forecasts beyond the naive baseline. MPANF can serve as an effective second-stage method when reliable movement predictions are available.
arXiv Detail & Related papers (2024-06-20T16:32:18Z) - GVFs in the Real World: Making Predictions Online for Water Treatment [23.651798878534635]
We investigate the use of reinforcement-learning based prediction approaches for a real drinking-water treatment plant.
We first describe this dataset and highlight challenges with seasonality, nonstationarity, and partial observability.
We show the importance of learning in deployment, by comparing a TD agent trained purely offline with no online updating to a TD agent that learns online.
arXiv Detail & Related papers (2023-12-04T04:49:10Z) - Human Trajectory Forecasting with Explainable Behavioral Uncertainty [63.62824628085961]
Human trajectory forecasting helps to understand and predict human behaviors, enabling applications from social robots to self-driving cars.
Model-free methods offer superior prediction accuracy but lack explainability, while model-based methods provide explainability but cannot predict well.
We show that BNSP-SFM achieves up to a 50% improvement in prediction accuracy, compared with 11 state-of-the-art methods.
arXiv Detail & Related papers (2023-07-04T16:45:21Z) - Reinforcement Learning with Human Feedback: Learning Dynamic Choices via
Pessimism [91.52263068880484]
We study offline Reinforcement Learning with Human Feedback (RLHF).
We aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices.
RLHF is challenging for multiple reasons: large state space but limited human feedback, the bounded rationality of human decisions, and the off-policy distribution shift.
arXiv Detail & Related papers (2023-05-29T01:18:39Z) - Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models [51.3422222472898]
We document the capability of large language models (LLMs) like ChatGPT to predict stock price movements using news headlines.
We develop a theoretical model incorporating information capacity constraints, underreaction, limits-to-arbitrage, and LLMs.
arXiv Detail & Related papers (2023-04-15T19:22:37Z) - Toward Reliable Human Pose Forecasting with Uncertainty [51.628234388046195]
We develop an open-source library for human pose forecasting, including multiple models, supporting several datasets.
We devise two types of uncertainty in the problem to increase performance and convey better trust.
arXiv Detail & Related papers (2023-04-13T17:56:08Z) - Online Policy Optimization for Robust MDP [17.995448897675068]
Reinforcement learning (RL) has exceeded human performance in many synthetic settings such as video games and Go.
In this work, we consider online robust Markov decision processes (MDPs), in which the agent learns by interacting with an unknown nominal system.
We propose a robust optimistic policy optimization algorithm that is provably efficient.
arXiv Detail & Related papers (2022-09-28T05:18:20Z) - Test-time Collective Prediction [73.74982509510961]
Multiple parties in machine learning want to jointly make predictions on future test points.
Agents wish to benefit from the collective expertise of the full set of agents, but may not be willing to release their data or model parameters.
We explore a decentralized mechanism to make collective predictions at test time, leveraging each agent's pre-trained model.
arXiv Detail & Related papers (2021-06-22T18:29:58Z)