Related papers: Fast Value Tracking for Deep Reinforcement Learning

Fast Value Tracking for Deep Reinforcement Learning

URL: http://arxiv.org/abs/2403.13178v1
Date: Tue, 19 Mar 2024 22:18:19 GMT
Title: Fast Value Tracking for Deep Reinforcement Learning
Authors: Frank Shih, Faming Liang,
Abstract summary: Reinforcement learning (RL) tackles sequential decision-making problems by creating agents that interact with their environment. Existing algorithms often view these problem as static, focusing on point estimates for model parameters to maximize expected rewards. Our research leverages the Kalman paradigm to introduce a novel quantification and sampling algorithm called Langevinized Kalman TemporalTD.
Score: 7.648784748888187
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Reinforcement learning (RL) tackles sequential decision-making problems by creating agents that interacts with their environment. However, existing algorithms often view these problem as static, focusing on point estimates for model parameters to maximize expected rewards, neglecting the stochastic dynamics of agent-environment interactions and the critical role of uncertainty quantification. Our research leverages the Kalman filtering paradigm to introduce a novel and scalable sampling algorithm called Langevinized Kalman Temporal-Difference (LKTD) for deep reinforcement learning. This algorithm, grounded in Stochastic Gradient Markov Chain Monte Carlo (SGMCMC), efficiently draws samples from the posterior distribution of deep neural network parameters. Under mild conditions, we prove that the posterior samples generated by the LKTD algorithm converge to a stationary distribution. This convergence not only enables us to quantify uncertainties associated with the value function and model parameters but also allows us to monitor these uncertainties during policy updates throughout the training phase. The LKTD algorithm paves the way for more robust and adaptable reinforcement learning approaches.

Related papers

Cochain Perspectives on Temporal-Difference Signals for Learning Beyond Markov Dynamics [8.820825533010543]
We present a novel viewpoint on temporal-difference basedReinforcement learning.<n>We show that TD errors can be viewed as 1-cochain in the topological space of state transitions, while Markov dynamics are then interpreted as topological integrability.<n>This novel view enables us to obtain a Hodge-type decomposition of TD errors into an integrable component and a topological residual.
arXiv Detail & Related papers (2026-02-06T18:35:41Z)
MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics [72.00014675808228]
Instability in Large Language Models evaluation process obscures true learning dynamics.<n>We introduce textbfMaP, a framework that integrates underlineMerging underlineand the underlinePass@k metric.<n>Experiments show that MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent rankings.
arXiv Detail & Related papers (2025-10-10T11:40:27Z)
Instance-Dependent Continuous-Time Reinforcement Learning via Maximum Likelihood Estimation [27.232790785138427]
Continuous-time reinforcement learning (CTRL) provides a natural framework for sequential decision-making in dynamic environments.<n>While has shown growing empirical success, its ability to adapt to varying levels of problem difficulty remains poorly understood.<n>In this work, we investigate the instance-dependent behavior of and introduce a simple, model-based algorithm built on maximum likelihood estimation.
arXiv Detail & Related papers (2025-08-04T06:25:45Z)
Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models.<n>Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement.<n>We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity.
arXiv Detail & Related papers (2025-07-11T08:00:47Z)
Uncertainty quantification for Markov chains with application to temporal difference learning [63.49764856675643]
We develop novel high-dimensional concentration inequalities and Berry-Esseen bounds for vector- and matrix-valued functions of Markov chains. We analyze the TD learning algorithm, a widely used method for policy evaluation in reinforcement learning.
arXiv Detail & Related papers (2025-02-19T15:33:55Z)
Stochastic Control for Fine-tuning Diffusion Models: Optimality, Regularity, and Convergence [11.400431211239958]
Diffusion models have emerged as powerful tools for generative modeling. We propose a control framework for fine-tuning diffusion models. We show that PI-FT achieves global convergence at a linear rate.
arXiv Detail & Related papers (2024-12-24T04:55:46Z)
Sublinear Regret for a Class of Continuous-Time Linear--Quadratic Reinforcement Learning Problems [10.404992912881601]
We study reinforcement learning for a class of continuous-time linear-quadratic (LQ) control problems for diffusions. We apply a model-free approach that relies neither on knowledge of model parameters nor on their estimations, and devise an actor-critic algorithm to learn the optimal policy parameter directly.
arXiv Detail & Related papers (2024-07-24T12:26:21Z)
Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase. We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs. To explore the versatility of our approach, we develop a novel method to automatically generate step-level reward dataset for coding tasks and observed similar improved performance in the code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z)
Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior [51.60683890503293]
We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling. We show that pure supervised cloning can generate trajectories matching the per-time step distribution of arbitrary expert trajectories.
arXiv Detail & Related papers (2023-07-27T04:27:26Z)
Efficient hierarchical Bayesian inference for spatio-temporal regression models in neuroimaging [6.512092052306553]
Examples include M/EEG inverse problems, encoding neural models for task-based fMRI analyses, and temperature monitoring schemes. We devise a novel hierarchical flexible Bayesian framework within which the intrinsic-temporal dynamics of model parameters and noise are modeled.
arXiv Detail & Related papers (2021-11-02T15:50:01Z)
Minimum-Delay Adaptation in Non-Stationary Reinforcement Learning via Online High-Confidence Change-Point Detection [7.685002911021767]
We introduce an algorithm that efficiently learns policies in non-stationary environments. It analyzes a possibly infinite stream of data and computes, in real-time, high-confidence change-point detection statistics. We show that (i) this algorithm minimizes the delay until unforeseen changes to a context are detected, thereby allowing for rapid responses.
arXiv Detail & Related papers (2021-05-20T01:57:52Z)
A Contour Stochastic Gradient Langevin Dynamics Algorithm for Simulations of Multi-modal Distributions [17.14287157979558]
We propose an adaptively weighted gradient Langevin dynamics (SGLD) for learning in big data statistics. The proposed algorithm is tested on benchmark datasets including CIFAR100.
arXiv Detail & Related papers (2020-10-19T19:20:47Z)
Stochastic Gradient Langevin Dynamics Algorithms with Adaptive Drifts [8.36840154574354]
We propose a class of adaptive gradient Markov chain Monte Carlo (SGMCMC) algorithms, where the drift function is biased to enhance escape from saddle points and the bias is adaptively adjusted according to the gradient of past samples. We demonstrate via numerical examples that the proposed algorithms can significantly outperform the existing SGMCMC algorithms.
arXiv Detail & Related papers (2020-09-20T22:03:39Z)
A Kernel-Based Approach to Non-Stationary Reinforcement Learning in Metric Spaces [53.47210316424326]
KeRNS is an algorithm for episodic reinforcement learning in non-stationary Markov Decision Processes. We prove a regret bound that scales with the covering dimension of the state-action space and the total variation of the MDP with time.
arXiv Detail & Related papers (2020-07-09T21:37:13Z)
An Ode to an ODE [78.97367880223254]
We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the group O(d) This nested system of two flows provides stability and effectiveness of training and provably solves the gradient vanishing-explosion problem.
arXiv Detail & Related papers (2020-06-19T22:05:19Z)
Reparameterized Variational Divergence Minimization for Stable Imitation [57.06909373038396]
We study the extent to which variations in the choice of probabilistic divergence may yield more performant ILO algorithms. We contribute a re parameterization trick for adversarial imitation learning to alleviate the challenges of the promising $f$-divergence minimization framework. Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
arXiv Detail & Related papers (2020-06-18T19:04:09Z)
Robust Reinforcement Learning with Wasserstein Constraint [49.86490922809473]
We show the existence of optimal robust policies, provide a sensitivity analysis for the perturbations, and then design a novel robust learning algorithm. The effectiveness of the proposed algorithm is verified in the Cart-Pole environment.
arXiv Detail & Related papers (2020-06-01T13:48:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.