Reinforcement Learning From State and Temporal Differences
- URL: http://arxiv.org/abs/2512.08855v1
- Date: Tue, 09 Dec 2025 17:48:28 GMT
- Title: Reinforcement Learning From State and Temporal Differences
- Authors: Lex Weaver, Jonathan Baxter
- Abstract summary: TD($λ$) with function approximation has proved empirically successful for some complex reinforcement learning problems. We show that it is error in the relative ordering of states that is critical, rather than error in the state values. We present a modified form of TD($λ$), called STD($λ$), in which function approximators are trained with respect to relative state values on binary decision problems.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: TD($λ$) with function approximation has proved empirically successful for some complex reinforcement learning problems. For linear approximation, TD($λ$) has been shown to minimise the squared error between the approximate value of each state and the true value. However, as far as policy is concerned, it is error in the relative ordering of states that is critical, rather than error in the state values. We illustrate this point, both in simple two-state and three-state systems in which TD($λ$)--starting from an optimal policy--converges to a sub-optimal policy, and also in backgammon. We then present a modified form of TD($λ$), called STD($λ$), in which function approximators are trained with respect to relative state values on binary decision problems. A theoretical analysis, including a proof of monotonic policy improvement for STD($λ$) in the context of the two-state system, is presented, along with a comparison with Bertsekas' differential training method [1]. This is followed by successful demonstrations of STD($λ$) on the two-state system and a variation on the well known acrobot problem.
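The abstract contrasts two training signals: the standard TD($λ$) update toward absolute state values, and the STD($λ$) idea of training toward relative state values on binary decisions. Below is a minimal Python sketch of the first, followed by a hypothetical illustration of the second; the function names and the perceptron-style relative update are assumptions for illustration, not the paper's exact STD($λ$) rule.

```python
import numpy as np

def td_lambda_update(w, z, phi_s, phi_s_next, reward,
                     alpha=0.1, gamma=1.0, lam=0.7):
    """One TD(lambda) step with linear function approximation,
    where V(s) = w . phi(s) and z is the eligibility trace."""
    delta = reward + gamma * w @ phi_s_next - w @ phi_s  # TD error
    z = gamma * lam * z + phi_s                          # accumulate trace
    w = w + alpha * delta * z                            # trace-weighted update
    return w, z

def relative_value_update(w, phi_a, phi_b, prefer_a, alpha=0.1):
    """Hypothetical relative-value correction on a binary decision:
    adjust w only when the ORDERING of V(a) and V(b) is wrong.
    An illustration of training on relative state values, not the
    paper's exact STD(lambda) update."""
    margin = w @ (phi_a - phi_b)          # V(a) - V(b)
    if prefer_a and margin <= 0:
        w = w + alpha * (phi_a - phi_b)   # push V(a) above V(b)
    elif not prefer_a and margin >= 0:
        w = w - alpha * (phi_a - phi_b)   # push V(b) above V(a)
    return w
```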
Related papers
- Distributionally Robust Policy Learning under Concept Drifts [33.44768994272614]
This paper studies a more nuanced problem -- robust policy learning under concept drift. We first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy. We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class.
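For context, the classical (non-robust) doubly-robust value estimator that such approaches build on combines a model prediction with an importance-weighted residual. The sketch below is the standard bandit-style form, with all names illustrative; it does not include the paper's worst-case treatment of concept drift.

```python
import numpy as np

def doubly_robust_value(logged, pi_target, q_hat, v_hat):
    """Standard doubly-robust estimator of a target policy's value.

    logged          -- iterable of (state, action, reward, behavior_prob)
    pi_target(a, s) -- target policy's probability of action a in state s
    q_hat, v_hat    -- fitted outcome models for Q(s, a) and V(s)
    """
    vals = []
    for s, a, r, mu in logged:
        rho = pi_target(a, s) / mu                    # importance ratio
        vals.append(v_hat(s) + rho * (r - q_hat(s, a)))
    return float(np.mean(vals))
```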
arXiv Detail & Related papers (2024-12-18T19:53:56Z) - Statistical Learning of Distributionally Robust Stochastic Control in Continuous State Spaces [17.96094201655567]
We explore the control of systems with potentially continuous state and action spaces, characterized by the state dynamics $X_{t+1} = f(X_t, A_t, W_t)$.
Here, $X$, $A$, and $W$ represent the state, action, and random noise processes, respectively, with $f$ denoting a known function that describes state transitions.
This paper introduces a distributionally robust control paradigm that accommodates possibly adversarial perturbation to the noise distribution within a prescribed ambiguity set.
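To make the setting concrete, here is a hedged sketch that estimates a policy's worst-case average reward over a finite ambiguity set of noise distributions by Monte Carlo rollouts of $X_{t+1} = f(X_t, A_t, W_t)$; the finite set and all function names are simplifying assumptions, since the paper treats general prescribed ambiguity sets.

```python
import numpy as np

def worst_case_average_reward(policy, f, reward, noise_dists, x0,
                              horizon=100, n_rollouts=200, seed=0):
    """Monte Carlo estimate of the minimum, over a finite ambiguity set,
    of the expected average reward of `policy` under X_{t+1} = f(X_t, A_t, W_t).

    noise_dists -- finite ambiguity set: callables rng -> noise sample
    """
    rng = np.random.default_rng(seed)
    values = []
    for dist in noise_dists:              # enumerate candidate noise laws
        total = 0.0
        for _ in range(n_rollouts):
            x = x0
            for _ in range(horizon):
                a = policy(x)
                total += reward(x, a)
                x = f(x, a, dist(rng))    # known dynamics, adversarial noise
        values.append(total / (n_rollouts * horizon))
    return min(values)                    # worst case over the ambiguity set
```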
arXiv Detail & Related papers (2024-06-17T07:37:36Z) - Optimization of Time-Dependent Decoherence Rates and Coherent Control for a Qutrit System [77.34726150561087]
Incoherent control makes the decoherence rates depend on time in a specific controlled manner.
We consider the problem of maximizing the Hilbert-Schmidt overlap between the system's final state $\rho(T)$ and a given target state $\rho_{\rm target}$.
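For reference, the Hilbert-Schmidt overlap being maximized has the standard form below (the summary does not spell out the definition, so this assumes the usual convention, writing $J$ for the objective):

```latex
J(T) \;=\; \operatorname{Tr}\!\big(\rho(T)\,\rho_{\mathrm{target}}\big)
```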
arXiv Detail & Related papers (2023-08-08T01:28:50Z) - On the Model-Misspecification in Reinforcement Learning [9.864462523050843]
We present a unified theoretical framework for addressing model misspecification in reinforcement learning.
We show that value-based and model-based methods can achieve robustness under local misspecification error bounds.
We also propose an algorithmic framework that can achieve the same order of regret bound without prior knowledge of $\zeta$.
arXiv Detail & Related papers (2023-06-19T04:31:59Z) - Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space [0.0]
We revisit the regret of undiscounted reinforcement learning in MDPs with a birth and death structure.
In our main result, we show that the regret of a slightly-tweaked version of the classical learning algorithm UCRL2 is in fact upper bounded by $\tilde{\mathcal{O}}(\sqrt{EAT})$, where $E$ is related to the weighted second moment of the stationary measure of a reference policy.
arXiv Detail & Related papers (2023-02-21T13:28:37Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Trusted Approximate Policy Iteration with Bisimulation Metrics [1.6498361958317633]
Bisimulation metrics define a distance measure between states of a Markov decision process (MDP) based on a comparison of reward sequences.
In this work we first prove that bisimulation metrics can be defined via any $p$-Wasserstein metric for $p \geq 1$.
We then describe an approximate policy iteration (API) procedure that uses $\epsilon$-aggregation with $\pi$-bisimulation and prove performance bounds for continuous state spaces.
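To make the $p$-Wasserstein ingredient concrete, the sketch below computes the one-dimensional $p$-Wasserstein distance between two equal-size empirical samples (exactly, by sorting) and plugs it into an illustrative one-step bisimulation-style distance. The combined formula is a common template, not necessarily the paper's exact definition; all names are assumptions.

```python
import numpy as np

def wasserstein_p_1d(samples_a, samples_b, p=1):
    """p-Wasserstein distance between two 1-D empirical distributions
    with the same number of samples: sort and compare order statistics."""
    a = np.sort(np.asarray(samples_a, dtype=float))
    b = np.sort(np.asarray(samples_b, dtype=float))
    return float(np.mean(np.abs(a - b) ** p) ** (1.0 / p))

def bisim_style_distance(r_s, r_t, succ_s, succ_t, c=0.9, p=1):
    """Illustrative one-step bisimulation-style distance: the reward gap
    plus a discounted Wasserstein term over successor-state samples."""
    return abs(r_s - r_t) + c * wasserstein_p_1d(succ_s, succ_t, p=p)
```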
arXiv Detail & Related papers (2022-02-06T22:41:56Z) - Minimal Expected Regret in Linear Quadratic Control [79.81807680370677]
We devise an online learning algorithm and provide guarantees on its expected regret.
This regret at time $T$ is upper bounded (i) by $\widetilde{O}((d_u+d_x)\sqrt{d_x T})$ when $A$ and $B$ are unknown.
arXiv Detail & Related papers (2021-09-29T14:07:21Z) - Instance-optimality in optimal value estimation: Adaptivity via variance-reduced Q-learning [99.34907092347733]
We analyze the problem of estimating optimal $Q$-value functions for a discounted Markov decision process with discrete states and actions.
Using a local minimax framework, we show that this functional arises in lower bounds on the accuracy of any estimation procedure.
In the other direction, we establish the sharpness of our lower bounds, up to factors logarithmic in the state and action spaces, by analyzing a variance-reduced version of $Q$-learning.
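A hedged sketch of the variance-reduction idea referenced here: recenter the one-sample Bellman backup around a frozen reference estimate, evaluating both on the same sampled transitions, so the stochastic error scales with the distance between $Q$ and the reference rather than with raw returns. This is a simplified synchronous tabular illustration with assumed names, not the paper's analyzed algorithm.

```python
import numpy as np

def vr_q_step(Q, Q_ref, T_ref, next_state, rewards, gamma=0.9, alpha=0.1):
    """One synchronous variance-reduced Q-learning step (tabular sketch).

    Q, Q_ref   -- current and frozen reference estimates, shape (S, A)
    T_ref      -- low-noise Bellman backup of Q_ref (e.g. a batch average)
    next_state -- sampled next state for every (s, a), int array (S, A)
    rewards    -- observed rewards, shape (S, A)
    """
    # Empirical Bellman backups of Q and Q_ref on the SAME sampled
    # transitions, so the noise in their difference is small when Q
    # is close to Q_ref.
    b_q = rewards + gamma * Q[next_state].max(axis=-1)
    b_ref = rewards + gamma * Q_ref[next_state].max(axis=-1)
    target = b_q - b_ref + T_ref          # recentred stochastic operator
    return (1 - alpha) * Q + alpha * target
```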
arXiv Detail & Related papers (2021-06-28T00:38:54Z) - Scattering data and bound states of a squeezed double-layer structure [77.34726150561087]
A structure composed of two parallel homogeneous layers is studied in the limit as their widths $l_1$ and $l_2$, and the distance between them $r$, shrink to zero simultaneously.
The existence of non-trivial bound states is proven in the squeezing limit, including the particular example of the squeezed potential in the form of the derivative of Dirac's delta function.
The scenario in which a single bound state survives in the squeezed system, out of a finite number of bound states in the finite system, is described in detail.
arXiv Detail & Related papers (2020-11-23T14:40:27Z) - Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
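As a baseline for the problem just described, here is a minimal sketch of per-trajectory importance-sampling off-policy evaluation from logged episodes. All names are illustrative, and the paper's minimax-optimal estimator (based on linear function approximation) is not reproduced here.

```python
import numpy as np

def importance_sampling_ope(trajectories, pi_target, pi_behavior, gamma=0.99):
    """Estimate the value of pi_target from logged trajectories.

    trajectories -- list of [(state, action, reward), ...] episodes
    pi_target(a, s), pi_behavior(a, s) -- action probabilities
    """
    estimates = []
    for episode in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for state, action, reward in episode:
            weight *= pi_target(action, state) / pi_behavior(action, state)
            ret += discount * reward
            discount *= gamma
        estimates.append(weight * ret)   # per-trajectory IS estimate
    return float(np.mean(estimates))
```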
arXiv Detail & Related papers (2020-02-21T19:20:57Z) - Naive Exploration is Optimal for Online LQR [49.681825576239355]
We show that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.
Our lower bounds rule out the possibility of a $\mathrm{poly}(\log T)$-regret algorithm, which had been ...
arXiv Detail & Related papers (2020-01-27T03:44:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.