An Exploration-free Method for a Linear Stochastic Bandit Driven by a Linear Gaussian Dynamical System
- URL: http://arxiv.org/abs/2504.03926v1
- Date: Fri, 04 Apr 2025 20:46:35 GMT
- Title: An Exploration-free Method for a Linear Stochastic Bandit Driven by a Linear Gaussian Dynamical System
- Authors: Jonathan Gornet, Yilin Mo, Bruno Sinopoli
- Abstract summary: In multi-armed bandits, a major problem the learner faces is the trade-off between exploration and exploitation. In this paper, we introduce a linear bandit setting where the reward is the output of a linear Gaussian dynamical system. We propose Kalman filter Observability Dependent Exploration (KODE), an exploration-free method that utilizes the Kalman filter predictions to select actions.
- Score: 0.9217021281095907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In stochastic multi-armed bandits, a major problem the learner faces is the trade-off between exploration and exploitation. Recently, exploration-free methods -- methods that commit to the action predicted to return the highest reward -- have been studied from the perspective of linear bandits. In this paper, we introduce a linear bandit setting where the reward is the output of a linear Gaussian dynamical system. Motivated by a problem encountered in hyperparameter optimization for reinforcement learning, where the number of actions is much higher than the number of training iterations, we propose Kalman filter Observability Dependent Exploration (KODE), an exploration-free method that utilizes the Kalman filter predictions to select actions. The major contribution of this work is our analysis of the proposed method's performance, which depends on the observability properties of the underlying linear Gaussian dynamical system. We evaluate KODE via two different metrics: regret, which is the cumulative expected difference between the highest possible reward and the reward sampled by KODE, and action alignment, which measures how closely KODE's chosen action aligns with the linear Gaussian dynamical system's state variable. To provide intuition on the performance, we prove that KODE implicitly encourages the learner to explore actions depending on the observability of the linear Gaussian dynamical system. This method is compared to several well-known stochastic multi-armed bandit algorithms to validate our theoretical results.
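As a rough illustration of the exploration-free rule the abstract describes (commit to the action with the highest Kalman-filter-predicted reward), here is a minimal sketch. It assumes a toy setup not taken from the paper: the action set is the set of coordinate unit vectors, the reward of action a at time t is a^T x_t plus Gaussian noise, and the system matrices A, Q, R are known. It is not the authors' KODE implementation, only the greedy-on-Kalman-predictions idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model (not the paper's exact setup):
#   state   x_{t+1} = A x_t + w_t,        w_t ~ N(0, Q)
#   reward  r_t     = a_t^T x_t + v_t,    v_t ~ N(0, R), a_t = chosen action
n, T = 3, 200
A = 0.9 * np.eye(n) + 0.05 * rng.standard_normal((n, n))
Q, R = 0.1 * np.eye(n), 0.5
actions = np.eye(n)                       # one action per state coordinate (toy action set)

x = rng.standard_normal(n)                # true state
x_hat, P = np.zeros(n), np.eye(n)         # Kalman filter mean and covariance

for t in range(T):
    # Exploration-free (greedy) choice: the action with the highest predicted reward
    a = actions[np.argmax(actions @ x_hat)]

    # Sample the reward from the true system
    r = a @ x + np.sqrt(R) * rng.standard_normal()

    # Kalman update, treating the chosen action as the (time-varying) observation row
    S = a @ P @ a + R                     # innovation variance
    K = P @ a / S                         # Kalman gain
    x_hat = x_hat + K * (r - a @ x_hat)
    P = P - np.outer(K, a @ P)

    # Kalman predict step for the next round, then advance the true state
    x_hat = A @ x_hat
    P = A @ P @ A.T + Q
    x = A @ x + rng.multivariate_normal(np.zeros(n), Q)
```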
Related papers
- MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization [91.80034860399677]
Reinforcement learning algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards. We introduce a framework, MaxInfoRL, for balancing intrinsic and extrinsic exploration. We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits.
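For intuition on the intrinsic-plus-extrinsic trade-off in the multi-armed-bandit setting mentioned above, the hedged sketch below scores each arm by its posterior mean plus a weighted information-gain bonus. The Gaussian arm model, the weight `lam`, and the closed-form gain 0.5 * log(1 + var / noise_var) are illustrative assumptions, not MaxInfoRL's actual objective.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy Gaussian bandit: each arm k has an unknown mean with posterior N(mu[k], var[k])
K, noise_var, lam = 5, 1.0, 0.5                 # lam trades off extrinsic vs. intrinsic
true_means = rng.normal(0.0, 1.0, K)
mu, var = np.zeros(K), np.ones(K)

for t in range(500):
    # Extrinsic term: posterior mean reward.  Intrinsic term: expected information
    # gain of one more pull, 0.5 * log(1 + var[k] / noise_var) for a Gaussian arm.
    info_gain = 0.5 * np.log1p(var / noise_var)
    k = int(np.argmax(mu + lam * info_gain))

    r = true_means[k] + rng.normal(0.0, np.sqrt(noise_var))

    # Conjugate Gaussian posterior update for the pulled arm
    precision = 1.0 / var[k] + 1.0 / noise_var
    mu[k] = (mu[k] / var[k] + r / noise_var) / precision
    var[k] = 1.0 / precision
```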
arXiv Detail & Related papers (2024-12-16T18:59:53Z) - Learning Controlled Stochastic Differential Equations [61.82896036131116]
This work proposes a novel method for estimating both drift and diffusion coefficients of continuous, multidimensional, nonlinear controlled differential equations with non-uniform diffusion.
We provide strong theoretical guarantees, including finite-sample bounds for $L^2$, $L^\infty$, and risk metrics, with learning rates adaptive to the coefficients' regularity.
Our method is available as an open-source Python library.
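To make "estimating drift and diffusion coefficients" concrete, here is a generic sketch on a one-dimensional linear example, using plain Euler-Maruyama increments and least squares rather than the paper's estimator; the model and all parameter names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate a 1-D controlled SDE  dx = (a*x + b*u) dt + sigma dW  with Euler-Maruyama
a_true, b_true, sigma_true = -1.0, 0.5, 0.3
dt, T = 0.01, 20000
x = np.zeros(T)
u = rng.uniform(-1.0, 1.0, T)                   # exploratory control inputs
for t in range(T - 1):
    drift = a_true * x[t] + b_true * u[t]
    x[t + 1] = x[t] + drift * dt + sigma_true * np.sqrt(dt) * rng.standard_normal()

# Drift estimation: regress increments / dt on the features (x_t, u_t)
dX = np.diff(x) / dt
Phi = np.column_stack([x[:-1], u[:-1]])
a_hat, b_hat = np.linalg.lstsq(Phi, dX, rcond=None)[0]

# Diffusion estimation: quadratic variation of the residual increments
resid = np.diff(x) - (a_hat * x[:-1] + b_hat * u[:-1]) * dt
sigma_hat = np.sqrt(np.mean(resid ** 2) / dt)

print(a_hat, b_hat, sigma_hat)                  # should roughly recover -1.0, 0.5, 0.3
```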
arXiv Detail & Related papers (2024-11-04T11:09:58Z) - Restless Bandit Problem with Rewards Generated by a Linear Gaussian Dynamical System [0.0]
Decision-making under uncertainty is a fundamental problem encountered frequently and can be formulated as a multi-armed bandit problem.
We propose a method that takes a linear combination of previously observed rewards for predicting each action's next reward.
We show that, regardless of the sequence of previous actions chosen, the reward sampled for any previously chosen action can be used for predicting another action's future reward.
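A minimal sketch of the cross-action prediction idea above, assuming a shared latent AR(1) state that drives both reward streams; the horizon, window length p, and noise levels are illustrative choices, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two reward streams driven by a shared scalar AR(1) state (an assumed toy model)
T, p = 400, 3                                   # horizon and number of past rewards used
z = np.zeros(T)
for t in range(T - 1):
    z[t + 1] = 0.95 * z[t] + 0.1 * rng.standard_normal()
rewards = np.stack([ z + 0.2 * rng.standard_normal(T),     # reward stream of action 0
                    -z + 0.2 * rng.standard_normal(T)])    # reward stream of action 1

# Fit, by least squares, weights that predict action 0's *next* reward from the
# last p observed rewards of action 1 (cross-action prediction, as described above)
X = np.array([rewards[1, t - p + 1:t + 1] for t in range(p - 1, T - 1)])
y = np.array([rewards[0, t + 1] for t in range(p - 1, T - 1)])
w = np.linalg.lstsq(X, y, rcond=None)[0]

# One-step prediction of action 0's next reward from action 1's most recent rewards;
# most of the fitted weight lands on the latest reward, since r0[t+1] tracks -0.95 * r1[t]
pred = rewards[1, -p:] @ w
print(w, pred)
```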
arXiv Detail & Related papers (2024-05-15T05:33:49Z) - Risk-Sensitive Stochastic Optimal Control as Rao-Blackwellized Markovian
Score Climbing [3.9410617513331863]
Stochastic optimal control of dynamical systems is a crucial challenge in sequential decision-making.
Control-as-inference approaches have had considerable success, providing a viable risk-sensitive framework to address the exploration-exploitation dilemma.
This paper introduces a novel perspective by framing risk-sensitive control as Markovian reinforcement score climbing under samples drawn from a conditional particle filter.
arXiv Detail & Related papers (2023-12-21T16:34:03Z) - Exploration via linearly perturbed loss minimisation [4.856378369489158]
We introduce exploration via linear loss perturbations (EVILL) for structured bandit problems.
We show that, for the case of generalised linear bandits, EVILL reduces to perturbed history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards.
We propose data-dependent perturbations not present in previous PHE-type methods that allow EVILL to match the performance of Thompson-sampling-style parameter-perturbation methods.
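To make "training on randomly perturbed rewards" concrete, below is a minimal perturbed-history-exploration loop for a finite-armed linear bandit. It sketches plain PHE rather than EVILL's data-dependent loss perturbation, and the dimensions, perturbation scale, and warm-up scheme are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Linear bandit with a fixed finite action set in R^d (assumed toy instance)
d, K, T = 4, 20, 1000
noise_sd, perturb_sd = 0.5, 1.0
theta_star = rng.standard_normal(d)
arms = rng.standard_normal((K, d)) / np.sqrt(d)

X, y = [], []
for t in range(T):
    if t < d:
        a = arms[t]                       # a few warm-up pulls so the fit is well-posed
    else:
        # Perturbed-history exploration: refit on rewards with fresh Gaussian
        # pseudo-noise added, then act greedily on the perturbed estimate
        y_pert = np.array(y) + perturb_sd * rng.standard_normal(len(y))
        theta_hat = np.linalg.lstsq(np.array(X), y_pert, rcond=None)[0]
        a = arms[np.argmax(arms @ theta_hat)]

    r = a @ theta_star + noise_sd * rng.standard_normal()
    X.append(a)
    y.append(r)
```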
arXiv Detail & Related papers (2023-11-13T18:54:43Z) - Outlier-Insensitive Kalman Filtering: Theory and Applications [26.889182816155838]
We propose a parameter-free algorithm that mitigates the harmful effect of outliers while requiring only a short iterative process applied to the standard update step of the linear Kalman filter.
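For context, the sketch below shows one standard Kalman update step with a simple innovation-gating robustification (inflating the measurement-noise covariance when the normalized innovation is too large). This is a generic technique shown for intuition only, not the parameter-free algorithm of the paper above; all names are illustrative.

```python
import numpy as np

def robust_kf_update(x_hat, P, y, H, R, gate=3.0):
    """One Kalman update step with a simple innovation gate: if the normalized
    innovation is too large, the measurement-noise covariance is inflated so the
    suspected outlier barely moves the estimate."""
    innov = y - H @ x_hat
    S = H @ P @ H.T + R
    d2 = innov @ np.linalg.solve(S, innov)        # squared Mahalanobis distance
    if d2 > gate ** 2:
        R = R * (d2 / gate ** 2)                  # down-weight the suspected outlier
        S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x_hat = x_hat + K @ innov
    P = (np.eye(len(x_hat)) - K @ H) @ P
    return x_hat, P
```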
arXiv Detail & Related papers (2023-09-18T06:33:28Z) - Optimistic Active Exploration of Dynamical Systems [52.91573056896633]
We develop an algorithm for active exploration called OPAX.
We show how OPAX can be reduced to an optimal control problem that can be solved at each episode.
Our experiments show that OPAX is not only theoretically sound but also performs well for zero-shot planning on novel downstream tasks.
arXiv Detail & Related papers (2023-06-21T16:26:59Z) - Data-Driven Response Regime Exploration and Identification for Dynamical
Systems [0.0]
Data-Driven Response Regime Exploration and Identification (DR$^2$EI) is a novel and fully data-driven method for identifying and classifying response regimes of a dynamical system.
DR$^2$EI utilizes unsupervised learning algorithms to transform the system's response into an embedding space that facilitates regime classification.
The performance of the DR$^2$EI method was evaluated by analyzing three established dynamical systems.
arXiv Detail & Related papers (2023-04-07T00:11:49Z) - Generative Adversarial Reward Learning for Generalized Behavior Tendency
Inference [71.11416263370823]
We propose a generative inverse reinforcement learning approach for user behavioural preference modelling.
Our model can automatically learn the rewards from users' actions based on a discriminative actor-critic network and a Wasserstein GAN.
arXiv Detail & Related papers (2021-05-03T13:14:25Z) - Reinforcement Learning with Fast Stabilization in Linear Dynamical
Systems [91.43582419264763]
We study model-based reinforcement learning (RL) in unknown stabilizable linear dynamical systems.
We propose an algorithm that certifies fast stabilization of the underlying system by effectively exploring the environment.
We show that the proposed algorithm attains $\tilde{\mathcal{O}}(\sqrt{T})$ regret after $T$ time steps of agent-environment interaction.
arXiv Detail & Related papers (2020-07-23T23:06:40Z) - Meta-learning with Stochastic Linear Bandits [120.43000970418939]
We consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a squared Euclidean distance to a bias vector (see the sketch after this entry).
We show both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation.
arXiv Detail & Related papers (2020-05-18T08:41:39Z)
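The regularization described in this entry, a squared Euclidean distance to a bias vector h, amounts to ridge regression shrunk toward h instead of toward zero. A minimal sketch, with all names assumed for illustration:

```python
import numpy as np

def biased_ridge(X, y, h, lam):
    """Least squares with the regularizer lam * ||theta - h||^2: the estimate is
    shrunk toward the bias vector h instead of toward zero."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * h)
```

With h = 0 this reduces to the usual ridge estimate around which OFUL-style confidence sets are built.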