Non-stationary Delayed Combinatorial Semi-Bandit with Causally Related
Rewards
- URL: http://arxiv.org/abs/2307.09093v1
- Date: Tue, 18 Jul 2023 09:22:33 GMT
- Title: Non-stationary Delayed Combinatorial Semi-Bandit with Causally Related
Rewards
- Authors: Saeed Ghoorchian and Setareh Maghsudi
- Abstract summary: We formalize a non-stationary and delayed combinatorial semi-bandit problem with causally related rewards.
We develop a policy that learns the structural dependencies from delayed feedback and utilizes them to optimize decision-making.
We evaluate our method via numerical analysis using synthetic and real-world datasets to detect the regions that contribute the most to the spread of COVID-19 in Italy.
- Score: 7.0997346625024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequential decision-making under uncertainty is often associated with long
feedback delays. Such delays degrade the performance of the learning agent in
identifying a subset of arms with the optimal collective reward in the long
run. This problem becomes significantly more challenging in a non-stationary
environment with structural dependencies amongst the reward distributions
associated with the arms. Therefore, besides adapting to delays and
environmental changes, learning the causal relations alleviates the adverse
effects of feedback delay on the decision-making process. We formalize the
described setting as a non-stationary and delayed combinatorial semi-bandit
problem with causally related rewards. We model the causal relations by a
directed graph in a stationary structural equation model. The agent maximizes
the long-term average payoff, defined as a linear function of the base arms'
rewards. We develop a policy that learns the structural dependencies from
delayed feedback and utilizes them to optimize decision-making while
adapting to drifts. We prove a regret bound for the performance of the proposed
algorithm. In addition, we evaluate our method via numerical analysis using
synthetic and real-world datasets to detect the regions that contribute the
most to the spread of COVID-19 in Italy.
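The reward model described above lends itself to a compact simulation. The sketch below is a minimal illustration under our own assumptions, not the authors' implementation: a stationary linear structural equation model over a directed acyclic graph, where each base arm's observed reward is an exogenous reward plus a weighted sum of its causal parents' observed rewards, and the agent's payoff is the sum of the selected arms' rewards. The adjacency matrix, arm means, and bundle size are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms = 5

# A[i, j] != 0 means base arm j causally influences base arm i.
# A strictly lower-triangular A encodes a directed acyclic graph,
# so (I - A) is invertible.  (Hypothetical weights.)
A = np.tril(rng.uniform(0.1, 0.5, size=(n_arms, n_arms)), k=-1)

def observed_rewards(mu):
    """Draw exogenous rewards around mu and propagate them through the
    linear SEM: r = A @ r + f, hence r = (I - A)^{-1} @ f."""
    f = rng.normal(loc=mu, scale=0.1)
    return np.linalg.solve(np.eye(n_arms) - A, f)

mu = rng.uniform(0.2, 0.8, size=n_arms)   # hypothetical base-arm means

# The agent selects a bundle (super arm) of k base arms; its payoff is a
# linear function of the selected arms' observed rewards (here, the sum).
k = 2
r = observed_rewards(mu)
bundle = np.argsort(r)[-k:]               # oracle selection, for illustration
payoff = r[bundle].sum()
print("bundle:", bundle, "payoff:", payoff)
```

Because the graph is acyclic, (I - A) is invertible and the observed rewards are uniquely determined by the exogenous draws; a learning policy would have to estimate the causal structure and the means from delayed, drifting feedback rather than use the oracle selection shown here.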
Related papers
- Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach [51.76826149868971]
Policy evaluation via Monte Carlo simulation is at the core of many MC Reinforcement Learning (RL) algorithms.
As a quality index, we propose a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths.
We present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO).
arXiv Detail & Related papers (2024-10-17T11:47:56Z)
- Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of stochastic approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
arXiv Detail & Related papers (2024-02-19T03:08:02Z)
- Piecewise-Stationary Combinatorial Semi-Bandit with Causally Related Rewards [5.347237827669861]
We study the piecewise stationary semi-bandit problem with causally related rewards.
In our non-stationary environment, variations in the base arms' distributions, causal relationships between rewards, or both, change the reward generation process.
The problem is aggravated in the semi-bandit setting, where the decision-maker only observes the outcome of the selected bundle of arms.
arXiv Detail & Related papers (2023-07-26T12:06:13Z)
- Linear Combinatorial Semi-Bandit with Causally Related Rewards [5.347237827669861]
We propose a policy that determines the causal relations by learning the network's topology.
We establish a sublinear regret bound for the proposed algorithm.
arXiv Detail & Related papers (2022-12-25T16:05:21Z)
- Gated Recurrent Neural Networks with Weighted Time-Delay Feedback [59.125047512495456]
We introduce a novel gated recurrent unit (GRU) with a weighted time-delay feedback mechanism.
We show that $\tau$-GRU can converge faster and generalize better than state-of-the-art recurrent units and gated recurrent architectures.
arXiv Detail & Related papers (2022-12-01T02:26:34Z)
- Reinforcement Learning with a Terminator [80.34572413850186]
We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds.
We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret.
arXiv Detail & Related papers (2022-05-30T18:40:28Z)
- Break your Bandit Routine with LSD Rewards: a Last Switch Dependent Analysis of Satiation and Seasonality [6.146046338698175]
We introduce a novel non-stationary bandit problem, where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions.
Our model generalizes previous notions of delay-dependent rewards, and also relaxes most assumptions on the reward function.
We propose an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy.
arXiv Detail & Related papers (2021-10-22T14:53:13Z)
- Stochastic Multi-Armed Bandits with Unrestricted Delay Distributions [54.25616645675032]
We study the Multi-Armed Bandit (MAB) problem with random delays in the feedback received by the algorithm.
We consider two settings: the reward-dependent delay setting, where realized delays may depend on the rewards, and the reward-independent delay setting.
Our main contribution is algorithms that achieve near-optimal regret in each of the settings (a generic delayed-feedback decision loop is sketched after this list).
arXiv Detail & Related papers (2021-06-04T12:26:06Z)
- Stochastic bandits with arm-dependent delays [102.63128271054741]
We propose a simple but efficient UCB-based algorithm called PatientBandits.
We provide both problem-dependent and problem-independent bounds on the regret as well as performance lower bounds.
arXiv Detail & Related papers (2020-06-18T12:13:58Z)
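A mechanism shared by the main paper and several of the delayed-bandit papers above is a decision loop in which each reward is generated at play time but revealed only after a random delay, while a sliding window keeps estimates responsive to non-stationarity. The following sketch is a generic illustration of that pattern under assumed ingredients (geometric delays, Gaussian rewards, a windowed UCB index); it is not the algorithm of any specific paper in this list.

```python
import heapq
import math
from collections import deque

import numpy as np

rng = np.random.default_rng(1)
n_arms, horizon, window = 4, 2000, 300
mu = rng.uniform(0.2, 0.8, size=n_arms)    # hypothetical arm means

pending = []                                # heap of (arrival_round, arm, reward)
history = deque()                           # matured (round, arm, reward)
counts = np.zeros(n_arms)
sums = np.zeros(n_arms)

for t in range(1, horizon + 1):
    # Collect feedback whose delay has elapsed.
    while pending and pending[0][0] <= t:
        _, arm, reward = heapq.heappop(pending)
        history.append((t, arm, reward))
        counts[arm] += 1
        sums[arm] += reward
    # Drop observations that fell out of the sliding window (drift adaptation).
    while history and history[0][0] <= t - window:
        _, arm, reward = history.popleft()
        counts[arm] -= 1
        sums[arm] -= reward
    # UCB index on windowed statistics; arms with no feedback yet come first.
    means = np.where(counts > 0, sums / np.maximum(counts, 1), np.inf)
    bonus = np.sqrt(2.0 * math.log(t + 1) / np.maximum(counts, 1))
    choice = int(np.argmax(means + bonus))
    # The environment generates the reward now but reveals it only later.
    delay = int(rng.geometric(0.05))        # assumed delay distribution
    heapq.heappush(pending, (t + delay, choice, float(rng.normal(mu[choice], 0.1))))
```

Replacing the single-arm choice with a top-k selection over estimated base-arm rewards turns the same skeleton into a combinatorial semi-bandit loop.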