Using Non-Stationary Bandits for Learning in Repeated Cournot Games with
Non-Stationary Demand
- URL: http://arxiv.org/abs/2201.00486v1
- Date: Mon, 3 Jan 2022 05:51:47 GMT
- Title: Using Non-Stationary Bandits for Learning in Repeated Cournot Games with
Non-Stationary Demand
- Authors: Kshitija Taywade, Brent Harrison, Judy Goldsmith
- Abstract summary: In this paper, we model repeated Cournot games with non-stationary demand.
The set of arms/actions that an agent can choose from represents discrete production quantities.
We propose a novel algorithm 'Adaptive with Weighted Exploration (AWE) $\epsilon$-greedy' which is remotely based on the well-known $\epsilon$-greedy approach.
- Score: 11.935419090901524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many past attempts at modeling repeated Cournot games assume that demand is
stationary. This does not align with real-world scenarios in which market
demands can evolve over a product's lifetime for a myriad of reasons. In this
paper, we model repeated Cournot games with non-stationary demand such that
firms/agents face separate instances of a non-stationary multi-armed bandit
problem. The set of arms/actions that an agent can choose from represents
discrete production quantities; here, the action space is ordered. Agents are
independent and autonomous, and cannot observe anything from the environment;
they can only see their own rewards after taking an action, and only work
towards maximizing these rewards. We propose a novel algorithm 'Adaptive with
Weighted Exploration (AWE) $\epsilon$-greedy' which is remotely based on the
well-known $\epsilon$-greedy approach. This algorithm detects and quantifies
changes in rewards due to varying market demand, and scales its learning rate
and exploration rate in proportion to the degree of change in demand, thus
enabling agents to better identify new optimal actions. For efficient
exploration, it also deploys a mechanism for weighing actions that takes
advantage of the ordered action space. We use simulations to study the
emergence of various equilibria in the market. In addition, we study the
scalability of our approach in terms of the total number of agents in the
system and the size of the action space. We consider both symmetric and asymmetric firms in
our models. We found that using our proposed method, agents are able to swiftly
change their course of action according to the changes in demand, and they also
engage in collusive behavior in many simulations.
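As an illustration of the mechanism described above, below is a minimal Python sketch of an AWE-style agent. The abstract does not specify the change detector, the rate schedules, or the weighting rule, so the sliding-window statistic, the `sensitivity` constant, and the distance-based exploration weights are assumptions rather than the authors' specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def awe_eps_greedy(pull, n_arms, horizon,
                   base_eps=0.05, base_alpha=0.1, window=50, sensitivity=5.0):
    """Illustrative sketch of an AWE-style epsilon-greedy agent.

    `pull(arm)` returns the (possibly non-stationary) reward for an arm.
    The change statistic and all hyperparameter names are assumptions;
    the paper's exact update rules are not given in the abstract.
    """
    q = np.zeros(n_arms)                 # running value estimates
    recent = []                          # sliding window of recent rewards
    for _ in range(horizon):
        # Detect and quantify demand change by comparing the two halves
        # of the reward window (a simple stand-in for the paper's detector).
        change = 0.0
        if len(recent) == window:
            half = window // 2
            old, new = np.mean(recent[:half]), np.mean(recent[half:])
            change = abs(new - old) / (abs(old) + 1e-8)
        # Scale exploration and learning rates with the degree of change.
        eps = min(1.0, base_eps * (1.0 + sensitivity * change))
        alpha = min(1.0, base_alpha * (1.0 + sensitivity * change))
        best = int(np.argmax(q))
        if rng.random() < eps:
            # Weighted exploration over the *ordered* action space:
            # favor production quantities close to the current greedy one.
            dist = np.abs(np.arange(n_arms) - best)
            weights = 1.0 / (1.0 + dist)
            arm = int(rng.choice(n_arms, p=weights / weights.sum()))
        else:
            arm = best
        r = pull(arm)
        q[arm] += alpha * (r - q[arm])   # constant-step update tracks drift
        recent.append(r)
        if len(recent) > window:
            recent.pop(0)
    return q
```

The intended behavior sits in the two scaled rates: when the window statistic signals a demand shift, both `eps` and `alpha` grow, so the agent re-explores and re-learns faster, while the distance weighting concentrates exploration near the current best quantity, where a new optimum is likely to lie in an ordered action space.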
Related papers
- Competing Bandits in Decentralized Large Contextual Matching Markets [13.313881962771777]
We study decentralized learning in two-sided matching markets where the demand side (aka players or agents) competes for a 'large' supply side (aka arms).
Our proposed algorithms achieve instance-dependent logarithmic regret, scaling independently of the number of arms, $K$.
arXiv Detail & Related papers (2024-11-18T18:08:05Z)
- On Imperfect Recall in Multi-Agent Influence Diagrams [57.21088266396761]
Multi-agent influence diagrams (MAIDs) are a popular game-theoretic model based on Bayesian networks.
We show how to solve MAIDs with forgetful and absent-minded agents using mixed policies and two types of correlated equilibrium.
We also describe applications of MAIDs to Markov games and team situations, where imperfect recall is often unavoidable.
arXiv Detail & Related papers (2023-07-11T07:08:34Z)
- A Black-box Approach for Non-stationary Multi-agent Reinforcement Learning [53.83345471268163]
We investigate learning the equilibria in non-stationary multi-agent systems.
We show how to test for various types of equilibria by a black-box reduction to single-agent learning.
arXiv Detail & Related papers (2023-06-12T23:48:24Z)
- Stochastic Market Games [10.979093424231532]
We propose to utilize market forces to provide incentives for agents to become cooperative.
As demonstrated in an iterated version of the Prisoner's Dilemma, the proposed market formulation can change the dynamics of the game.
We empirically find that the presence of markets can improve both the overall result and agent individual returns via their trading activities.
arXiv Detail & Related papers (2022-07-15T10:37:16Z)
- Dynamic Memory for Interpretable Sequential Optimisation [0.0]
We present a solution to handling non-stationarity that is suitable for deployment at scale.
We develop an adaptive Bayesian learning agent that employs a novel form of dynamic memory.
We describe the architecture of a large-scale deployment of automatic optimisation-as-a-service.
arXiv Detail & Related papers (2022-06-28T12:29:13Z)
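The summary above does not reveal the paper's dynamic-memory mechanism. As a plainly-labeled stand-in, the sketch below shows one standard Bayesian way to stay adaptive under non-stationarity: Thompson sampling with exponentially discounted posterior counts. The discount scheme and names are illustrative assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

def discounted_thompson(pull, n_arms, horizon, gamma=0.99):
    """Thompson sampling for Bernoulli rewards with evidence discounting,
    a common stand-in for 'memory' in non-stationary Bayesian bandits.
    `pull(arm)` returns a reward in {0, 1}. Not the paper's mechanism.
    """
    alpha = np.ones(n_arms)                  # Beta posterior successes
    beta = np.ones(n_arms)                   # Beta posterior failures
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)        # sample one belief per arm
        arm = int(np.argmax(theta))
        r = pull(arm)
        # Exponentially decay all evidence toward the uniform prior so
        # stale observations fade and the posterior can track drift.
        alpha = gamma * alpha + (1 - gamma)
        beta = gamma * beta + (1 - gamma)
        alpha[arm] += r
        beta[arm] += 1 - r
    return alpha / (alpha + beta)            # posterior mean per arm
```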
- Uplifting Bandits [23.262188897812475]
We introduce a multi-armed bandit model where the reward is a sum of multiple random variables, and each action only alters the distributions of some of them.
This model is motivated by marketing campaigns and recommender systems, where the variables represent outcomes on individual customers.
We propose UCB-style algorithms that estimate the uplifts of the actions over a baseline.
arXiv Detail & Related papers (2022-06-08T18:00:56Z)
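The entry above gives the model and the UCB-on-uplifts idea but not the algorithm's details; the sketch below is one plausible reading. The baseline choice (action 0), the bonus constant, and all names are assumptions.

```python
import numpy as np

def uplift_ucb(pull, affects, n_vars, horizon, c=2.0):
    """UCB-style sketch for the uplifting-bandit model.

    `pull(a)` returns an np.ndarray of the n_vars outcome variables under
    action `a`; `affects[a]` lists the indices action `a` can alter, with
    action 0 treated as a do-nothing baseline (affects[0] == []).
    """
    n_actions = len(affects)
    counts = np.zeros(n_actions)
    means = np.zeros((n_actions, n_vars))    # per-action, per-variable means
    for t in range(horizon):
        if t < n_actions:
            a = t                            # pull every action once
        else:
            ucb = np.empty(n_actions)
            for a in range(n_actions):
                # Estimated uplift over the baseline, summed only over the
                # variables the action can alter; the confidence bonus then
                # scales with |affects[a]| rather than with n_vars.
                uplift = sum(means[a, i] - means[0, i] for i in affects[a])
                bonus = len(affects[a]) * np.sqrt(c * np.log(t + 1) / counts[a])
                ucb[a] = uplift + bonus
            a = int(np.argmax(ucb))
        x = pull(a)                          # vector of outcome variables
        counts[a] += 1
        means[a] += (x - means[a]) / counts[a]
    return means
```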
- Off-Beat Multi-Agent Reinforcement Learning [62.833358249873704]
We investigate model-free multi-agent reinforcement learning (MARL) in environments where off-beat actions are prevalent.
We propose a novel episodic memory, LeGEM, for model-free MARL algorithms.
We evaluate LeGEM on various multi-agent scenarios with off-beat actions, including Stag-Hunter Game, Quarry Game, Afforestation Game, and StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2022-05-27T02:21:04Z)
- Modelling Cournot Games as Multi-agent Multi-armed Bandits [4.751331778201811]
We investigate the use of a multi-agent multi-armed bandit (MA-MAB) setting for modeling repeated Cournot oligopoly games.
We find that an $\epsilon$-greedy approach offers a more viable learning mechanism than other traditional MAB approaches.
We propose two novel approaches that take advantage of the ordered action space: $\epsilon$-greedy+HL and $\epsilon$-greedy+EL.
arXiv Detail & Related papers (2022-01-01T22:02:47Z)
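The abstract does not define the HL and EL variants, so the sketch below shows one generic way an $\epsilon$-greedy learner can exploit an ordered action space: restricting exploration to a neighborhood of the current greedy arm. The `radius` parameter and the neighborhood rule are assumptions, not the paper's exact mechanisms.

```python
import numpy as np

rng = np.random.default_rng(0)

def ordered_eps_greedy(pull, n_arms, horizon, eps=0.1, radius=2):
    """Epsilon-greedy over an ordered action space (e.g. production
    quantities), exploring only near the current best arm. Illustrative
    of the ordered-action idea, not the paper's HL/EL rules.
    """
    q = np.zeros(n_arms)
    counts = np.zeros(n_arms)
    for t in range(horizon):
        if t < n_arms:
            arm = t                               # initial sweep of all arms
        elif rng.random() < eps:
            best = int(np.argmax(q))
            lo = max(0, best - radius)            # explore a small window
            hi = min(n_arms, best + radius + 1)   # around the greedy arm
            arm = int(rng.integers(lo, hi))
        else:
            arm = int(np.argmax(q))
        r = pull(arm)
        counts[arm] += 1
        q[arm] += (r - q[arm]) / counts[arm]      # sample-average update
    return q
```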
- Reinforcement Learning in Reward-Mixing MDPs [74.41782017817808]
We consider episodic reinforcement learning in a reward-mixing Markov decision process (MDP).
We learn an $\epsilon$-optimal policy after exploring $\tilde{O}(\mathrm{poly}(H, \epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time-horizon and $S, A$ are the number of states and actions respectively.
arXiv Detail & Related papers (2021-10-07T18:55:49Z)
- Robust Allocations with Diversity Constraints [65.3799850959513]
We show that the Nash Welfare rule, which maximizes the product of agent values, is uniquely positioned to be robust when diversity constraints are introduced.
We also show that the guarantees achieved by Nash Welfare are nearly optimal within a widely studied class of allocation rules.
arXiv Detail & Related papers (2021-09-30T11:09:31Z)
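To make the Nash Welfare rule concrete, here is a tiny brute-force sketch that assigns indivisible items so the product of agent values is maximized; the valuation table and names are illustrative only.

```python
from itertools import product

def nash_welfare_allocation(values):
    """Brute-force Nash Welfare: give each item to one agent so that the
    product of the agents' total values is maximized. `values[i][g]` is
    agent i's value for item g (toy instance; real ones need smarter search).
    """
    n_agents, n_items = len(values), len(values[0])
    best, best_w = None, -1.0
    for assign in product(range(n_agents), repeat=n_items):
        totals = [0.0] * n_agents
        for g, i in enumerate(assign):
            totals[i] += values[i][g]
        w = 1.0
        for v in totals:                # Nash welfare = product of values
            w *= v
        if w > best_w:
            best, best_w = assign, w
    return best, best_w

# Two agents, three items: the rule balances the product, not the sum.
print(nash_welfare_allocation([[3, 1, 1], [1, 2, 2]]))  # -> ((0, 1, 1), 12.0)
```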
- Markov Decision Process modeled with Bandits for Sequential Decision Making in Linear-flow [73.1896399783641]
In membership/subscriber acquisition and retention, we sometimes need to recommend marketing content for multiple pages in sequence.
We propose to formulate the problem as an MDP with Bandits where Bandits are employed to model the transition probability matrix.
We observe the proposed MDP with Bandits algorithm outperforms Q-learning with $\epsilon$-greedy and decreasing $\epsilon$, independent Bandits, and interaction Bandits.
arXiv Detail & Related papers (2021-07-01T03:54:36Z)
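Since the entry above only sketches the idea of bandits modeling page-to-page transition probabilities, here is a minimal hedged illustration of one episode in such a linear flow; the page/content structure and every name are assumptions rather than the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_flow_episode(q, counts, p_true, eps=0.1):
    """One episode of a linear-flow recommendation MDP: a per-page bandit
    picks content, and its learned success rate doubles as the estimated
    transition probability to the next page. All names are illustrative.
    """
    n_pages, n_contents = q.shape
    for s in range(n_pages):
        if rng.random() < eps:                    # epsilon-greedy per page
            a = int(rng.integers(n_contents))
        else:
            a = int(np.argmax(q[s]))
        advanced = rng.random() < p_true[s, a]    # Bernoulli transition
        counts[s, a] += 1
        q[s, a] += (advanced - q[s, a]) / counts[s, a]
        if not advanced:                          # user drops out of the flow
            return s                              # depth reached
    return n_pages

# Hypothetical 3-page flow with 4 content choices per page.
p_true = rng.uniform(0.2, 0.9, size=(3, 4))
q, counts = np.zeros((3, 4)), np.zeros((3, 4))
for _ in range(2000):
    linear_flow_episode(q, counts, p_true)
```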
This list is automatically generated from the titles and abstracts of the papers on this site.