Related papers: Representative Action Selection for Large Action Space: From Bandits to MDPs

Representative Action Selection for Large Action Space: From Bandits to MDPs

URL: http://arxiv.org/abs/2511.22104v1
Date: Thu, 27 Nov 2025 04:49:23 GMT
Title: Representative Action Selection for Large Action Space: From Bandits to MDPs
Authors: Quan Zhou, Shie Mannor,
Abstract summary: We study the problem of selecting a small, representative action subset from an extremely large action space shared across a family of reinforcement learning (RL) environments.<n>Our goal is to identify a fixed subset of actions that, for every environment in the family, contains a near-optimal action, thereby enabling efficient learning without exhaustively evaluating all actions.
Score: 47.980675309210746
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the problem of selecting a small, representative action subset from an extremely large action space shared across a family of reinforcement learning (RL) environments -- a fundamental challenge in applications like inventory management and recommendation systems, where direct learning over the entire space is intractable. Our goal is to identify a fixed subset of actions that, for every environment in the family, contains a near-optimal action, thereby enabling efficient learning without exhaustively evaluating all actions. This work extends our prior results for meta-bandits to the more general setting of Markov Decision Processes (MDPs). We prove that our existing algorithm achieves performance comparable to using the full action space. This theoretical guarantee is established under a relaxed, non-centered sub-Gaussian process model, which accommodates greater environmental heterogeneity. Consequently, our approach provides a computationally and sample-efficient solution for large-scale combinatorial decision-making under uncertainty.

Related papers

Interaction-Grounded Learning for Contextual Markov Decision Processes with Personalized Feedback [59.287761696290865]
We propose a computationally efficient algorithm that achieves a sublinear regret guarantee for contextual episodic Markov Decision Processes (MDPs) with personalized feedback.<n>We demonstrate the effectiveness of our method in learning personalized objectives from multi-turn interactions through experiments on both a synthetic episodic MDP and a real-world user booking dataset.
arXiv Detail & Related papers (2026-02-09T06:29:54Z)
DynaAct: Large Language Model Reasoning with Dynamic Action Spaces [58.298135359318024]
We propose a novel framework named textscDynaAct for automatically constructing a compact action space.<n>Our approach significantly improves overall performance, while maintaining efficient inference without introducing substantial latency.
arXiv Detail & Related papers (2025-11-11T09:47:13Z)
Representative Action Selection for Large Action Space Meta-Bandits [45.81364806019332]
We study the problem of selecting a subset from a large action space shared by a family of bandits.<n>We assume that similar actions tend to have related payoffs, modeled by a Gaussian process.<n>We propose a simple epsilon-net algorithm to select a representative subset.
arXiv Detail & Related papers (2025-05-23T18:08:57Z)
Fair Resource Allocation in Weakly Coupled Markov Decision Processes [3.824858358548714]
We consider fair resource allocation in sequential decision-making environments modeled as weakly coupled Markov decision processes.<n>We adopt a fairness definition using the generalized Gini function instead of the traditional utilitarian (total-sum) objective.
arXiv Detail & Related papers (2024-11-14T20:40:55Z)
EVOLvE: Evaluating and Optimizing LLMs For In-Context Exploration [76.66831821738927]
Large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty.<n>We measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications.<n>Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs.
arXiv Detail & Related papers (2024-10-08T17:54:03Z)
Risk-sensitive Markov Decision Process and Learning under General Utility Functions [3.069335774032178]
Reinforcement Learning (RL) has gained substantial attention across diverse application domains and theoretical investigations.<n>We consider a scenario where the decision-maker seeks to optimize a general utility function of the cumulative reward in the framework of a decision process (MDP)<n>We propose a modified value iteration algorithm that employs an epsilon-covering over the space of cumulative reward.<n>In the absence of a simulator, our algorithm, designed with an upper-confidence-bound exploration approach, identifies a near-optimal policy.
arXiv Detail & Related papers (2023-11-22T18:50:06Z)
Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called textttMEX. textttMEX integrates estimation and planning components while balancing exploration exploitation automatically. It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
Action Pick-up in Dynamic Action Space Reinforcement Learning [6.15205100319133]
We propose an intelligent Action Pick-up (AP) algorithm to autonomously choose valuable actions that are most likely to boost performance from a set of new actions. In this paper, we first theoretically analyze and find that a prior optimal policy plays an important role in action pick-up by providing useful knowledge and experience. We then design two different AP methods: frequency-based global method and state clustering-based local method, based on the prior optimal policy.
arXiv Detail & Related papers (2023-04-03T10:55:16Z)
Scalable Distributional Robustness in a Class of Non Convex Optimization with Guarantees [7.541571634887807]
Distributionally robust optimization (DRO) has shown robustness in learning as well as sample based problems. We propose a mixed-integer clustering program (MISOCP) which does not scale enough to solve problems with real worldsets.
arXiv Detail & Related papers (2022-05-31T09:07:01Z)
Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension. We construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic estimated action values to control the variance of gradient estimation. These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.