Related papers: Offline Learning for Combinatorial Multi-armed Bandits

Offline Learning for Combinatorial Multi-armed Bandits

URL: http://arxiv.org/abs/2501.19300v1
Date: Fri, 31 Jan 2025 16:56:18 GMT
Title: Offline Learning for Combinatorial Multi-armed Bandits
Authors: Xutong Liu, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, Carlee-Joe Wong, John C. S. Lui, Wei Chen,
Abstract summary: Off-CMAB is the first offline learning framework for CMAB.<n>Off-CMAB combines pessimistic reward estimations with solvers.<n>Experiments on synthetic and real-world datasets highlight the superior performance of CLCB.
Score: 56.96242764723241
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The combinatorial multi-armed bandit (CMAB) is a fundamental sequential decision-making framework, extensively studied over the past decade. However, existing work primarily focuses on the online setting, overlooking the substantial costs of online interactions and the readily available offline datasets. To overcome these limitations, we introduce Off-CMAB, the first offline learning framework for CMAB. Central to our framework is the combinatorial lower confidence bound (CLCB) algorithm, which combines pessimistic reward estimations with combinatorial solvers. To characterize the quality of offline datasets, we propose two novel data coverage conditions and prove that, under these conditions, CLCB achieves a near-optimal suboptimality gap, matching the theoretical lower bound up to a logarithmic factor. We validate Off-CMAB through practical applications, including learning to rank, large language model (LLM) caching, and social influence maximization, showing its ability to handle nonlinear reward functions, general feedback models, and out-of-distribution action samples that excludes optimal or even feasible actions. Extensive experiments on synthetic and real-world datasets further highlight the superior performance of CLCB.

Related papers

RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning [60.84707424369494]
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models (LLMs) on complex tasks.<n>We introduce the Reasoning Boundary Framework++ (RBF++), a framework for evaluating and optimizing measurable boundaries of CoT capability.
arXiv Detail & Related papers (2025-05-19T16:25:55Z)
Online Clustering of Dueling Bandits [59.09590979404303]
We introduce the first "clustering of dueling bandit algorithms" to enable collaborative decision-making based on preference feedback. We propose two novel algorithms: (1) Clustering of Linear Dueling Bandits (COLDB) which models the user reward functions as linear functions of the context vectors, and (2) Clustering of Neural Dueling Bandits (CONDB) which uses a neural network to model complex, non-linear user reward functions.
arXiv Detail & Related papers (2025-02-04T07:55:41Z)
Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Preference-Based Multi-Agent Reinforcement Learning (PbMARL)<n>We identify the Nash equilibrium from a preference-only offline dataset in general-sum games.<n>Our findings underscore the multifaceted approach required for PbMARL.
arXiv Detail & Related papers (2024-09-01T13:14:41Z)
Coordination Failure in Cooperative Offline MARL [3.623224034411137]
We focus on coordination failure and investigate the role of joint actions in multi-agent policy gradients with offline data. By using two-player games as an analytical tool, we demonstrate a simple yet overlooked failure mode of BRUD-based algorithms. We propose an approach to mitigate such failure, by prioritising samples from the dataset based on joint-action similarity.
arXiv Detail & Related papers (2024-07-01T14:51:29Z)
Preference Elicitation for Offline Reinforcement Learning [59.136381500967744]
We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm. Our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy.
arXiv Detail & Related papers (2024-06-26T15:59:13Z)
Combinatorial Multivariant Multi-Armed Bandits with Applications to Episodic Reinforcement Learning and Beyond [58.39457881271146]
We introduce a novel framework of multi-armed bandits (CMAB) with multivariant and probabilistically triggering arms (CMAB-MT) Compared with existing CMAB works, CMAB-MT not only enhances the modeling power but also allows improved results by leveraging distinct statistical properties for multivariant random variables. Our framework can include many important problems as applications, such as episodic reinforcement learning (RL) and probabilistic maximum coverage for goods distribution.
arXiv Detail & Related papers (2024-06-03T14:48:53Z)
LOLA: LLM-Assisted Online Learning Algorithm for Content Experiments [2.2021543101231167]
Modern media firms require automated and efficient methods to identify content that is most engaging and appealing to users. We first investigate the ability of three pure-LLM approaches to identify the catchiest headline: prompt-based methods, embedding-based methods, and fine-tuned open-source LLMs. We then introduce the LLM-Assisted Online Learning Algorithm (LOLA), a novel framework that integrates Large Language Models (LLMs) with adaptive experimentation to optimize content delivery.
arXiv Detail & Related papers (2024-06-03T07:56:58Z)
False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm. We empirically show that SCORE achieves the SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL)
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
Low-Latency Federated Learning over Wireless Channels with Differential Privacy [142.5983499872664]
In federated learning (FL), model training is distributed over clients and local models are aggregated by a central server. In this paper, we aim to minimize FL training delay over wireless channels, constrained by overall training performance as well as each client's differential privacy (DP) requirement.
arXiv Detail & Related papers (2021-06-20T13:51:18Z)
Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism [26.11003309805633]
offline reinforcement learning (RL) algorithms seek to learn an optimal policy from a fixed dataset without active data collection. Based on the composition of the offline dataset, two main categories of methods are used: imitation learning and vanilla offline RL. We present a new offline RL framework that smoothly interpolates between the two extremes of data composition.
arXiv Detail & Related papers (2021-03-22T17:27:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.