Related papers: Bi-Level Contextual Bandits for Individualized Resource Allocation under Delayed Feedback

Bi-Level Contextual Bandits for Individualized Resource Allocation under Delayed Feedback

URL: http://arxiv.org/abs/2511.10572v2
Date: Fri, 14 Nov 2025 05:15:35 GMT
Title: Bi-Level Contextual Bandits for Individualized Resource Allocation under Delayed Feedback
Authors: Mohammadsina Almasi, Hadis Anahideh,
Abstract summary: We propose a novel bi-level contextual bandit framework for individualized resource allocation under delayed feedback.<n>Our results highlight the potential of delay-aware, data-driven decision-making systems to improve institutional policy and social welfare.
Score: 3.0294344089697596
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Equitably allocating limited resources in high-stakes domains-such as education, employment, and healthcare-requires balancing short-term utility with long-term impact, while accounting for delayed outcomes, hidden heterogeneity, and ethical constraints. However, most learning-based allocation frameworks either assume immediate feedback or ignore the complex interplay between individual characteristics and intervention dynamics. We propose a novel bi-level contextual bandit framework for individualized resource allocation under delayed feedback, designed to operate in real-world settings with dynamic populations, capacity constraints, and time-sensitive impact. At the meta level, the model optimizes subgroup-level budget allocations to satisfy fairness and operational constraints. At the base level, it identifies the most responsive individuals within each group using a neural network trained on observational data, while respecting cooldown windows and delayed treatment effects modeled via resource-specific delay kernels. By explicitly modeling temporal dynamics and feedback delays, the algorithm continually refines its policy as new data arrive, enabling more responsive and adaptive decision-making. We validate our approach on two real-world datasets from education and workforce development, showing that it achieves higher cumulative outcomes, better adapts to delay structures, and ensures equitable distribution across subgroups. Our results highlight the potential of delay-aware, data-driven decision-making systems to improve institutional policy and social welfare.

Related papers

Accelerating Privacy-Preserving Federated Learning in Large-Scale LEO Satellite Systems [57.692181589325116]
Large-scale low-Earth-orbit (LEO) satellite systems are increasingly valued for their ability to enable rapid and wide-area data exchange.<n>Due to privacy concerns and regulatory constraints, raw data collected at remote clients cannot be centrally aggregated.<n>Federated learning offers a privacy-preserving alternative by training local models on distributed devices and exchanging only model parameters.<n>We propose a discrete temporal graph-based on-demand scheduling framework that dynamically allocates communication resources to accelerate federated learning.
arXiv Detail & Related papers (2025-09-05T03:33:42Z)
STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning [54.28691219536054]
We introduce STARec, a slow-thinking augmented agent framework that endows recommender systems with autonomous deliberative reasoning capabilities.<n>We develop anchored reinforcement training - a two-stage paradigm combining structured knowledge distillation from advanced reasoning models with preference-aligned reward shaping.<n>Experiments on MovieLens 1M and Amazon CDs benchmarks demonstrate that STARec achieves substantial performance gains compared with state-of-the-art baselines.
arXiv Detail & Related papers (2025-08-26T08:47:58Z)
Technical Report: Facilitating the Adoption of Causal Inference Methods Through LLM-Empowered Co-Pilot [44.336297829718795]
We introduce CATE-B, an open-source co-pilot system that uses large language models (LLMs) within an agentic framework to guide users through treatment effect estimation.<n>CATE-B assists in (i) constructing a structural causal model via causal discovery and LLM-based edge orientation, (ii) identifying robust adjustment sets through a novel Minimal Uncertainty Adjustment Set criterion, and (iii) selecting appropriate regression methods tailored to the causal structure and dataset characteristics.
arXiv Detail & Related papers (2025-08-14T12:20:51Z)
Beamforming and Resource Allocation for Delay Minimization in RIS-Assisted OFDM Systems [38.71413228444903]
This paper investigates a joint beamforming and resource allocation problem in downlink reconfigurable intelligent surface (RIS)-assisted OFDM systems.<n>To effectively handle the mixed action space and reduce the state space dimensionality, a hybrid deep reinforcement learning (DRL) approach is proposed.<n>The proposed algorithm significantly reduces the average delay, enhances resource allocation efficiency, and achieves superior system robustness and fairness.
arXiv Detail & Related papers (2025-06-04T05:33:33Z)
What Makes LLMs Effective Sequential Recommenders? A Study on Preference Intensity and Temporal Context [56.590259941275434]
RecPO is a preference optimization framework for sequential recommendation.<n>It exploits adaptive reward margins based on inferred preference hierarchies and temporal signals.<n>It mirrors key characteristics of human decision-making: favoring timely satisfaction, maintaining coherent preferences, and exercising discernment under shifting contexts.
arXiv Detail & Related papers (2025-06-02T21:09:29Z)
Active Learning for Fair and Stable Online Allocations [6.23798328186465]
We consider feedback from a select subset of agents at each epoch of the online resource allocation process. Our algorithms provide regret bounds that are sub-linear in number of time-periods for various measures. We show that efficient decision-making does not require extensive feedback and produces efficient outcomes for a variety of problem classes.
arXiv Detail & Related papers (2024-06-20T23:23:23Z)
Decentralized Learning Strategies for Estimation Error Minimization with Graph Neural Networks [86.99017195607077]
We address the challenge of sampling and remote estimation for autoregressive Markovian processes in a wireless network with statistically-identical agents.<n>Our goal is to minimize time-average estimation error and/or age of information with decentralized scalable sampling and transmission policies.
arXiv Detail & Related papers (2024-04-04T06:24:11Z)
Non-stationary Delayed Combinatorial Semi-Bandit with Causally Related Rewards [7.0997346625024]
We formalize a non-stationary and delayed semi-bandit problem with causally related rewards. We develop a policy that learns the structural dependencies from delayed feedback and utilizes that to optimize the decision-making. We evaluate our method via numerical analysis using synthetic and real-world datasets to detect the regions that contribute the most to the spread of Covid-19 in Italy.
arXiv Detail & Related papers (2023-07-18T09:22:33Z)
Stateful Offline Contextual Policy Evaluation and Learning [88.9134799076718]
We study off-policy evaluation and learning from sequential data. We formalize the relevant causal structure of problems such as dynamic personalized pricing. We show improved out-of-sample policy performance in this class of relevant problems.
arXiv Detail & Related papers (2021-10-19T16:15:56Z)
Multi-Agent Online Optimization with Delays: Asynchronicity, Adaptivity, and Optimism [33.116006446428756]
We study multi-agent online learning problems in the presence of delays and asynchronicities. We derive adaptive learning strategies with optimal regret bounds, at both the agent and network levels.
arXiv Detail & Related papers (2020-12-21T18:55:55Z)
Dynamic Federated Learning [57.14673504239551]
Federated learning has emerged as an umbrella term for centralized coordination strategies in multi-agent environments. We consider a federated learning model where at every iteration, a random subset of available agents perform local updates based on their data. Under a non-stationary random walk model on the true minimizer for the aggregate optimization problem, we establish that the performance of the architecture is determined by three factors, namely, the data variability at each agent, the model variability across all agents, and a tracking term that is inversely proportional to the learning rate of the algorithm.
arXiv Detail & Related papers (2020-02-20T15:00:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.