Online Matching via Reinforcement Learning: An Expert Policy Orchestration Strategy
- URL: http://arxiv.org/abs/2510.06515v1
- Date: Tue, 07 Oct 2025 23:26:16 GMT
- Title: Online Matching via Reinforcement Learning: An Expert Policy Orchestration Strategy
- Authors: Chiara Mignacco, Matthieu Jonckheere, Gilles Stoltz
- Abstract summary: We propose a reinforcement learning (RL) approach that learns to orchestrate a set of such expert policies. We establish both expectation and high-probability regret guarantees and derive a novel finite-time bias bound for temporal-difference learning. Our results highlight how structured, adaptive learning can improve the modeling and management of complex resource allocation and decision-making processes.
- Score: 5.913458789333235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online matching problems arise in many complex systems, from cloud services and online marketplaces to organ exchange networks, where timely, principled decisions are critical for maintaining high system performance. Traditional heuristics in these settings are simple and interpretable but typically tailored to specific operating regimes, which can lead to inefficiencies when conditions change. We propose a reinforcement learning (RL) approach that learns to orchestrate a set of such expert policies, leveraging their complementary strengths in a data-driven, adaptive manner. Building on the Adv2 framework (Jonckheere et al., 2024), our method combines expert decisions through advantage-based weight updates and extends naturally to settings where only estimated value functions are available. We establish both expectation and high-probability regret guarantees and derive a novel finite-time bias bound for temporal-difference learning, enabling reliable advantage estimation even under constant step size and non-stationary dynamics. To support scalability, we introduce a neural actor-critic architecture that generalizes across large state spaces while preserving interpretability. Simulations on stochastic matching models, including an organ exchange scenario, show that the orchestrated policy converges faster and yields higher system level efficiency than both individual experts and conventional RL baselines. Our results highlight how structured, adaptive learning can improve the modeling and management of complex resource allocation and decision-making processes.
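The advantage-based orchestration described in the abstract can be illustrated with a minimal sketch. This is not the paper's actual Adv2 implementation: the names `orchestrate`, `advantage`, and `eta` are illustrative, and the advantage function is assumed given here, whereas in the paper it would itself be estimated via temporal-difference learning.

```python
import math
import random

def orchestrate(experts, advantage, state, weights, eta=0.1):
    """One round of advantage-based expert orchestration (sketch).

    experts   : list of policies, each mapping state -> action
    advantage : function (state, action) -> estimated advantage A(s, a)
    weights   : current positive (unnormalized) expert weights
    eta       : learning rate for the multiplicative update
    Returns the chosen action and the updated weight list.
    """
    # Each expert proposes an action for the current state.
    proposals = [pi(state) for pi in experts]

    # Sample an expert in proportion to its weight (exponential-weights style)
    # and play its proposed action.
    k = random.choices(range(len(experts)), weights=weights)[0]
    action = proposals[k]

    # Multiplicative update: experts whose proposals carry higher estimated
    # advantage gain weight, so the mixture adapts toward the better experts.
    new_weights = [w * math.exp(eta * advantage(state, a))
                   for w, a in zip(weights, proposals)]
    return action, new_weights
```

With two constant experts and an advantage function that favors action 1, repeated rounds shift the weight mass toward the second expert, which is the qualitative behavior the regret guarantees formalize.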
Related papers
- Sample-Efficient Neurosymbolic Deep Reinforcement Learning [49.60927398960061]
We propose a neuro-symbolic Deep RL approach that integrates background symbolic knowledge to improve sample efficiency. Online reasoning is performed to guide the training process through two mechanisms. We show improved performance over a state-of-the-art reward machine baseline.
arXiv Detail & Related papers (2026-01-06T09:28:53Z) - Interpretable by Design: Query-Specific Neural Modules for Explainable Reinforcement Learning [0.3655021726150367]
We architect RL systems as inference engines that can answer diverse queries about their environment. We introduce Query Conditioned Deterministic Inference Networks (QDIN), a unified architecture that treats different types of queries as first-class citizens. Our key empirical finding reveals a fundamental decoupling: inference accuracy can reach near-perfect levels even when control performance remains suboptimal.
arXiv Detail & Related papers (2025-11-11T20:08:32Z) - Optimal Information Combining for Multi-Agent Systems Using Adaptive Bias Learning [0.0]
Current approaches either ignore these biases, leading to suboptimal decisions, or require expensive calibration procedures that are often infeasible in practice. This paper addresses the fundamental question: when can we learn and correct for these unknown biases to recover near-optimal performance? We develop a theoretical framework that decomposes biases into learnable systematic components and irreducible components. We show that systems with high learnability ratios can recover significant performance, while those with low learnability show minimal benefit.
arXiv Detail & Related papers (2025-10-28T21:52:33Z) - Adaptive Reinforcement Learning for Dynamic Configuration Allocation in Pre-Production Testing [4.370892281528124]
We introduce a novel reinforcement learning framework that recasts configuration allocation as a sequential decision-making problem. Our method is the first to integrate Q-learning with a hybrid reward design that fuses simulated outcomes and real-time feedback.
arXiv Detail & Related papers (2025-10-02T05:12:28Z) - Adaptive Approach to Enhance Machine Learning Scheduling Algorithms During Runtime Using Reinforcement Learning in Metascheduling Applications [0.0]
We propose an adaptive online learning unit integrated within the metascheduler to enhance performance in real-time. In the online mode, Reinforcement Learning plays a pivotal role by continuously exploring and discovering new scheduling solutions. Several RL models were implemented within the online learning unit, each designed to address specific challenges in scheduling.
arXiv Detail & Related papers (2025-09-24T19:46:22Z) - STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning [54.28691219536054]
We introduce STARec, a slow-thinking augmented agent framework that endows recommender systems with autonomous deliberative reasoning capabilities. We develop anchored reinforcement training, a two-stage paradigm combining structured knowledge distillation from advanced reasoning models with preference-aligned reward shaping. Experiments on MovieLens 1M and Amazon CDs benchmarks demonstrate that STARec achieves substantial performance gains compared with state-of-the-art baselines.
arXiv Detail & Related papers (2025-08-26T08:47:58Z) - Meta-learning Structure-Preserving Dynamics [6.088897644268474]
We introduce a modulation-based meta-learning framework that conditions structure-preserving models on compact latent representations of potentially unknown system parameters. We enable scalable and generalizable learning across parametric families of dynamical systems.
arXiv Detail & Related papers (2025-08-15T04:30:27Z) - Scalable In-Context Q-Learning [68.9917436397079]
We propose Scalable In-Context Q-Learning (SICQL) to steer in-context reinforcement learning. SICQL harnesses dynamic programming and world modeling to steer ICRL toward efficient reward and task generalization.
arXiv Detail & Related papers (2025-06-02T04:21:56Z) - Microservices-Based Framework for Predictive Analytics and Real-time Performance Enhancement in Travel Reservation Systems [1.03590082373586]
The paper presents an architecture framework dedicated to enhancing the performance of real-time travel reservation systems. Our framework includes real-time predictive analytics, through machine learning models, that optimize forecasting of customer demand, dynamic pricing, and system performance. Future work will investigate advanced AI models and edge processing to further improve the performance and robustness of the systems employed.
arXiv Detail & Related papers (2024-12-20T07:19:42Z) - Continual Task Learning through Adaptive Policy Self-Composition [54.95680427960524]
CompoFormer is a structure-based continual transformer model that adaptively composes previous policies via a meta-policy network.
Our experiments reveal that CompoFormer outperforms conventional continual learning (CL) methods, particularly in longer task sequences.
arXiv Detail & Related papers (2024-11-18T08:20:21Z) - Resilient Constrained Learning [94.27081585149836]
This paper presents a constrained learning approach that adapts the requirements while simultaneously solving the learning task.
We call this approach resilient constrained learning after the term used to describe ecological systems that adapt to disruptions by modifying their operation.
arXiv Detail & Related papers (2023-06-04T18:14:18Z) - Stabilizing Q-learning with Linear Architectures for Provably Efficient Learning [53.17258888552998]
This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation.
We show that the performance of the algorithm degrades very gracefully under a novel and more permissive notion of approximation error.
arXiv Detail & Related papers (2022-06-01T23:26:51Z) - Recursive Experts: An Efficient Optimal Mixture of Learning Systems in Dynamic Environments [0.0]
Sequential learning systems are used in a wide variety of problems from decision making to optimization.
The goal is to reach an objective by exploiting the temporal relation inherent in nature's feedback (state).
We propose an efficient optimal mixture framework for general sequential learning systems.
arXiv Detail & Related papers (2020-09-19T15:02:27Z) - Dynamic Federated Learning [57.14673504239551]
Federated learning has emerged as an umbrella term for centralized coordination strategies in multi-agent environments.
We consider a federated learning model where at every iteration, a random subset of available agents perform local updates based on their data.
Under a non-stationary random walk model on the true minimizer for the aggregate optimization problem, we establish that the performance of the architecture is determined by three factors, namely, the data variability at each agent, the model variability across all agents, and a tracking term that is inversely proportional to the learning rate of the algorithm.
arXiv Detail & Related papers (2020-02-20T15:00:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.