Related papers: Optimising Call Centre Operations using Reinforcement Learning: Value Iteration versus Proximal Policy Optimisation

Optimising Call Centre Operations using Reinforcement Learning: Value Iteration versus Proximal Policy Optimisation

URL: http://arxiv.org/abs/2507.18398v1
Date: Thu, 24 Jul 2025 13:31:38 GMT
Title: Optimising Call Centre Operations using Reinforcement Learning: Value Iteration versus Proximal Policy Optimisation
Authors: Kwong Ho Li, Wathsala Karunarathne,
Abstract summary: This paper investigates the application of Reinforcement Learning (RL) to optimise call routing in call centres.<n>Two methods are compared: a model-based approach using Value Iteration (VI) under known system dynamics, and a model-free approach using Proximal Policy optimisation (PPO) that learns from experience.
Score: 0.0
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: This paper investigates the application of Reinforcement Learning (RL) to optimise call routing in call centres to minimise client waiting time and staff idle time. Two methods are compared: a model-based approach using Value Iteration (VI) under known system dynamics, and a model-free approach using Proximal Policy Optimisation (PPO) that learns from experience. For the model-based approach, a theoretical model is used, while a simulation model combining Discrete Event Simulation (DES) with the OpenAI Gym environment is developed for model-free learning. Both models frame the problem as a Markov Decision Process (MDP) within a Skills-Based Routing (SBR) framework, with Poisson client arrivals and exponentially distributed service and abandonment times. For policy evaluation, random, VI, and PPO policies are evaluated using the simulation model. After 1,000 test episodes, PPO consistently achives the highest rewards, along with the lowest client waiting time and staff idle time, despite requiring longer training time.

Related papers

$V_0$: A Generalist Value Model for Any Policy at State Zero [80.7505802128501]
Policy methods rely on a baseline to measure the relative advantage of an action.<n>This baseline is typically estimated by a Value Model (Critic) often as large as the policy model itself.<n>We propose a Generalist Value Model capable of estimating the expected performance of any model on unseen prompts.
arXiv Detail & Related papers (2026-02-03T14:35:23Z)
Reinforcement Learning in Queue-Reactive Models: Application to Optimal Execution [0.35932002706017546]
We investigate the use of Reinforcement Learning for the optimal execution of meta-orders.<n>The objective is to execute incrementally large orders while minimizing implementation shortfall and market impact.<n>We employ the Queue-Reactive Model to generate realistic and tractable limit order book simulations.
arXiv Detail & Related papers (2025-11-19T09:26:23Z)
STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning [54.28691219536054]
We introduce STARec, a slow-thinking augmented agent framework that endows recommender systems with autonomous deliberative reasoning capabilities.<n>We develop anchored reinforcement training - a two-stage paradigm combining structured knowledge distillation from advanced reasoning models with preference-aligned reward shaping.<n>Experiments on MovieLens 1M and Amazon CDs benchmarks demonstrate that STARec achieves substantial performance gains compared with state-of-the-art baselines.
arXiv Detail & Related papers (2025-08-26T08:47:58Z)
Intention-Conditioned Flow Occupancy Models [69.79049994662591]
Large-scale pre-training has fundamentally changed how machine learning research is done today.<n>Applying this same framework to reinforcement learning is appealing because it offers compelling avenues for addressing core challenges in RL.<n>Recent advances in generative AI have provided new tools for modeling highly complex distributions.
arXiv Detail & Related papers (2025-06-10T15:27:46Z)
Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance [52.65461207786633]
Policy-based Reinforcement Learning from Human Feedback is essential for aligning large language models with human preferences.<n>It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance.<n>We propose textbfDecoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained emphglobal value model (GVM)
arXiv Detail & Related papers (2025-02-24T08:11:33Z)
Fine-Tuning Language Models with Reward Learning on Policy [68.70065254564642]
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences. Despite its popularity, (fixed) reward models may suffer from inaccurate off-distribution. We propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.
arXiv Detail & Related papers (2024-03-28T10:02:10Z)
Simplified Temporal Consistency Reinforcement Learning [19.814047499837084]
We show that a simple representation learning approach relying on a latent dynamics model trained by latent temporal consistency is sufficient for high-performance RL. Our approach outperforms model-free methods by a large margin and matches model-based methods' sample efficiency while training 2.4 times faster.
arXiv Detail & Related papers (2023-06-15T19:37:43Z)
The Virtues of Laziness in Model-based RL: A Unified Objective and Algorithms [37.025378882978714]
We propose a novel approach to addressing two fundamental challenges in Model-based Reinforcement Learning (MBRL) Our "lazy" method leverages a novel unified objective, Performance Difference via Advantage in Model, to capture the performance difference between the learned policy and expert policy. We present two no-regret algorithms to optimize the proposed objective, and demonstrate their statistical and computational gains.
arXiv Detail & Related papers (2023-03-01T17:42:26Z)
Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm [4.128216503196621]
We propose an On-policy Model-based Safe Deep RL algorithm in which we learn the transition dynamics of the environment in an online manner. We show that our algorithm is more sample efficient and results in lower cumulative hazard violations as compared to constrained model-free approaches.
arXiv Detail & Related papers (2022-10-14T06:53:02Z)
Efficient Model-based Multi-agent Reinforcement Learning via Optimistic Equilibrium Computation [93.52573037053449]
H-MARL (Hallucinated Multi-Agent Reinforcement Learning) learns successful equilibrium policies after a few interactions with the environment. We demonstrate our approach experimentally on an autonomous driving simulation benchmark.
arXiv Detail & Related papers (2022-03-14T17:24:03Z)
On Effective Scheduling of Model-based Reinforcement Learning [53.027698625496015]
We propose a framework named AutoMBPO to automatically schedule the real data ratio. In this paper, we first theoretically analyze the role of real data in policy training, which suggests that gradually increasing the ratio of real data yields better performance.
arXiv Detail & Related papers (2021-11-16T15:24:59Z)
COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable. We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions. We find that COMBO consistently performs as well or better as compared to prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
Model-based Reinforcement Learning for Semi-Markov Decision Processes with Neural ODEs [30.36381338938319]
We present two solutions for modeling continuous-time dynamics using neural ordinary differential equations (ODEs) Our models accurately characterize continuous-time dynamics and enable us to develop high-performing policies using a small amount of data. We experimentally demonstrate the efficacy of our methods across various continuous-time domains.
arXiv Detail & Related papers (2020-06-29T17:21:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.