Reinforcement Learning enhanced Online Adaptive Clinical Decision Support via Digital Twin powered Policy and Treatment Effect optimized Reward
- URL: http://arxiv.org/abs/2508.17212v1
- Date: Sun, 24 Aug 2025 04:51:22 GMT
- Title: Reinforcement Learning enhanced Online Adaptive Clinical Decision Support via Digital Twin powered Policy and Treatment Effect optimized Reward
- Authors: Xinyu Qin, Ruiheng Yu, Lu Wang
- Abstract summary: We present an online adaptive tool where reinforcement learning provides the policy, a patient digital twin provides the environment, and treatment effect defines the reward. Experiments in a synthetic clinical simulator show low latency, stable throughput, a low expert query rate at fixed safety, and improved return against standard value-based baselines.
- Score: 3.3025649517524793
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clinical decision support must adapt online under safety constraints. We present an online adaptive tool where reinforcement learning provides the policy, a patient digital twin provides the environment, and treatment effect defines the reward. The system initializes a batch-constrained policy from retrospective data and then runs a streaming loop that selects actions, checks safety, and queries experts only when uncertainty is high. Uncertainty comes from a compact ensemble of five Q-networks via the coefficient of variation of action values with a $\tanh$ compression. The digital twin updates the patient state with a bounded residual rule. The outcome model estimates immediate clinical effect, and the reward is the treatment effect relative to a conservative reference with a fixed z-score normalization from the training split. Online updates operate on recent data with short runs and exponential moving averages. A rule-based safety gate enforces vital ranges and contraindications before any action is applied. Experiments in a synthetic clinical simulator show low latency, stable throughput, a low expert query rate at fixed safety, and improved return against standard value-based baselines. The design turns an offline policy into a continuous, clinician-supervised system with clear controls and fast adaptation.
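To make the moving parts concrete, here is a minimal Python sketch of how the described loop could fit together: ensemble uncertainty via the coefficient of variation with tanh compression, expert deferral above a threshold, a rule-based safety gate, a bounded residual twin update, and a z-score-normalized treatment-effect reward. All function names, shapes, and thresholds (e.g. `tau`, `bound`) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def ensemble_uncertainty(q_values: np.ndarray) -> float:
    """Coefficient of variation of one action's value across the five
    Q-networks, squashed into [0, 1) with tanh. The epsilon guard is
    an assumption."""
    cv = np.std(q_values) / (abs(np.mean(q_values)) + 1e-8)
    return float(np.tanh(cv))

def twin_step(state: np.ndarray, residual: np.ndarray,
              bound: float = 0.1) -> np.ndarray:
    """Bounded residual rule: the digital twin advances the patient
    state by a clipped residual. The bound value is a placeholder."""
    return state + np.clip(residual, -bound, bound)

def reward(effect: float, reference: float, mu: float, sigma: float) -> float:
    """Treatment effect relative to a conservative reference, z-scored
    with (mu, sigma) frozen from the training split."""
    return ((effect - reference) - mu) / sigma

def select_action(q_per_net: np.ndarray, state, is_safe, query_expert,
                  tau: float = 0.5) -> int:
    """One step of the streaming loop: greedy action from the ensemble
    mean, deferral to an expert when uncertainty exceeds tau, and a
    rule-based safety gate (vital ranges, contraindications) before
    any action is applied. tau and the gate are assumptions."""
    action = int(np.argmax(q_per_net.mean(axis=0)))   # q_per_net: (5, n_actions)
    if ensemble_uncertainty(q_per_net[:, action]) > tau:
        action = query_expert(state)
    if not is_safe(state, action):
        action = query_expert(state)
    return action
```

The fixed (mu, sigma) pair mirrors the abstract's point that reward normalization statistics come from the training split rather than being re-estimated online.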
Related papers
- Q-learning with Adjoint Matching [58.78551025170267]
We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm. QAM sidesteps two challenges by leveraging adjoint matching, a recently proposed technique in generative modeling. It consistently outperforms prior approaches on hard, sparse-reward tasks in both offline and offline-to-online RL.
arXiv Detail & Related papers (2026-01-20T18:45:34Z)
- Benchmarking Offline Multi-Objective Reinforcement Learning in Critical Care [0.07161783472741748]
In critical care settings, clinicians face the challenge of balancing conflicting objectives, primarily maximizing patient survival while minimizing resource utilization. Single-objective reinforcement learning approaches typically address this by optimizing a fixed scalarized reward function, resulting in rigid policies that fail to adapt to varying clinical priorities. This paper benchmarks three offline multi-objective reinforcement learning (MORL) algorithms against three single-objective baselines on the MIMIC-IV dataset.
arXiv Detail & Related papers (2025-12-08T20:09:15Z)
- Guardian-regularized Safe Offline Reinforcement Learning for Smart Weaning of Mechanical Circulatory Devices [19.512275639322638]
We study the sequential decision-making problem for automated weaning of mechanical circulatory support devices in cardiogenic shock patients. Offline reinforcement learning has proven successful in sequential decision-making tasks. We develop an end-to-end machine learning framework with two key contributions.
arXiv Detail & Related papers (2025-11-08T19:32:31Z)
- Behavior-Adaptive Q-Learning: A Unifying Framework for Offline-to-Online RL [3.2883573376133555]
We introduce Behavior-Adaptive Q-Learning (BAQ), a framework designed to enable a smooth transition from offline to online RL. BAQ incorporates a dual-objective loss that (i) aligns the online policy toward the offline behavior when uncertainty is high, and (ii) gradually relaxes this constraint as more confident online experience is accumulated. Across standard benchmarks, BAQ consistently outperforms prior offline-to-online RL approaches, achieving faster recovery, improved robustness, and higher overall performance.
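A rough sketch of such a dual-objective loss: a TD term plus a behavior-cloning term whose weight rises with uncertainty and decays as online experience accumulates. The schedule, the uncertainty proxy, and the use of Q-values as action logits are illustrative assumptions, not the paper's construction.

```python
import math
import torch
import torch.nn.functional as F

def baq_loss(q_net, target_net, batch, behavior_policy,
             uncertainty: float, step: int,
             decay: float = 1e-4, gamma: float = 0.99) -> torch.Tensor:
    """Dual-objective loss in the spirit of BAQ: TD error plus a
    behavior-cloning term scaled by uncertainty and an exponential
    decay over online steps (both assumptions)."""
    s, a, r, s2, done = batch                       # tensors; done in {0, 1}
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    td_loss = F.mse_loss(q_sa, target)

    # Align toward the offline behavior policy's actions.
    bc_loss = F.cross_entropy(q_net(s), behavior_policy(s).argmax(dim=1))

    lam = uncertainty * math.exp(-decay * step)     # constraint relaxes over time
    return td_loss + lam * bc_loss
```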
arXiv Detail & Related papers (2025-11-05T18:20:23Z)
- Double Check My Desired Return: Transformer with Target Alignment for Offline Reinforcement Learning [63.79928625391378]
Offline reinforcement learning has achieved significant advances in domains such as robotic control, autonomous driving, and medical decision-making. We propose Doctor, a novel approach that Double Checks the Transformer with target alignment for Offline RL.
arXiv Detail & Related papers (2025-08-22T14:30:53Z)
- Towards Regulatory-Confirmed Adaptive Clinical Trials: Machine Learning Opportunities and Solutions [59.28853595868749]
We introduce two new objectives for future clinical trials that integrate regulatory constraints and treatment policy value for both the entire population and under-served populations. We formulate Randomize First Augment Next (RFAN), a new framework for designing Phase III clinical trials. Our framework consists of a standard randomized component followed by an adaptive one, jointly designed to efficiently and safely acquire and assign patients into treatment arms during the trial.
arXiv Detail & Related papers (2025-03-12T10:17:54Z)
- Distribution-Free Uncertainty Quantification in Mechanical Ventilation Treatment: A Conformal Deep Q-Learning Framework [2.5070297884580874]
This study introduces ConformalDQN, a distribution-free conformal deep Q-learning approach for optimizing mechanical ventilation in intensive care units. We trained and evaluated our model using ICU patient records from the MIMIC-IV database.
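As a distribution-free building block, split conformal calibration can wrap any point predictor; a minimal sketch follows (its exact pairing with deep Q-learning in the paper may differ).

```python
import numpy as np

def conformal_quantile(cal_residuals: np.ndarray, alpha: float = 0.1) -> float:
    """Split-conformal radius: the ceil((n+1)(1-alpha))-th smallest
    calibration residual |y - q_hat| gives distribution-free coverage
    under exchangeability. Capping the index at n (instead of an
    infinite interval) is a simplification."""
    n = len(cal_residuals)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    return float(np.sort(cal_residuals)[k - 1])

# Usage sketch: with r = conformal_quantile(cal_residuals), the interval
# [q_hat - r, q_hat + r] covers the true value with probability >= 1 - alpha.
```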
arXiv Detail & Related papers (2024-12-17T06:55:20Z)
- Differentially Private Distributed Inference [2.4401219403555814]
Healthcare centers collaborating on clinical trials must balance knowledge sharing with safeguarding sensitive patient data. We address this challenge by using differential privacy (DP) to control information leakage. Agents update belief statistics via log-linear rules, and DP noise provides plausible deniability and rigorous performance guarantees.
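The summary's log-linear rule with DP noise might look roughly like the sketch below; the weighting scheme, noise placement, and scale are assumptions.

```python
import numpy as np

def dp_loglinear_update(neighbor_beliefs: np.ndarray, weights: np.ndarray,
                        log_likelihood: np.ndarray,
                        epsilon: float = 1.0) -> np.ndarray:
    """One agent's log-linear belief update over hypotheses: a weighted
    geometric average of neighbors' beliefs combined with the local
    log-likelihood, plus Laplace noise in log-space for DP.
    neighbor_beliefs: (n_neighbors, n_hypotheses); weights sum to 1."""
    log_b = weights @ np.log(neighbor_beliefs) + log_likelihood
    log_b += np.random.laplace(scale=1.0 / epsilon, size=log_b.shape)
    b = np.exp(log_b - log_b.max())   # stabilize before normalizing
    return b / b.sum()
```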
arXiv Detail & Related papers (2024-02-13T01:38:01Z)
- Pruning the Way to Reliable Policies: A Multi-Objective Deep Q-Learning Approach to Critical Care [46.2482873419289]
We introduce a deep Q-learning approach to obtain more reliable critical care policies.
We evaluate our method in off-policy and offline settings using simulated environments and real health records from intensive care units.
arXiv Detail & Related papers (2023-06-13T18:02:57Z)
- Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning [80.25648265273155]
Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment.
During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data.
We propose to adaptively weigh the behavior cloning loss during online fine-tuning based on the agent's performance and training stability.
Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark.
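One way to realize "adaptively weigh the behavior cloning loss based on the agent's performance and training stability" is a simple feedback rule like the sketch below; the window sizes, instability test, and update rule are assumptions.

```python
import numpy as np

def adapt_bc_weight(weight: float, returns: list, td_losses: list,
                    rate: float = 0.05, lo: float = 0.0,
                    hi: float = 1.0) -> float:
    """Nudge the behavior-cloning coefficient: up when recent returns
    fall or the TD loss looks volatile, down when training is healthy.
    Assumes at least 20 logged returns; windows and thresholds are
    illustrative."""
    recent, past = np.mean(returns[-10:]), np.mean(returns[-20:-10])
    unstable = np.std(td_losses[-10:]) > np.mean(td_losses[-10:])
    if recent < past or unstable:
        weight += rate    # lean harder on the offline behavior
    else:
        weight -= rate    # trust the online policy more
    return float(np.clip(weight, lo, hi))
```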
arXiv Detail & Related papers (2022-10-25T09:08:26Z)
- Federated Offline Reinforcement Learning [55.326673977320574]
We propose a multi-site Markov decision process model that allows for both homogeneous and heterogeneous effects across sites.
We design the first federated policy optimization algorithm for offline RL with sample complexity guarantees.
We give a theoretical guarantee for the proposed algorithm, where the suboptimality of the learned policies is comparable to the rate achievable as if the data were not distributed.
arXiv Detail & Related papers (2022-06-11T18:03:26Z)
- Post-Contextual-Bandit Inference [57.88785630755165]
Contextual bandit algorithms are increasingly replacing non-adaptive A/B tests in e-commerce, healthcare, and policymaking.
They can both improve outcomes for study participants and increase the chance of identifying good, or even the best, policies.
To support credible inference on novel interventions at the end of the study, we still want to construct valid confidence intervals on average treatment effects, subgroup effects, or the value of new policies.
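For contrast, the naive inverse-propensity-weighted value estimate and normal CI below are valid only when propensities are fixed in advance; data logged adaptively by a bandit violates this, which is the gap post-contextual-bandit inference targets. The interface is an assumption, with probabilities evaluated at the logged actions.

```python
import numpy as np

def ipw_value_ci(rewards: np.ndarray, propensities: np.ndarray,
                 policy_probs: np.ndarray, z: float = 1.96):
    """Naive IPW estimate of a target policy's value with a
    normal-approximation confidence interval. Assumes fixed (non-adaptive)
    logging propensities; adaptive collection breaks the normal
    approximation."""
    terms = (policy_probs / propensities) * rewards
    est = terms.mean()
    se = terms.std(ddof=1) / np.sqrt(len(terms))
    return est, (est - z * se, est + z * se)
```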
arXiv Detail & Related papers (2021-06-01T12:01:51Z)
- DTR Bandit: Learning to Make Response-Adaptive Decisions With Low Regret [59.81290762273153]
Dynamic treatment regimes (DTRs) are personalized, adaptive, multi-stage treatment plans that adapt treatment decisions to an individual's initial features and to intermediate outcomes and features at each subsequent stage.
We propose a novel algorithm that, by carefully balancing exploration and exploitation, is guaranteed to achieve rate-optimal regret when the transition and reward models are linear.
arXiv Detail & Related papers (2020-05-06T13:03:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.