FORLER: Federated Offline Reinforcement Learning with Q-Ensemble and Actor Rectification
- URL: http://arxiv.org/abs/2602.02055v1
- Date: Mon, 02 Feb 2026 12:57:09 GMT
- Title: FORLER: Federated Offline Reinforcement Learning with Q-Ensemble and Actor Rectification
- Authors: Nan Qiao, Sheng Yue
- Abstract summary: In Internet-of-Things systems, federated learning has advanced online reinforcement learning (RL) by enabling parallel policy training without sharing raw data. We present FORLER, combining Q-ensemble aggregation on the server with actor rectification on devices. The server robustly merges device Q-functions to curb policy pollution and shift heavy computation off resource-constrained hardware without compromising privacy.
- Score: 5.423004756752519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In Internet-of-Things systems, federated learning has advanced online reinforcement learning (RL) by enabling parallel policy training without sharing raw data. However, interacting with real environments online can be risky and costly, motivating offline federated RL (FRL), where local devices learn from fixed datasets. Despite its promise, offline FRL may break down under low-quality, heterogeneous data. Offline RL tends to get stuck in local optima, and in FRL, one device's suboptimal policy can degrade the aggregated model, i.e., policy pollution. We present FORLER, combining Q-ensemble aggregation on the server with actor rectification on devices. The server robustly merges device Q-functions to curb policy pollution and shift heavy computation off resource-constrained hardware without compromising privacy. Locally, actor rectification enriches policy gradients via a zeroth-order search for high-Q actions plus a bespoke regularizer that nudges the policy toward them. A $\delta$-periodic strategy further reduces local computation. We theoretically provide safe policy improvement performance guarantees. Extensive experiments show FORLER consistently outperforms strong baselines under varying data quality and heterogeneity.
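A minimal PyTorch sketch of the two components the abstract describes. The aggregation rule, function names, and hyperparameters (n_perturb, sigma, reg_weight) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def aggregate_q_ensemble(q_values: torch.Tensor) -> torch.Tensor:
    """Server-side sketch: merge per-device Q-estimates for a shared
    (state, action) batch. A pessimistic min/mean mixture is one plausible
    robust rule; the paper's exact aggregation may differ.
    q_values: [n_devices, batch] tensor."""
    return 0.5 * q_values.min(dim=0).values + 0.5 * q_values.mean(dim=0)

def rectification_loss(actor, q_fn, states, n_perturb=8, sigma=0.1, reg_weight=1.0):
    """Device-side sketch of actor rectification: a zeroth-order (gradient-free)
    search perturbs the actor's action and keeps the highest-Q candidate, then
    a regularizer nudges the policy toward that action."""
    with torch.no_grad():
        base = actor(states)                                  # [B, act_dim]
        noise = sigma * torch.randn(n_perturb, *base.shape)   # [K, B, act_dim]
        cands = (base.unsqueeze(0) + noise).clamp(-1.0, 1.0)  # perturbed actions
        q = torch.stack([q_fn(states, a) for a in cands])     # [K, B]
        best = cands[q.argmax(dim=0), torch.arange(len(states))]
    return reg_weight * ((actor(states) - best) ** 2).mean()

# Toy usage with linear stand-ins for the actor and critic.
actor = nn.Sequential(nn.Linear(3, 2), nn.Tanh())
q_fn = lambda s, a: -((a - 0.3) ** 2).sum(-1)  # Q peaks at action 0.3
loss = rectification_loss(actor, q_fn, torch.randn(16, 3))
loss.backward()
```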
Related papers
- General Flexible $f$-divergence for Challenging Offline RL Datasets with Low Stochasticity and Diverse Behavior Policies [4.098989232625628]
Offline RL algorithms aim to improve upon the behavior policy that produced the collected data while constraining the learned policy to remain within the support of the dataset. We introduce a general flexible function formulation of the $f$-divergence that imposes an adaptive constraint on the learning objective, tailored to the offline training dataset. Experiments on the MuJoCo, Fetch, and AdroitHand environments confirm the correctness of the proposed LP form and show that the flexible $f$-divergence can improve performance on challenging datasets when paired with a compatible constrained optimization algorithm.
arXiv Detail & Related papers (2026-02-11T17:53:49Z)
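As context for the flexible $f$-divergence constraint above, a hedged sketch of how a pluggable generator function $f$ yields different divergence penalties from policy/behavior density ratios; the paper's adaptive selection rule is not reproduced, only gestured at by the `kind` argument.

```python
import numpy as np

# Pluggable generator functions f for the f-divergence
#   D_f(pi || beta) = E_{a ~ beta}[ f( pi(a|s) / beta(a|s) ) ]
F_GENERATORS = {
    "kl": lambda r: r * np.log(r),        # forward KL
    "reverse_kl": lambda r: -np.log(r),   # KL in the other direction
    "chi2": lambda r: (r - 1.0) ** 2,     # Pearson chi-squared
}

def f_divergence_penalty(ratio: np.ndarray, kind: str = "chi2") -> float:
    """Monte-Carlo estimate of D_f from density ratios pi/beta evaluated on
    dataset actions. Choosing `kind` per dataset is a stand-in for the
    paper's adaptive, flexible f."""
    r = np.clip(ratio, 1e-8, None)  # guard the log and ratio numerics
    return float(np.mean(F_GENERATORS[kind](r)))

# Ratios near 1 mean the learned policy stays in-support of the data.
ratios = np.array([0.9, 1.1, 1.0, 0.5, 2.0])
print(f_divergence_penalty(ratios, "kl"), f_divergence_penalty(ratios, "chi2"))
```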
- Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning [24.46783760408068]
Offline reinforcement learning (RL) enables learning effective policies from fixed datasets without any environment interaction. Existing methods typically employ policy constraints to mitigate the distribution shift encountered during offline RL training. We propose Adaptive Scaling of Policy Constraints (ASPC), a second-order differentiable framework that dynamically balances RL and behavior cloning (BC) during training.
arXiv Detail & Related papers (2025-08-27T14:00:18Z)
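ASPC's second-order differentiable procedure is not reproduced here; as a hedged stand-in for the balancing idea, a TD3+BC-style loss where the RL term is adaptively rescaled against the BC constraint:

```python
import torch

def adaptive_rl_bc_loss(actor, critic, states, dataset_actions, alpha=2.5):
    """Hedged sketch of an RL + BC trade-off in the spirit of ASPC: the Q
    term is rescaled by the current Q magnitude so the constraint weight
    stays comparable across training stages. `alpha` is illustrative."""
    pi = actor(states)
    q = critic(states, pi)
    lam = alpha / q.abs().mean().detach()            # adaptive scale on RL term
    rl_loss = -lam * q.mean()                        # policy improvement
    bc_loss = ((pi - dataset_actions) ** 2).mean()   # policy constraint
    return rl_loss + bc_loss
```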
- Double Check My Desired Return: Transformer with Target Alignment for Offline Reinforcement Learning [64.6334337560557]
Reinforcement learning via supervised learning (RvS) frames offline RL as a sequence modeling task. The Decision Transformer (DT) struggles to reliably align the actually achieved returns with the specified target returns. We propose Doctor, a novel approach that Double Checks the Transformer with target alignment for offline RL.
arXiv Detail & Related papers (2025-08-22T14:30:53Z)
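To make the alignment problem above concrete, a small sketch that measures how far a return-conditioned policy's achieved returns land from the requested targets; `rollout_fn` is an assumed environment-rollout helper, and the mock policy is purely illustrative.

```python
import numpy as np

def return_alignment_gap(rollout_fn, targets, n_eval=10):
    """Condition a return-conditioned policy on each target return, roll it
    out, and measure how far achieved returns land from the request; this is
    the misalignment Doctor aims to reduce."""
    gaps = []
    for g in targets:
        achieved = np.mean([rollout_fn(g) for _ in range(n_eval)])
        gaps.append(abs(achieved - g))
    return np.array(gaps)

# Toy stand-in: a policy that systematically undershoots large targets.
rng = np.random.default_rng(0)
mock_rollout = lambda g: min(g, 50.0) + rng.normal(0, 1)
print(return_alignment_gap(mock_rollout, targets=[10.0, 40.0, 80.0]))
```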
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error. In our experiments, POP-QL not only shows competitive performance on standard benchmarks but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
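A hedged sketch of the reweighting ingredient named in the summary: per-sample weights inside the Bellman error reshape the off-policy data distribution (POP-QL derives them via a projection, which is not reproduced here; uniform weights recover ordinary TD learning).

```python
import torch

def reweighted_td_loss(critic, target_critic, batch, weights, gamma=0.99):
    """Weighted Bellman regression. `batch` is an assumed tuple of
    (state, action, reward, next_state, next_action) tensors; `weights`
    stands in for POP-QL's projection-derived sample weights."""
    s, a, r, s2, a2 = batch
    with torch.no_grad():
        target = r + gamma * target_critic(s2, a2)
    td_err = (critic(s, a) - target) ** 2
    return (weights * td_err).mean()
```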
- Offline Data Enhanced On-Policy Policy Gradient with Provable Guarantees [23.838354396418868]
We propose a new hybrid RL algorithm that combines an on-policy actor-critic method with offline data. Our approach integrates a procedure of off-policy training on the offline data into an on-policy NPG framework.
arXiv Detail & Related papers (2023-11-14T18:45:56Z)
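A minimal sketch of the hybrid idea under an assumed batch layout: the critic regresses on a union of online and offline transitions, while the on-policy actor step (e.g., the NPG update on freshly collected data) is left out.

```python
import torch

def hybrid_critic_update(critic, opt, online_batch, offline_batch, gamma=0.99):
    """Both batches are assumed tuples of (state, action, reward, next_state,
    next_action) tensors; concatenating them lets the critic exploit the
    fixed offline dataset alongside on-policy samples."""
    s, a, r, s2, a2 = [torch.cat([on, off]) for on, off in
                       zip(online_batch, offline_batch)]
    with torch.no_grad():
        target = r + gamma * critic(s2, a2)
    loss = ((critic(s, a) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```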
- Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online Reinforcement Learning [71.02384943570372]
Family Offline-to-Online RL (FamO2O) is a framework that empowers existing algorithms to determine state-adaptive improvement-constraint balances. FamO2O offers a statistically significant improvement over various existing methods, achieving state-of-the-art performance on the D4RL benchmark.
arXiv Detail & Related papers (2023-10-27T08:30:54Z)
- Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias [96.14064037614942]
Offline retraining, a policy extraction step at the end of online fine-tuning, is proposed.
An optimistic (exploration) policy is used to interact with the environment, and a separate pessimistic (exploitation) policy is trained on all the observed data for evaluation.
arXiv Detail & Related papers (2023-10-12T17:50:09Z)
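A sketch of the decoupled extraction step described above, assuming advantage-filtered behavior cloning as one plausible pessimistic extraction rule; the optimistic data-collection policy is not shown.

```python
import torch

def offline_retraining(pessimistic_actor, opt, replay_states, replay_actions,
                       advantage, n_steps=1000, batch_size=256):
    """Re-extract a pessimistic (exploitation) policy from *all* observed
    data at the end of online fine-tuning. The advantage filter and MSE
    cloning objective are illustrative assumptions."""
    keep = advantage > 0                      # imitate only promising actions
    states, actions = replay_states[keep], replay_actions[keep]
    for _ in range(n_steps):
        idx = torch.randint(len(states), (batch_size,))
        loss = ((pessimistic_actor(states[idx]) - actions[idx]) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
```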
- Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints [82.43359506154117]
We show that typical offline reinforcement learning methods fail to learn from data with non-uniform variability. Our method is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
arXiv Detail & Related papers (2022-11-02T11:36:06Z)
- Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning [80.25648265273155]
Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment.
During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data.
We propose to adaptively weigh the behavior cloning loss during online fine-tuning based on the agent's performance and training stability.
Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark.
arXiv Detail & Related papers (2022-10-25T09:08:26Z)
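A toy illustration of adaptively weighing the BC loss during fine-tuning; the return-trend heuristic and step size are assumptions standing in for the paper's performance-and-stability rule.

```python
import numpy as np

def update_bc_weight(weight, recent_returns, lo=0.0, hi=1.0, step=0.05):
    """Raise the behavior-cloning weight when fine-tuning becomes unstable
    (returns trending down) and relax it as performance improves."""
    half = len(recent_returns) // 2
    trend = np.mean(recent_returns[half:]) - np.mean(recent_returns[:half])
    weight += step if trend < 0 else -step  # tighten on collapse, relax on gains
    return float(np.clip(weight, lo, hi))

# Falling returns push the constraint up; rising returns relax it.
print(update_bc_weight(0.5, [100, 95, 90, 80]))   # -> 0.55
print(update_bc_weight(0.5, [80, 90, 95, 100]))   # -> 0.45
```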
- A Maintenance Planning Framework using Online and Offline Deep Reinforcement Learning [4.033107207078282]
This paper develops a deep reinforcement learning (DRL) solution to automatically determine an optimal rehabilitation policy for deteriorating water pipes. We train the agent with deep Q-learning (DQN) to learn a policy that minimizes average cost and reduces failure probability.
We demonstrate that DRL-based policies improve over standard preventive, corrective, and greedy planning alternatives.
arXiv Detail & Related papers (2022-08-01T12:41:06Z)
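A compact DQN sketch for a maintenance MDP of this kind; the state features, action set, and network size are assumptions, and rewards are taken as negative costs so minimizing cost matches the stated objective.

```python
import torch
import torch.nn as nn

class PipeDQN(nn.Module):
    """Toy Q-network: the state might encode pipe age and condition, the
    discrete actions e.g. {do nothing, repair, replace}. These semantics
    are illustrative, not the paper's exact formulation."""
    def __init__(self, state_dim=4, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))
    def forward(self, s):
        return self.net(s)

def dqn_loss(q, q_target, s, a, cost, s2, gamma=0.99):
    """Standard DQN regression; negating the cost makes the greedy policy
    minimize expected discounted maintenance cost."""
    with torch.no_grad():
        target = -cost + gamma * q_target(s2).max(dim=1).values
    pred = q(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return nn.functional.mse_loss(pred, target)
```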
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset without further interaction with the environment. We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy to imitate adaptive neighboring policies with higher returns.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids merely learning mediocre behavior on mixed datasets but is also competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
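A hedged sketch of the experience-picking idea: keep trajectories that are close to the current policy yet achieve higher return, then imitate them, climbing through a curriculum of slightly better neighbors. The closeness score below is an illustrative stand-in for COIL's criterion.

```python
import numpy as np

def pick_neighboring_trajectories(trajectories, policy_logprob, current_return,
                                  top_k=10):
    """`trajectories` is an assumed list of dicts with "obs", "act", and
    "return" keys; `policy_logprob(s, a)` scores how likely the current
    policy is to take action a in state s."""
    better = [t for t in trajectories if t["return"] > current_return]
    scored = sorted(better,
                    key=lambda t: -np.mean([policy_logprob(s, a)
                                            for s, a in zip(t["obs"], t["act"])]))
    return scored[:top_k]  # behavior-clone these, then re-pick as the policy improves
```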