Online Finetuning Decision Transformers with Pure RL Gradients
- URL: http://arxiv.org/abs/2601.00167v1
- Date: Thu, 01 Jan 2026 02:17:18 GMT
- Title: Online Finetuning Decision Transformers with Pure RL Gradients
- Authors: Junkai Luo, Yinglun Zhu
- Abstract summary: Decision Transformers (DTs) have emerged as a powerful framework for sequential decision making. We propose new algorithms that enable online finetuning of Decision Transformers using pure reinforcement learning gradients.
- Score: 11.215352918313577
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Decision Transformers (DTs) have emerged as a powerful framework for sequential decision making by formulating offline reinforcement learning (RL) as a sequence modeling problem. However, extending DTs to online settings with pure RL gradients remains largely unexplored, as existing approaches continue to rely heavily on supervised sequence-modeling objectives during online finetuning. We identify hindsight return relabeling -- a standard component in online DTs -- as a critical obstacle to RL-based finetuning: while beneficial for supervised learning, it is fundamentally incompatible with importance sampling-based RL algorithms such as GRPO, leading to unstable training. Building on this insight, we propose new algorithms that enable online finetuning of Decision Transformers using pure reinforcement learning gradients. We adapt GRPO to DTs and introduce several key modifications, including sub-trajectory optimization for improved credit assignment, sequence-level likelihood objectives for enhanced stability and efficiency, and active sampling to encourage exploration in uncertain regions. Through extensive experiments, we demonstrate that our methods outperform existing online DT baselines and achieve new state-of-the-art performance across multiple benchmarks, highlighting the effectiveness of pure-RL-based online finetuning for Decision Transformers.
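For illustration, the sketch below shows how a GRPO-style, sequence-level clipped objective with group-relative advantages might be adapted to a Decision Transformer policy. This is a reconstruction from the abstract alone, not the authors' released code: the function name, tensor shapes, and clipping constant are assumptions, and the paper's sub-trajectory optimization and active-sampling components are omitted for brevity.

```python
import torch

def grpo_sequence_loss(logp_new, logp_old, returns, clip_eps=0.2):
    """Hypothetical GRPO-style, sequence-level surrogate loss for a DT policy.

    logp_new: (G, T) log-probs of the taken actions under the current policy.
    logp_old: (G, T) log-probs under the (frozen) policy that sampled the group.
    returns:  (G,)   total return of each of the G trajectories in the group.
    """
    # Group-relative advantage: standardize returns within the sampled group,
    # so no learned critic or value baseline is required.
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)        # (G,)

    # Sequence-level importance ratio: sum the per-step log-prob differences
    # over the whole (sub-)trajectory, then exponentiate once per trajectory.
    ratio = torch.exp((logp_new - logp_old).sum(dim=-1))             # (G,)

    # Clipped surrogate objective (to be maximized), returned as a loss.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```

Summing the per-step log-probability differences over the trajectory before exponentiating gives a single importance ratio per sequence, which is one plausible reading of the abstract's "sequence-level likelihood objective"; consistent with the abstract's observation that hindsight return relabeling conflicts with importance sampling, no relabeling is applied to the sampled returns in this sketch.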
Related papers
- IPD: Boosting Sequential Policy with Imaginary Planning Distillation in Offline Reinforcement Learning [13.655904209137006]
We propose Imaginary Planning Distillation (IPD), a novel framework that seamlessly incorporates offline planning into data generation, supervised training, and online inference. Our framework first learns a world model equipped with uncertainty measures and a quasi-optimal value function from the offline data. By replacing the conventional, manually tuned return-to-go with the learned quasi-optimal value function, IPD improves both decision-making stability and performance during inference.
arXiv Detail & Related papers (2026-03-04T17:05:39Z) - From Static to Dynamic: Enhancing Offline-to-Online Reinforcement Learning via Energy-Guided Diffusion Stratification [3.2883573376133555]
StratDiff employs a diffusion model to learn prior knowledge from the offline dataset. It refines this knowledge through energy-based functions to improve policy imitation and generate offline-like actions during online fine-tuning. Offline-like samples are updated using offline objectives, while online-like samples follow online learning strategies.
arXiv Detail & Related papers (2025-11-05T19:48:46Z) - Large Language Model-Empowered Decision Transformer for UAV-Enabled Data Collection [71.84636717632206]
Using unmanned aerial vehicles (UAVs) for reliable and energy-efficient data collection from spatially distributed devices holds great promise for supporting Internet of Things (IoT) applications. We propose a large language model (LLM)-empowered decision transformer (LLM-CRDT) to learn effective UAV control policies. LLM-CRDT outperforms benchmark online and offline methods, achieving up to 36.7% higher energy efficiency than current state-of-the-art DT approaches.
arXiv Detail & Related papers (2025-09-17T13:05:08Z) - TemplateRL: Structured Template-Guided Reinforcement Learning for LLM Reasoning [56.250782426571526]
Reinforcement learning (RL) has emerged as an effective paradigm for enhancing model reasoning. We propose a structured template-guided RL framework that augments policy optimization with explicit template guidance. Our approach first constructs a problem-solving template library via MCTS on a small seed set, then seamlessly integrates this high-level structured guidance into RL training.
arXiv Detail & Related papers (2025-05-21T16:06:10Z) - Reinforcement Learning Gradients as Vitamin for Online Finetuning Decision Transformers [111.78179839856293]
Decision Transformers have emerged as a compelling paradigm for offline Reinforcement Learning (RL).
Online finetuning of decision transformers has been surprisingly under-explored.
We find that simply adding TD3 gradients to the finetuning process of ODT effectively improves the online finetuning performance of ODT.
arXiv Detail & Related papers (2024-10-31T16:38:51Z) - MOTO: Offline Pre-training to Online Fine-tuning for Model-based Robot Learning [52.101643259906915]
We study the problem of offline pre-training and online fine-tuning for reinforcement learning from high-dimensional observations.
Existing model-based offline RL methods are not suitable for offline-to-online fine-tuning in high-dimensional domains.
We propose an on-policy model-based method that can efficiently reuse prior data through model-based value expansion and policy regularization.
arXiv Detail & Related papers (2024-01-06T21:04:31Z) - Rethinking Decision Transformer via Hierarchical Reinforcement Learning [54.3596066989024]
Decision Transformer (DT) is an innovative algorithm leveraging recent advances of the transformer architecture in reinforcement learning (RL).
We introduce a general sequence modeling framework for studying sequential decision making through the lens of Hierarchical RL.
We show DT emerges as a special case of this framework with certain choices of high-level and low-level policies, and discuss the potential failure of these choices.
arXiv Detail & Related papers (2023-11-01T03:32:13Z) - ENOTO: Improving Offline-to-Online Reinforcement Learning with Q-Ensembles [52.34951901588738]
We propose a novel framework called ENsemble-based Offline-To-Online (ENOTO) RL.
By increasing the number of Q-networks, we seamlessly bridge offline pre-training and online fine-tuning without degrading performance.
Experimental results demonstrate that ENOTO can substantially improve the training stability, learning efficiency, and final performance of existing offline RL methods.
arXiv Detail & Related papers (2023-06-12T05:10:10Z) - Online Decision Transformer [30.54774566089644]
Offline reinforcement learning (RL) can be formulated as a sequence modeling problem.
Online Decision Transformer (ODT) is an RL algorithm based on sequence modeling that blends offline pretraining with online finetuning.
arXiv Detail & Related papers (2022-02-11T13:43:24Z) - MOORe: Model-based Offline-to-Online Reinforcement Learning [26.10368749930102]
We propose a model-based Offline-to-Online Reinforcement Learning (MOORe) algorithm.
Experimental results show that our algorithm smoothly transfers from offline to online stages while enabling sample-efficient online adaptation.
arXiv Detail & Related papers (2022-01-25T03:14:57Z) - Two-stage Deep Reinforcement Learning for Inverter-based Volt-VAR Control in Active Distribution Networks [3.260913246106564]
We propose a novel two-stage deep reinforcement learning (DRL) method to improve the voltage profile by regulating inverter-based energy resources.
In the offline stage, a highly efficient adversarial reinforcement learning algorithm is developed to train an offline agent robust to the model mismatch.
In the subsequent online stage, we safely transfer the offline agent as the online agent, which performs continual learning and control with significantly improved safety and efficiency.
arXiv Detail & Related papers (2020-05-20T08:02:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.