Related papers: Optimistic Transfer under Task Shift via Bellman Alignment

Optimistic Transfer under Task Shift via Bellman Alignment

URL: http://arxiv.org/abs/2601.21924v1
Date: Thu, 29 Jan 2026 16:16:24 GMT
Title: Optimistic Transfer under Task Shift via Bellman Alignment
Authors: Jinhang Chai, Enpei Zhang, Elynn Chen, Yujun Yan,
Abstract summary: We study online transfer reinforcement learning (RL) in episodic Markov decision processes.<n>We identify one-step Bellman alignment as the correct abstraction for transfer in online RL.<n>We propose re-weighted targeting (RWT), an operator-level correction that retargets continuation values and compensates for transition mismatch.
Score: 5.192817801536311
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study online transfer reinforcement learning (RL) in episodic Markov decision processes, where experience from related source tasks is available during learning on a target task. A fundamental difficulty is that task similarity is typically defined in terms of rewards or transitions, whereas online RL algorithms operate on Bellman regression targets. As a result, naively reusing source Bellman updates introduces systematic bias and invalidates regret guarantees. We identify one-step Bellman alignment as the correct abstraction for transfer in online RL and propose re-weighted targeting (RWT), an operator-level correction that retargets continuation values and compensates for transition mismatch via a change of measure. RWT reduces task mismatch to a fixed one-step correction and enables statistically sound reuse of source data. This alignment yields a two-stage RWT $Q$-learning framework that separates variance reduction from bias correction. Under RKHS function approximation, we establish regret bounds that scale with the complexity of the task shift rather than the target MDP. Empirical results in both tabular and neural network settings demonstrate consistent improvements over single-task learning and naïve pooling, highlighting Bellman alignment as a model-agnostic transfer principle for online RL.

Related papers

SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training [54.8494905524997]
Decentralized training introduces critical security risks when executed across untrusted, geographically distributed nodes.<n>We propose SENTINEL, a verification mechanism for pipeline parallelism (PP) training without duplication.<n>Experiments demonstrate successful training of up to 4B- parameter LLMs across untrusted distributed environments with up to 176 workers while maintaining model convergence and performance.
arXiv Detail & Related papers (2026-03-03T23:51:10Z)
BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning [82.925106913459]
Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning.<n>We introduce BOTS, a unified framework for Bayesian Online Task Selection in RFT reinforcement finetuning.
arXiv Detail & Related papers (2025-10-30T11:15:23Z)
Agentic Reinforcement Learning with Implicit Step Rewards [92.26560379363492]
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL)<n>We introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms.<n>We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA.
arXiv Detail & Related papers (2025-09-23T16:15:42Z)
Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling [66.0871543682453]
We present Omni-Thinker, a unified reinforcement learning framework that scales large language models across diverse tasks.<n>Our scheduler orders tasks according to accuracy backward transfer (BWT), reducing forgetting and improving multi-task performance.
arXiv Detail & Related papers (2025-07-20T01:50:16Z)
REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.<n>Recent methods aim to mitigate misalignment by learning reward functions from human preferences.<n>We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
Revisiting the Robustness of the Minimum Error Entropy Criterion: A Transfer Learning Case Study [16.07380451502911]
This paper revisits the robustness of the minimum error entropy criterion to deal with non-Gaussian noises. We investigate its feasibility and usefulness in real-life transfer learning regression tasks, where distributional shifts are common.
arXiv Detail & Related papers (2023-07-17T15:38:11Z)
Meta-Learning Adversarial Bandits [49.094361442409785]
We study online learning with bandit feedback across multiple tasks, with the goal of improving average performance across tasks if they are similar according to some natural task-similarity measure. As the first to target the adversarial setting, we design a meta-algorithm that setting-specific guarantees for two important cases: multi-armed bandits (MAB) and bandit optimization (BLO) Our guarantees rely on proving that unregularized follow-the-leader combined with multiplicative weights is enough to online learn a non-smooth and non-B sequence.
arXiv Detail & Related papers (2022-05-27T17:40:32Z)
Transfer Q-learning [46.69861865164196]
Time-inhomogeneous finite-horizon Markov decision processes (MDP) are frequently employed to model decision-making in dynamic treatment regimes.<n>These fields, especially healthcare and business, often face challenges such as high-dimensional state spaces and time-inhomogeneity of the MDP process.<n>We investigate knowledge transfer within time-inhomogeneous finite-horizon MDP by leveraging data from both a target RL task and several related source tasks.
arXiv Detail & Related papers (2022-02-09T20:08:19Z)
Biologically Plausible Training Mechanisms for Self-Supervised Learning in Deep Networks [14.685237010856953]
We develop biologically plausible training mechanisms for self-supervised learning (SSL) in deep networks. We show that learning can be performed with one of two more plausible alternatives to backpagation.
arXiv Detail & Related papers (2021-09-30T12:56:57Z)
The Common Intuition to Transfer Learning Can Win or Lose: Case Studies for Linear Regression [26.5147705530439]
We define a transfer learning approach to the target task as a linear regression optimization with a regularization on the distance between the to-be-learned target parameters and the already-learned source parameters. We show that for sufficiently related tasks, the optimally tuned transfer learning approach can outperform the optimally tuned ridge regression method.
arXiv Detail & Related papers (2021-03-09T18:46:01Z)
Towards Accurate Knowledge Transfer via Target-awareness Representation Disentanglement [56.40587594647692]
We propose a novel transfer learning algorithm, introducing the idea of Target-awareness REpresentation Disentanglement (TRED) TRED disentangles the relevant knowledge with respect to the target task from the original source model and used as a regularizer during fine-tuning the target model. Experiments on various real world datasets show that our method stably improves the standard fine-tuning by more than 2% in average.
arXiv Detail & Related papers (2020-10-16T17:45:08Z)
Forward and inverse reinforcement learning sharing network weights and hyperparameters [3.705785916791345]
ERIL combines forward and inverse reinforcement learning (RL) under the framework of an entropy-regularized Markov decision process. A forward RL step minimizes the reverse KL estimated by the inverse RL step. We show that minimizing the reverse KL divergence is equivalent to finding an optimal policy.
arXiv Detail & Related papers (2020-08-17T13:12:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.