Can We Really Learn One Representation to Optimize All Rewards?
- URL: http://arxiv.org/abs/2602.11399v1
- Date: Wed, 11 Feb 2026 22:06:25 GMT
- Title: Can We Really Learn One Representation to Optimize All Rewards?
- Authors: Chongyi Zheng, Royina Karegoudra Jayanth, Benjamin Eysenbach,
- Abstract summary: Recent work on forward-backward (FB) representation learning argues that unsupervised pre-training can enable optimal control over arbitrary rewards without further fine-tuning. Our analysis suggests a simplified unsupervised pre-training method for reinforcement learning that, instead of enabling optimal control, performs one step of policy improvement. Experiments in didactic settings, as well as in $10$ state-based and image-based continuous control domains, demonstrate that one-step FB converges to errors $10^5$ smaller and improves zero-shot performance by $+24\%$ on average.
- Score: 31.057669391671144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As machine learning has moved towards leveraging large models as priors for downstream tasks, the community has debated the right form of prior for solving reinforcement learning (RL) problems. If one were to try to prefetch as much computation as possible, they would attempt to learn a prior over the policies for some yet-to-be-determined reward function. Recent work (forward-backward (FB) representation learning) has tried this, arguing that an unsupervised representation learning procedure can enable optimal control over arbitrary rewards without further fine-tuning. However, FB's training objective and learning behavior remain mysterious. In this paper, we demystify FB by clarifying when such representations can exist, what its objective optimizes, and how it converges in practice. We draw connections with rank matching, fitted Q-evaluation, and contraction mapping. Our analysis suggests a simplified unsupervised pre-training method for RL that, instead of enabling optimal control, performs one step of policy improvement. We call our proposed method $\textbf{one-step forward-backward representation learning (one-step FB)}$. Experiments in didactic settings, as well as in $10$ state-based and image-based continuous control domains, demonstrate that one-step FB converges to errors $10^5$ smaller and improves zero-shot performance by $+24\%$ on average. Our project website is available at https://chongyi-zheng.github.io/onestep-fb.
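For context, the zero-shot control recipe that FB representations are built around (introduced in the last related paper below, "Learning One Representation to Optimize All Rewards") can be summarized in three equations; the notation $F$, $B$, $\rho$, and $z_r$ follows that line of work and is given here only as background, not as a restatement of this paper's analysis. The successor measure of the policy $\pi_z$ is factorized as $M^{\pi_z}(s_0, a_0, \mathrm{d}s) \approx F(s_0, a_0, z)^\top B(s)\,\rho(\mathrm{d}s)$; a reward $r$ specified a posteriori is embedded as $z_r = \mathbb{E}_{s \sim \rho}[\,r(s)\,B(s)\,]$; and the agent then acts with the policy $\pi_{z_r}$, whose $Q$-values are $Q^{\pi_{z_r}}(s_0, a_0) = F(s_0, a_0, z_r)^\top z_r$, with no further fine-tuning.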
Related papers
- What Can You Do When You Have Zero Rewards During RL? [3.0795668932789515]
Reinforcement learning (RL) with outcome-based rewards has proven effective for improving large language models (LLMs) on complex reasoning tasks. We study this scenario through the graph search task introduced in Bachmann et al. (2024) and evaluate recent methods that incorporate desirable components. We find that a simple data-centric intervention of adding easier samples to the training set enables the model to eventually solve the original hard task despite starting from zero reward.
arXiv Detail & Related papers (2025-10-04T23:10:38Z) - Deep Reinforcement Learning with Gradient Eligibility Traces [28.93284550303061]
In this paper, we extend the generalized $\overline{\text{PBE}}$ objective to support multi-step credit assignment based on the $\lambda$-return. We provide both a forward-view formulation compatible with experience replay and a backward-view formulation compatible with streaming algorithms. We evaluate the proposed algorithms and show that they outperform both PPO and StreamQ in MuJoCo and MinAtar environments.
arXiv Detail & Related papers (2025-07-12T00:12:05Z) - The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning [37.13807960501503]
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs). We decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR). We show that NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model's prior beliefs.
arXiv Detail & Related papers (2025-06-02T06:10:54Z) - Highway Reinforcement Learning [35.980387097763035]
Learning from multi-step off-policy data collected by a set of policies is a core problem of reinforcement learning (RL)
We introduce a novel, IS-free, multi-step off-policy method that avoids the underestimation issue and converges to the optimal VF.
It gives rise to a novel family of off-policy RL algorithms that safely learn even when $n$ is very large.
arXiv Detail & Related papers (2024-05-28T15:42:45Z) - Fast Propagation is Better: Accelerating Single-Step Adversarial
Training via Sampling Subnetworks [69.54774045493227]
A drawback of adversarial training is the computational overhead introduced by the generation of adversarial examples.
We propose to exploit the interior building blocks of the model to improve efficiency.
Compared with previous methods, our method not only reduces the training cost but also achieves better model robustness.
arXiv Detail & Related papers (2023-10-24T01:36:20Z) - Task-Robust Pre-Training for Worst-Case Downstream Adaptation [62.05108162160981]
Pre-training has achieved remarkable success when transferred to downstream tasks.
This paper considers pre-training a model that guarantees a uniformly good performance over the downstream tasks.
arXiv Detail & Related papers (2023-06-21T07:43:23Z) - DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion
Models [97.31200133440308]
We propose using online reinforcement learning to fine-tune text-to-image models.
We focus on diffusion models, defining the fine-tuning task as an RL problem.
Our approach, coined DPOK, integrates policy optimization with KL regularization.
arXiv Detail & Related papers (2023-05-25T17:35:38Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED)
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret
and Policy Switches [84.54669549718075]
We study the problem of regret minimization for episodic Reinforcement Learning (RL)
We focus on learning with general function classes and general model classes.
We show that a logarithmic regret bound is realizable by algorithms with $O(\log T)$ switching cost.
arXiv Detail & Related papers (2022-03-03T02:55:55Z) - Online Sub-Sampling for Reinforcement Learning with General Function
Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with complexity-bounded function class, we show that the policy only needs to be updated for $\propto \operatorname{polylog}(K)$ times.
In contrast to existing approaches that update the policy for at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
arXiv Detail & Related papers (2021-06-14T07:36:25Z) - Learning One Representation to Optimize All Rewards [19.636676744015197]
We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process.
It provides explicit near-optimal policies for any reward specified a posteriori.
This is a step towards learning controllable agents in arbitrary black-box environments.
arXiv Detail & Related papers (2021-03-14T15:00:08Z)
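As a concrete illustration of "near-optimal policies for any reward specified a posteriori," the short sketch below shows how a reward embedding and a greedy action could be computed once forward and backward embeddings are available. The functions F and B here are toy linear stand-ins, and every name in this snippet is hypothetical rather than taken from either paper's code.

import numpy as np

# Hypothetical sketch (not the authors' code): zero-shot reward inference with a
# pre-trained forward-backward (FB) representation. F maps (state, action, z) to a
# d-dimensional forward embedding and B maps a state to a d-dimensional backward
# embedding; toy linear stand-ins are used so the snippet runs on its own.
d = 8
rng = np.random.default_rng(0)
W_f = rng.normal(size=(d, 4 + 2 + d))   # stand-in parameters for F(s, a, z)
W_b = rng.normal(size=(d, 4))           # stand-in parameters for B(s)

def F(state, action, z):
    return W_f @ np.concatenate([state, action, z])

def B(state):
    return W_b @ state

# 1) Embed a reward specified a posteriori: z_r = E_{s ~ rho}[ r(s) B(s) ],
#    estimated from a batch of states labeled with rewards.
states = rng.normal(size=(128, 4))            # samples standing in for the data distribution rho
rewards = (states[:, 0] > 0.0).astype(float)  # an arbitrary reward labeling
z_r = np.mean(rewards[:, None] * np.stack([B(s) for s in states]), axis=0)

# 2) Act greedily with respect to Q(s, a) = F(s, a, z_r)^T z_r over candidate actions.
state = rng.normal(size=4)
candidate_actions = rng.uniform(-1.0, 1.0, size=(16, 2))
best_action = max(candidate_actions, key=lambda a: F(state, a, z_r) @ z_r)
print("reward embedding z_r:", np.round(z_r, 3))
print("greedy action:", np.round(best_action, 3))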
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.