Learning Long-Context Diffusion Policies via Past-Token Prediction
- URL: http://arxiv.org/abs/2505.09561v2
- Date: Mon, 19 May 2025 20:37:41 GMT
- Title: Learning Long-Context Diffusion Policies via Past-Token Prediction
- Authors: Marcel Torne, Andy Tang, Yuejiang Liu, Chelsea Finn
- Abstract summary: We propose an alternative approach that explicitly regularizes the retention of past information. We introduce Past-Token Prediction, an auxiliary task in which the policy learns to predict past action tokens alongside future ones. Experiments across four real-world and six simulated tasks demonstrate that our proposed method improves the performance of long-context diffusion policies by 3x and accelerates policy training by more than 10x.
- Score: 48.86967836229684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning over long sequences of observations and actions is essential for many robotic tasks. Yet, learning effective long-context policies from demonstrations remains challenging. As context length increases, training becomes increasingly expensive due to rising memory demands, and policy performance often degrades as a result of spurious correlations. Recent methods typically sidestep these issues by truncating context length, discarding historical information that may be critical for subsequent decisions. In this paper, we propose an alternative approach that explicitly regularizes the retention of past information. We first revisit the copycat problem in imitation learning and identify an opposite challenge in recent diffusion policies: rather than over-relying on prior actions, they often fail to capture essential dependencies between past and future actions. To address this, we introduce Past-Token Prediction (PTP), an auxiliary task in which the policy learns to predict past action tokens alongside future ones. This regularization significantly improves temporal modeling in the policy head, with minimal reliance on visual representations. Building on this observation, we further introduce a multistage training strategy: pre-train the visual encoder with short contexts, and fine-tune the policy head using cached long-context embeddings. This strategy preserves the benefits of PTP while greatly reducing memory and computational overhead. Finally, we extend PTP into a self-verification mechanism at test time, enabling the policy to score and select candidates consistent with past actions during inference. Experiments across four real-world and six simulated tasks demonstrate that our proposed method improves the performance of long-context diffusion policies by 3x and accelerates policy training by more than 10x.
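The abstract describes three concrete mechanisms: the PTP auxiliary objective, a multistage training recipe with cached embeddings, and a test-time self-verification step. The sketch below illustrates only the first of these, as a PyTorch-style diffusion-policy training step in which past action tokens are denoised alongside future ones; the module names, the `add_noise` scheduler interface, and the loss weighting are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def ptp_diffusion_loss(visual_encoder, policy_head, noise_scheduler,
                       obs_history, past_actions, future_actions,
                       past_weight=1.0):
    """Illustrative Past-Token Prediction (PTP) loss for a diffusion policy.

    The policy head denoises a sequence containing *past* action tokens as
    well as the future action chunk, so the training signal explicitly
    rewards retaining information about earlier actions (hypothetical sketch).
    """
    # Encode the observation history; in the multistage recipe from the
    # abstract, these embeddings would be precomputed and cached.
    context = visual_encoder(obs_history)                       # (B, T, D)

    # Concatenate past and future action tokens into one sequence.
    actions = torch.cat([past_actions, future_actions], dim=1)  # (B, P+H, A)

    # Standard diffusion training: corrupt the tokens at a random timestep
    # and predict the added noise, conditioned on the visual context.
    # (`add_noise` / `num_steps` follow a diffusers-style interface and are
    # assumptions, not the paper's API.)
    noise = torch.randn_like(actions)
    t = torch.randint(0, noise_scheduler.num_steps,
                      (actions.shape[0],), device=actions.device)
    noisy_actions = noise_scheduler.add_noise(actions, noise, t)
    pred = policy_head(noisy_actions, t, context)

    # Supervise the past segment alongside the future one.
    P = past_actions.shape[1]
    loss_past = F.mse_loss(pred[:, :P], noise[:, :P])
    loss_future = F.mse_loss(pred[:, P:], noise[:, P:])
    return loss_future + past_weight * loss_past
```

At inference, the abstract's self-verification idea reuses the same past tokens: sample several candidate action sequences, score each by how closely its reconstructed past tokens match the actions that were actually executed, and keep the most consistent candidate.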
Related papers
- Customize Multi-modal RAI Guardrails with Precedent-based predictions [55.63757336900865]
A multi-modal guardrail must effectively filter image content based on user-defined policies. Existing fine-tuning methods typically condition predictions on pre-defined policies. We propose to condition the model's judgment on "precedents", which are the reasoning processes of prior data points similar to the given input.
arXiv Detail & Related papers (2025-07-28T03:45:34Z) - Anytime-valid off-policy inference for contextual bandits [34.721189269616175]
Contextual bandit algorithms map observed contexts $X_t$ to actions $A_t$ over time.
It is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data.
We present a comprehensive framework for OPE inference that relaxes unnecessary conditions made in some past works.
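For reference, the importance-weighted (IPS) estimator that underlies this kind of off-policy evaluation fits in a few lines; the anytime-valid confidence-sequence machinery that is the paper's actual contribution is not reproduced here, and the `target_policy` interface is an assumption.

```python
import numpy as np

def ips_value_estimate(contexts, actions, rewards, logging_probs, target_policy):
    """Inverse-propensity-scoring estimate of a target policy's value from
    logged contextual-bandit data (generic sketch, not the paper's estimator).

    target_policy(x, a) is assumed to return the probability that the target
    policy chooses action a in context x.
    """
    weights = np.array([target_policy(x, a) / p
                        for x, a, p in zip(contexts, actions, logging_probs)])
    return float(np.mean(weights * np.asarray(rewards)))
```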
arXiv Detail & Related papers (2022-10-19T17:57:53Z) - Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step
Q-learning: A Novel Correction Approach [0.0]
We introduce a novel policy similarity measure to mitigate the effects of such discrepancy in continuous control.
Our method offers an adequate single-step off-policy correction that is applicable to deterministic policy networks.
arXiv Detail & Related papers (2022-08-01T11:33:12Z) - Lifelong Hyper-Policy Optimization with Multiple Importance Sampling
Regularization [40.17392342387002]
We propose an approach which learns a hyper-policy, whose input is time, that outputs the parameters of the policy to be queried at that time.
This hyper-policy is trained to maximize the estimated future performance, efficiently reusing past data by means of importance sampling.
We empirically validate our approach, in comparison with state-of-the-art algorithms, on realistic environments.
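A minimal reading of the hyper-policy idea is a small network that maps time to the parameters of the policy to be deployed at that time; the sketch below shows only this mapping, and the importance-sampling performance estimate used to train it is omitted.

```python
import torch
import torch.nn as nn

class HyperPolicy(nn.Module):
    """Maps a (normalized) time index to the parameter vector of a small
    policy (illustrative sketch of the hyper-policy idea, not the paper's code)."""

    def __init__(self, policy_param_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(),
            nn.Linear(hidden, policy_param_dim),
        )

    def forward(self, t):
        # t: (B, 1) tensor of time indices; output: (B, policy_param_dim).
        return self.net(t)
```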
arXiv Detail & Related papers (2021-12-13T13:09:49Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
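Implicit Q-Learning stays in-sample by fitting a state-value function with expectile regression against Q-values evaluated only at dataset actions; the simplified loss below illustrates that mechanism and is not the authors' implementation.

```python
import torch

def expectile_value_loss(q_values, v_values, tau=0.7):
    """Expectile regression of V(s) toward Q(s, a) at dataset actions only,
    so no out-of-dataset actions are ever queried (simplified IQL-style sketch).
    """
    diff = q_values.detach() - v_values
    # Weight positive errors by tau and negative errors by (1 - tau).
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()
```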
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - Policy Gradients Incorporating the Future [66.20567145291342]
We introduce a method that allows an agent to "look into the future" without explicitly predicting it.
We propose to allow an agent, during its training on past experience, to observe what actually happened in the future at that time.
This gives our agent the opportunity to utilize rich and useful information about the future trajectory dynamics in addition to the present.
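One simple way to expose "what actually happened" to the learner during training is to give the critic an encoding of the realized future trajectory as privileged input; the sketch below is an illustrative instance of that idea, not the paper's specific architecture.

```python
import torch
import torch.nn as nn

class FutureAwareCritic(nn.Module):
    """Critic that, during training only, also receives an encoding of the
    realized future trajectory (privileged information). An illustrative
    sketch of the 'look into the future' idea, not the paper's architecture."""

    def __init__(self, state_dim, future_dim, hidden=128):
        super().__init__()
        self.future_encoder = nn.GRU(future_dim, hidden, batch_first=True)
        self.value_head = nn.Sequential(
            nn.Linear(state_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, future_traj):
        # future_traj: (B, T, future_dim), the trajectory observed after `state`.
        _, h = self.future_encoder(future_traj)        # h: (1, B, hidden)
        return self.value_head(torch.cat([state, h.squeeze(0)], dim=-1))
```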
arXiv Detail & Related papers (2021-08-04T14:57:11Z) - Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to Bellman optimality and evaluation back-up to take a more conservative update can have much stronger guarantees.
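A common way to make the backup more conservative is to bootstrap from a pessimistic estimate, e.g. an ensemble mean minus a multiple of the ensemble spread; the sketch below shows that generic pattern and is not the paper's exact operator.

```python
import torch

def conservative_td_target(reward, next_q_ensemble, gamma=0.99,
                           penalty=1.0, done=None):
    """Pessimistic one-step target: bootstrap from the ensemble mean minus a
    multiple of the ensemble standard deviation (generic conservative backup,
    not the paper's exact operator).

    next_q_ensemble has shape (num_ensemble, batch).
    """
    mean_q = next_q_ensemble.mean(dim=0)
    std_q = next_q_ensemble.std(dim=0)
    not_done = 1.0 if done is None else (1.0 - done)
    return reward + gamma * not_done * (mean_q - penalty * std_q)
```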
arXiv Detail & Related papers (2020-07-16T09:25:54Z) - DDPG++: Striving for Simplicity in Continuous-control Off-Policy
Reinforcement Learning [95.60782037764928]
First, we show that the simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas in the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy to prevent deterioration of performance.
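The propensity-style correction mentioned in the third point can be illustrated by weighting replayed transitions with the ratio of the current policy's to the behavior policy's action likelihood, clipped for stability; this is a generic sketch, not the paper's exact procedure.

```python
import numpy as np

def propensity_weights(current_logp, behavior_logp, clip=10.0):
    """Importance weights for replayed transitions: exp of the log-likelihood
    gap between the current and the behavior policy, clipped for stability
    (generic sketch, not the paper's exact correction)."""
    ratio = np.exp(np.asarray(current_logp) - np.asarray(behavior_logp))
    return np.clip(ratio, 0.0, clip)
```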
arXiv Detail & Related papers (2020-06-26T20:21:12Z) - Efficient Deep Reinforcement Learning via Adaptive Policy Transfer [50.51637231309424]
A Policy Transfer Framework (PTF) is proposed to accelerate Reinforcement Learning (RL).
Our framework learns when and which source policy is best to reuse for the target policy, and when to terminate it.
Experimental results show it significantly accelerates the learning process and surpasses state-of-the-art policy transfer methods.
arXiv Detail & Related papers (2020-02-19T07:30:57Z)