Related papers: BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames

URL: http://arxiv.org/abs/2602.15010v2
Date: Wed, 18 Feb 2026 07:07:11 GMT
Title: BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames
Authors: Max Sobol Mark, Jacky Liang, Maria Attarian, Chuyuan Fu, Debidatta Dwibedi, Dhruv Shah, Aviral Kumar,
Abstract summary: The best-performing robot policies typically condition only on the current observation, limiting their applicability to tasks that require attending to past observations. We analyze why policies latch onto spurious correlations and find that this problem stems from limited coverage over the space of possible histories during training. Motivated by these findings, we propose Big Picture Policies (BPP), an approach that conditions on a minimal set of meaningful keyframes detected by a vision-language model.
Score: 27.70479413079641
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Many robot tasks require attending to the history of past observations. For example, finding an item in a room requires remembering which places have already been searched. However, the best-performing robot policies typically condition only on the current observation, limiting their applicability to such tasks. Naively conditioning on past observations often fails due to spurious correlations: policies latch onto incidental features of training histories that do not generalize to out-of-distribution trajectories upon deployment. We analyze why policies latch onto these spurious correlations and find that this problem stems from limited coverage over the space of possible histories during training, which grows exponentially with horizon. Existing regularization techniques provide inconsistent benefits across tasks, as they do not fundamentally address this coverage problem. Motivated by these findings, we propose Big Picture Policies (BPP), an approach that conditions on a minimal set of meaningful keyframes detected by a vision-language model. By projecting diverse rollouts onto a compact set of task-relevant events, BPP substantially reduces distribution shift between training and deployment, without sacrificing expressivity. We evaluate BPP on four challenging real-world manipulation tasks and three simulation tasks, all requiring history conditioning. BPP achieves 70% higher success rates than the best comparison on real-world evaluations. Videos are available at https://bigpicturepolicies.github.io/
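The toy Python sketch below illustrates two ideas from the abstract under simplifying assumptions; it is not the BPP implementation. First, if every timestep can show any of k observations, the number of distinct histories of length T is k^T, so training data covers a vanishing fraction of them as the horizon grows. Second, projecting a rollout onto detected key events makes rollouts that differ only in incidental frames produce the same conditioning context. The names `Frame`, `count_histories`, `keyframe_context`, `toy_detector`, and the string "observations" are hypothetical; in BPP the keyframes are detected by a vision-language model, which `toy_detector` only stands in for.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Frame:
    """One timestep of a rollout; `description` stands in for a camera image."""
    t: int
    description: str


def count_histories(num_distinct_obs: int, horizon: int) -> int:
    """Distinct observation histories of length `horizon` when every step
    can show any of `num_distinct_obs` observations: k ** T."""
    return num_distinct_obs ** horizon


def select_keyframes(
    history: Sequence[Frame], is_key_event: Callable[[Frame], bool]
) -> List[Frame]:
    """Keep only the frames the detector flags as task-relevant events."""
    return [f for f in history if is_key_event(f)]


def full_context(history: Sequence[Frame], current: Frame) -> Tuple[str, ...]:
    """Naive history conditioning: every past frame plus the current one."""
    return tuple(f.description for f in history) + (current.description,)


def keyframe_context(
    history: Sequence[Frame], current: Frame, is_key_event: Callable[[Frame], bool]
) -> Tuple[str, ...]:
    """Keyframe conditioning: detected events plus the current frame.
    Timestamps are dropped, so timing differences do not change the context."""
    keys = select_keyframes(history, is_key_event)
    return tuple(f.description for f in keys) + (current.description,)


# Hypothetical detector standing in for the vision-language model.
KEY_EVENTS = {"searched drawer A", "searched drawer B"}


def toy_detector(frame: Frame) -> bool:
    return frame.description in KEY_EVENTS


# Two rollouts of a search task that differ only in incidental frames.
rollout_1 = [
    Frame(0, "arm idle"),
    Frame(1, "searched drawer A"),
    Frame(2, "arm moving left"),
    Frame(3, "searched drawer B"),
]
rollout_2 = [
    Frame(0, "arm moving right"),
    Frame(1, "arm paused"),
    Frame(2, "searched drawer A"),
    Frame(3, "arm idle"),
    Frame(4, "searched drawer B"),
]

if __name__ == "__main__":
    print(count_histories(10, 20))  # 10**20 possible length-20 histories
    now_1 = Frame(4, "gripper above drawer C")
    now_2 = Frame(5, "gripper above drawer C")
    # Full histories differ, so a policy could key on the incidental frames.
    print(full_context(rollout_1, now_1) == full_context(rollout_2, now_2))  # False
    # Keyframe contexts coincide: both rollouts searched A, then B.
    print(keyframe_context(rollout_1, now_1, toy_detector)
          == keyframe_context(rollout_2, now_2, toy_detector))  # True
```

Dropping timestamps is a deliberate choice in this toy: it is what lets the two rollouts, which reach the same events at different times, collapse onto one context.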

Related papers

Prepare Before You Act: Learning From Humans to Rearrange Initial States [4.637185817866919]
Imitation learning (IL) has proven effective across a wide range of manipulation tasks. We propose ReSET, an algorithm that takes initial states and autonomously modifies object poses so that the restructured scene is similar to training data.
arXiv Detail & Related papers (2025-09-22T17:18:52Z)
Exploiting Policy Idling for Dexterous Manipulation [19.909895138745345]
We investigate how to leverage the detectability of idling behavior to inform exploration and policy improvement. Our approach, Pause-Induced Perturbations (PIP), applies perturbations at detected idling states. On a range of challenging simulated dual-arm tasks, we find that this simple approach can already noticeably improve test-time performance.
arXiv Detail & Related papers (2025-08-21T15:52:45Z)
Learning Long-Context Diffusion Policies via Past-Token Prediction [48.86967836229684]
We propose an alternative approach that explicitly regularizes the retention of past information. We introduce Past-Token Prediction, an auxiliary task in which the policy learns to predict past action tokens alongside future ones. Experiments across four real-world and six simulated tasks demonstrate that our proposed method improves the performance of long-context diffusion policies by 3x and accelerates policy training by more than 10x.
arXiv Detail & Related papers (2025-05-14T17:00:47Z)
STRAP: Robot Sub-Trajectory Retrieval for Augmented Policy Learning [8.860366821983211]
STRAP is a technique for leveraging pre-trained vision foundation models and dynamic time warping to retrieve sub-sequences of trajectories from large training corpora in a robust fashion.
arXiv Detail & Related papers (2024-12-19T18:54:06Z)
P-RAG: Progressive Retrieval Augmented Generation For Planning on Embodied Everyday Task [94.08478298711789]
Embodied Everyday Task is a popular task in the embodied AI community. Natural language instructions often lack explicit task planning. Extensive training is required to equip models with knowledge of the task environment.
arXiv Detail & Related papers (2024-09-17T15:29:34Z)
Bidirectional Decoding: Improving Action Chunking via Guided Test-Time Sampling [51.38330727868982]
We show how action chunking impacts the divergence between a learner and a demonstrator. We propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop adaptation. Our method boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
arXiv Detail & Related papers (2024-08-30T15:39:34Z)
Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies. Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
Unsupervised Dense Retrieval with Relevance-Aware Contrastive Pre-Training [81.3781338418574]
We propose relevance-aware contrastive learning. We consistently improve the SOTA unsupervised Contriever model on the BEIR and open-domain QA retrieval benchmarks. Our method not only beats BM25 after further pre-training on the target corpus but also serves as a good few-shot learner.
arXiv Detail & Related papers (2023-06-05T18:20:27Z)
Prototype-Sample Relation Distillation: Towards Replay-Free Continual Learning [14.462797749666992]
We propose a holistic approach to jointly learn the representation and class prototypes. We propose a novel distillation loss that constrains class prototypes to maintain relative similarities as compared to new task data. This method yields state-of-the-art performance in the task-incremental setting.
arXiv Detail & Related papers (2023-03-26T16:35:45Z)
COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning [78.13740204156858]
We show that we can reuse prior data to extend new skills simply through dynamic programming. We demonstrate the effectiveness of our approach by chaining together several behaviors seen in prior datasets for solving a new task. We train our policies in an end-to-end fashion, mapping high-dimensional image observations to low-level robot control commands.
arXiv Detail & Related papers (2020-10-27T17:57:29Z)
Belief-Grounded Networks for Accelerated Robot Learning under Partial Observability [13.080765595494213]
We propose a method for policy learning under partial observability called the Belief-Grounded Network (BGN). BGN incentivizes a neural network to concisely summarize its input history. It outperforms all other tested methods, and its learned policies work well when transferred onto a physical robot.
arXiv Detail & Related papers (2020-10-19T02:02:21Z)
DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
First, we show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled. Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step. Third, we show that ideas in the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy to prevent deterioration of performance.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback. We propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.