HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy
- URL: http://arxiv.org/abs/2510.00695v2
- Date: Thu, 02 Oct 2025 06:41:44 GMT
- Title: HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy
- Authors: Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin
- Abstract summary: HAMLET is a framework to adapt Vision-Language-Action models to attend to the historical context during action prediction. We show that HAMLET successfully transforms a state-of-the-art VLA into a history-aware policy. On top of GR00T N1.5, HAMLET achieves an average success rate of 76.4% on history-dependent real-world tasks.
- Score: 61.668591984635846
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inherently, robotic manipulation tasks are history-dependent: leveraging past context could be beneficial. However, most existing Vision-Language-Action models (VLAs) have been designed without considering this aspect, i.e., they rely solely on the current observation, ignoring preceding context. In this paper, we propose HAMLET, a scalable framework to adapt VLAs to attend to the historical context during action prediction. Specifically, we introduce moment tokens that compactly encode perceptual information at each timestep. Their representations are initialized with time-contrastive learning, allowing them to better capture temporally distinctive aspects. Next, we employ a lightweight memory module that integrates the moment tokens across past timesteps into memory features, which are then leveraged for action prediction. Through empirical evaluation, we show that HAMLET successfully transforms a state-of-the-art VLA into a history-aware policy, especially demonstrating significant improvements on long-horizon tasks that require historical context. In particular, on top of GR00T N1.5, HAMLET achieves an average success rate of 76.4% on history-dependent real-world tasks, surpassing the baseline performance by 47.2%. Furthermore, HAMLET pushes prior art performance from 64.1% to 66.4% on RoboCasa Kitchen (100-demo setup) and from 95.6% to 97.7% on LIBERO, highlighting its effectiveness even under generic robot-manipulation benchmarks.
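The two mechanisms named in the abstract, time-contrastive initialization of moment tokens and a memory module that integrates them across timesteps, can be illustrated with a minimal pure-Python sketch. Everything below is an assumption made for illustration: the abstract does not specify the loss form or the memory architecture, so this uses a generic InfoNCE-style contrastive loss and a single softmax-attention read, and the function names `time_contrastive_loss` and `aggregate_memory` are hypothetical rather than taken from the HAMLET code.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def time_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in for the paper's objective).

    The anchor is pulled toward a temporally nearby embedding (positive)
    and pushed away from temporally distant ones (negatives).
    """
    a = normalize(anchor)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(a, c) / temperature for c in candidates]
    # The positive sits at index 0, so the loss is its negative log-probability.
    return -math.log(softmax(logits)[0])


def aggregate_memory(moment_tokens):
    """Assumed memory read: the latest moment token attends over all
    moment tokens (past and current) and returns their weighted average."""
    query = moment_tokens[-1]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, key) / scale for key in moment_tokens])
    return [
        sum(w * token[i] for w, token in zip(weights, moment_tokens))
        for i in range(len(query))
    ]


if __name__ == "__main__":
    # Toy 2-D moment tokens for three timesteps.
    history = [[1.0, 0.0], [0.8, 0.2], [0.1, 0.9]]
    print(time_contrastive_loss(history[0], history[1], [history[2]]))
    print(aggregate_memory(history))
```

In this sketch the loss is small when the anchor is closer to its temporal neighbor than to distant frames, and the memory feature has the same dimensionality as a single moment token, so it could be passed to an action head alongside the current observation. A learned memory module would use trained projections instead of raw dot products.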
Related papers
- Non-Markovian Long-Horizon Robot Manipulation via Keyframe Chaining [56.62125584296097]
Keyframe-Chaining VLA is a framework that extracts and links key historical frames to model long-horizon dependencies. We design a progress-aware mechanism that dynamically retrieves historical frames based on their temporal relevance to the current execution phase. We introduce a suite of four Non-Markovian manipulation tasks built upon the ManiSkill simulator to measure task success rates.
arXiv Detail & Related papers (2026-03-02T05:26:29Z)
- DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models [25.91822750707556]
Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation. VLA models may overly attend to image tokens in the task-irrelevant region, which we describe as 'distracting tokens'. This behavior can disturb the model from the generation of the desired action tokens in each step, affecting the success rate of tasks.
arXiv Detail & Related papers (2026-01-22T16:02:56Z)
- Atomic Action Slicing: Planner-Aligned Options for Generalist VLA Agents [2.027211672314502]
Current vision-action models generalize poorly when tasks require new compositions of skills or objects. We introduce Atomic Action Slicing (AAS), a planner-aligned approach that decomposes long-horizon demonstrations into short, typed atomic actions. AAS produces a validated dataset of 2,124 atomic segments labeled with action type, temporal span, and confidence.
arXiv Detail & Related papers (2025-12-12T14:14:27Z)
- Don't Just Chase "Highlighted Tokens" in MLLMs: Revisiting Visual Holistic Context Retention [50.97683288777336]
Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention. We propose HoloV, a plug-and-play visual token pruning framework for efficient inference.
arXiv Detail & Related papers (2025-10-03T11:33:40Z)
- MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation [59.31354761628506]
Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it. We propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. We evaluate it on 150+ simulation and real-world tasks across three robots.
arXiv Detail & Related papers (2025-08-26T17:57:16Z)
- CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation [100.25567121604382]
Vision-Language-Action (VLA) models have improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. We present a new advanced VLA architecture derived from Vision-Language-Models (VLM). We show that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds.
arXiv Detail & Related papers (2024-11-29T12:06:03Z)
- VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning [59.68917139718813]
We show that a strong off-the-shelf frozen pretrained visual encoder can achieve state-of-the-art (SoTA) performance in forecasting and procedural planning.
By conditioning on frozen clip-level embeddings from observed steps to predict the actions of unseen steps, our prediction model is able to learn robust representations for forecasting.
arXiv Detail & Related papers (2024-10-04T14:52:09Z)
- HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization [3.187381965457262]
We introduce the History-Augmented Anchor Transformer (HAT) Framework for OnTAL.
By integrating historical context, our framework enhances the synergy between long-term and short-term information.
We evaluate our model on both procedural egocentric (PREGO) datasets and standard non-PREGO OnTAL datasets.
arXiv Detail & Related papers (2024-08-12T18:29:48Z)
- The future is different: Large pre-trained language models fail in prediction tasks [2.9005223064604078]
We introduce four new REDDIT datasets, namely the WALLSTREETBETS, ASKSCIENCE, THE DONALD, and POLITICS sub-reddits.
First, we empirically demonstrate that LPLM can display average performance drops of about 88% when predicting the popularity of future posts from sub-reddits whose topic distribution changes with time.
We then introduce a simple methodology that leverages neural variational dynamic topic models and attention mechanisms to infer temporal language model representations for regression tasks.
arXiv Detail & Related papers (2022-11-01T11:01:36Z)
- FCM: Forgetful Causal Masking Makes Causal Language Models Better Zero-Shot Learners [139.6321017962092]
We propose a simple technique that significantly boosts the performance of large language models without adding computational cost.
Our key observation is that, by performing the next token prediction task with randomly selected past tokens masked out, we can improve the quality of the learned representations.
Experimental results show that our method also improves PaLM's zero and few-shot performance on a diverse suite of tasks.
arXiv Detail & Related papers (2022-10-24T17:46:57Z)
- Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition [66.47000813920619]
We propose a non-autoregressive end-to-end speech recognition system called LASO.
Because of the non-autoregressive property, LASO predicts a textual token in the sequence without the dependence on other tokens.
We conduct experiments on the publicly available Chinese dataset AISHELL-1.
arXiv Detail & Related papers (2020-05-11T04:45:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.