Non-Markovian Long-Horizon Robot Manipulation via Keyframe Chaining
- URL: http://arxiv.org/abs/2603.01465v1
- Date: Mon, 02 Mar 2026 05:26:29 GMT
- Title: Non-Markovian Long-Horizon Robot Manipulation via Keyframe Chaining
- Authors: Yipeng Chen, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li, Guoli Yang, Heng Tao Shen
- Abstract summary: Keyframe-Chaining VLA is a framework that extracts and links key historical frames to model long-horizon dependencies. We design a progress-aware mechanism that dynamically retrieves historical frames based on their temporal relevance to the current execution phase. We introduce a suite of four Non-Markovian manipulation tasks built upon the ManiSkill simulator to measure task success rates.
- Score: 56.62125584296097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing Vision-Language-Action (VLA) models often struggle to generalize to long-horizon tasks due to their heavy reliance on immediate observations. While recent studies incorporate retrieval mechanisms or extend context windows to handle procedural tasks, they often fail to capture Non-Markovian dependencies, where the optimal action depends on specific past states rather than on the current observation alone. To address this, we introduce Keyframe-Chaining VLA, a framework that extracts and links key historical frames to model long-horizon dependencies. Specifically, we propose an automatic keyframe selector that learns a discriminative embedding space, effectively identifying distinct state transitions. To capture task-critical information, we design a progress-aware query mechanism that dynamically retrieves historical frames based on their temporal relevance to the current execution phase. These selected keyframes are integrated into the VLA as interleaved visual tokens, explicitly grounding the policy in the long-horizon temporal context. Finally, we introduce a suite of four Non-Markovian manipulation tasks built upon the ManiSkill simulator to measure task success rates. Experimental results demonstrate that our method achieves superior performance, effectively tackling robot manipulation tasks characterized by long-horizon temporal dependencies. Code is available at https://github.com/cytoplastm/KC-VLA.
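To make the progress-aware retrieval idea from the abstract concrete, the sketch below scores a buffer of past frame embeddings with a query conditioned on the current observation and an estimated task-progress signal, keeps the top-k frames, and concatenates them with the current frame tokens before the policy backbone. This is a minimal illustration under assumed shapes and module names (KeyframeRetriever, progress_proj, etc.); it is not the authors' released implementation, which is available at the repository linked above.

```python
# Hedged sketch of progress-aware keyframe retrieval (assumed design, not the official KC-VLA code).
import torch
import torch.nn as nn


class KeyframeRetriever(nn.Module):
    def __init__(self, dim: int = 512, num_keyframes: int = 4):
        super().__init__()
        self.num_keyframes = num_keyframes
        # Progress-aware query: conditions retrieval on the current
        # execution phase (here, a normalized progress scalar in [0, 1]).
        self.progress_proj = nn.Linear(1, dim)
        self.query_proj = nn.Linear(dim, dim)
        self.key_proj = nn.Linear(dim, dim)

    def forward(self, current: torch.Tensor, history: torch.Tensor,
                progress: torch.Tensor) -> torch.Tensor:
        """current: (B, D) embedding of the current observation
        history: (B, T, D) embeddings of all past frames
        progress: (B, 1) estimated task progress
        returns: (B, K, D) the K retrieved keyframe embeddings
        """
        # The query mixes the current observation with the progress signal,
        # so the same history is scored differently at different phases.
        query = self.query_proj(current + self.progress_proj(progress))   # (B, D)
        keys = self.key_proj(history)                                     # (B, T, D)
        scores = torch.einsum("bd,btd->bt", query, keys) / keys.shape[-1] ** 0.5
        # Hard top-k selection of the most phase-relevant past frames.
        topk = scores.topk(self.num_keyframes, dim=1).indices             # (B, K)
        idx = topk.unsqueeze(-1).expand(-1, -1, history.shape[-1])
        return history.gather(1, idx)                                     # (B, K, D)


# Usage: retrieved keyframes are concatenated with the current frame tokens
# (a stand-in for the interleaved visual tokens fed to the VLA backbone).
retriever = KeyframeRetriever()
cur = torch.randn(2, 512)
hist = torch.randn(2, 50, 512)
prog = torch.rand(2, 1)
keyframes = retriever(cur, hist, prog)                        # (2, 4, 512)
vla_input = torch.cat([keyframes, cur.unsqueeze(1)], dim=1)   # (2, 5, 512)
```

The hard top-k step stands in for whatever selection rule the paper's learned keyframe selector uses; a soft attention pooling over the history would be a drop-in alternative if differentiable selection is needed.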
Related papers
- RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies [54.23445842621374]
Memory is critical for long-horizon and history-dependent robotic manipulation. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms. We introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models.
arXiv Detail & Related papers (2026-03-04T21:59:32Z)
- V-CAGE: Context-Aware Generation and Verification for Scalable Long-Horizon Embodied Tasks [6.820118518027692]
V-CAGE is a closed-loop framework for generating semantically aligned manipulation datasets at scale. We propose a context-aware instantiation mechanism that enforces geometric consistency during scene synthesis. We also employ a hierarchical instruction decomposition module to bridge the gap between abstract intent and low-level control.
arXiv Detail & Related papers (2026-01-21T16:41:51Z)
- Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective [16.541717037293278]
We introduce LIBERO-Mem, a non-Markovian task suite for stress-testing robotic manipulation under object-level partial observability. It combines short- and long-horizon object tracking with temporally sequenced subgoals, requiring reasoning beyond the current frame. We propose Embodied-SlotSSM, a slot-centric VLA framework built for temporal scalability.
arXiv Detail & Related papers (2025-11-14T16:56:01Z)
- ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context [54.58057019521198]
Leveraging temporal context is crucial for success in partially observable robotic tasks. Prior work in behavior cloning has demonstrated inconsistent performance gains when using multi-frame observations. We introduce ContextVLA, a policy model that robustly improves robotic task performance by effectively leveraging multi-frame observations.
arXiv Detail & Related papers (2025-10-05T15:29:57Z)
- CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling [84.51372201195132]
CronusVLA is a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
arXiv Detail & Related papers (2025-06-24T17:30:27Z)
- FindingDory: A Benchmark to Evaluate Memory in Embodied Agents [49.18498389833308]
We introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness.
arXiv Detail & Related papers (2025-06-18T17:06:28Z)
- The Devil is in Temporal Token: High Quality Video Reasoning Segmentation [68.33080352141653]
Methods for video reasoning segmentation rely heavily on a single special token to represent the object in the video. We propose VRS-HQ, an end-to-end video reasoning segmentation approach. Our results highlight the strong temporal reasoning and segmentation capabilities of our method.
arXiv Detail & Related papers (2025-01-15T03:17:24Z)
- Introducing Gating and Context into Temporal Action Detection [0.8987776881291144]
Temporal Action Detection (TAD) remains challenging due to action overlaps and variable action durations.
Recent findings suggest that TAD performance is dependent on the structural design of transformers rather than on the self-attention mechanism.
We propose a refined feature extraction process through lightweight, yet effective operations.
arXiv Detail & Related papers (2024-09-06T11:52:42Z)
- Strike the Balance: On-the-Fly Uncertainty based User Interactions for Long-Term Video Object Segmentation [23.417370317522106]
We introduce a variant of video object segmentation (VOS) that bridges interactive and semi-automatic approaches.
We aim to maximize the tracking duration of an object of interest, while requiring minimal user corrections to maintain tracking over an extended period.
We evaluate our approach using the recently introduced LVOS dataset, which offers numerous long-term videos.
arXiv Detail & Related papers (2024-07-31T21:42:42Z)
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
- LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation [38.66406497318709]
This work focuses on the tabletop manipulation task and releases a simulation benchmark, LoHoRavens, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetic and reference.
We investigate two methods of bridging the modality gap: caption generation and a learnable interface for incorporating explicit and implicit observation feedback to the LLM.
arXiv Detail & Related papers (2023-10-18T14:53:14Z)