RoboSSM: Scalable In-context Imitation Learning via State-Space Models
- URL: http://arxiv.org/abs/2509.19658v1
- Date: Wed, 24 Sep 2025 00:26:15 GMT
- Title: RoboSSM: Scalable In-context Imitation Learning via State-Space Models
- Authors: Youngju Yoo, Jiaheng Hu, Yifeng Zhu, Bo Liu, Qiang Liu, Roberto Martín-Martín, Peter Stone
- Abstract summary: In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. Recent ICIL methods rely on Transformers, which have computational limitations. We introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models.
- Score: 35.91619896213736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. By eliminating the need for parameter updates at deployment time, this paradigm supports few-shot adaptation to novel tasks. However, recent ICIL methods rely on Transformers, which have computational limitations and tend to underperform when handling longer prompts than those seen during training. In this work, we introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models (SSM). Specifically, RoboSSM replaces Transformers with Longhorn -- a state-of-the-art SSM that provides linear-time inference and strong extrapolation capabilities, making it well-suited for long-context prompts. We evaluate our approach on the LIBERO benchmark and compare it against strong Transformer-based ICIL baselines. Experiments show that RoboSSM extrapolates effectively to varying numbers of in-context demonstrations, yields high performance on unseen tasks, and remains robust in long-horizon scenarios. These results highlight the potential of SSMs as an efficient and scalable backbone for ICIL. Our code is available at https://github.com/youngjuY/RoboSSM.
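The abstract's central claim is that an SSM backbone gives linear-time inference and graceful extrapolation to prompts longer than those seen in training. As a rough illustration of why, here is a minimal sketch of a generic diagonal linear state-space recurrence: the recurrent state has a fixed size, so each additional prompt token costs constant work, unlike Transformer attention, whose per-step cost grows with context length. This is not Longhorn's actual update rule and not the authors' code; all names and shapes below are hypothetical.

```python
# Illustrative sketch (not the authors' code): a generic diagonal
# linear SSM scan, showing the O(T) time / O(1)-state-per-step
# property that the abstract credits for long-prompt scalability.
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a diagonal linear SSM over an input sequence.

    x: (T, d_in) inputs (e.g., tokenized demonstrations + observations)
    A: (d_state,) diagonal transition coefficients, |A| < 1 for stability
    B: (d_state, d_in) input projection
    C: (d_out, d_state) output projection
    Returns y: (T, d_out) per-step features (e.g., for an action head).
    """
    T = x.shape[0]
    h = np.zeros(A.shape[0])           # recurrent state, fixed size
    y = np.empty((T, C.shape[0]))
    for t in range(T):                 # one O(d) update per step -> O(T) total
        h = A * h + B @ x[t]           # state summarizes the full prefix
        y[t] = C @ h
    return y

# Usage: a prompt of k demonstrations can grow arbitrarily long,
# but the state h stays fixed-size regardless of prompt length.
rng = np.random.default_rng(0)
T, d_in, d_state, d_out = 512, 16, 64, 8
y = ssm_scan(rng.normal(size=(T, d_in)),
             A=np.full(d_state, 0.95),
             B=rng.normal(size=(d_state, d_in)) * 0.1,
             C=rng.normal(size=(d_out, d_state)) * 0.1)
print(y.shape)  # (512, 8)
```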
Related papers
- Scaling Tasks, Not Samples: Mastering Humanoid Control through Multi-Task Model-Based Reinforcement Learning [49.82882141491629]
We argue that effective online learning should scale the number of tasks, rather than the number of samples per task. This regime reveals a structural advantage of model-based reinforcement learning. We instantiate this idea with EfficientZero-Multitask (EZ-M), a sample-efficient multi-task algorithm for online learning.
arXiv Detail & Related papers (2026-03-02T05:07:43Z) - AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making [35.83717913117858]
AntiGrounding is a new framework that reverses the instruction grounding process. It lifts candidate actions directly into the VLM representation space, renders trajectories from multiple views, and uses structured visual question answering for instruction-based decision making.
arXiv Detail & Related papers (2025-06-14T07:11:44Z) - Birdie: Advancing State Space Models with Reward-Driven Objectives and Curricula [23.071384759427072]
State space models (SSMs) offer advantages over Transformers but struggle with tasks requiring long-range in-context retrieval, such as text copying, associative recall, and question answering over long contexts. We propose a novel training procedure, Birdie, that significantly enhances the in-context retrieval capabilities of SSMs without altering their architecture.
arXiv Detail & Related papers (2024-11-01T21:01:13Z) - In-Context Learning Enables Robot Action Prediction in LLMs [52.285739178561705]
We introduce RoboPrompt, a framework that enables off-the-shelf text-only Large Language Models to directly predict robot actions. RoboPrompt shows stronger performance over zero-shot and ICL baselines in simulated and real-world settings.
arXiv Detail & Related papers (2024-10-16T17:56:49Z) - LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments yield strong performance, demonstrating that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z) - Interactive Planning Using Large Language Models for Partially
Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic
Tabletop Manipulation [38.66406497318709]
This work focuses on the tabletop manipulation task and releases a simulation benchmark, LoHoRavens, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetic, and reference.
We investigate two methods of bridging the modality gap: caption generation and learnable interface for incorporating explicit and implicit observation feedback to the LLM.
arXiv Detail & Related papers (2023-10-18T14:53:14Z) - VIMA: General Robot Manipulation with Multimodal Prompts [82.01214865117637]
We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts.
We develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks.
We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively.
arXiv Detail & Related papers (2022-10-06T17:50:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.