ConceptACT: Episode-Level Concepts for Sample-Efficient Robotic Imitation Learning
- URL: http://arxiv.org/abs/2601.17135v1
- Date: Fri, 23 Jan 2026 19:24:00 GMT
- Title: ConceptACT: Episode-Level Concepts for Sample-Efficient Robotic Imitation Learning
- Authors: Jakob Karalus, Friedhelm Schwenker,
- Abstract summary: ConceptACT is an extension of Action Chunking with Transformers that leverages episode-level semantic concept annotations during training to improve learning efficiency.<n>We integrate concepts using a modified transformer architecture in which the final encoder layer implements concept-aware cross-attention, supervised to align with human annotations.
- Score: 2.9277370836568264
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Imitation learning enables robots to acquire complex manipulation skills from human demonstrations, but current methods rely solely on low-level sensorimotor data while ignoring the rich semantic knowledge humans naturally possess about tasks. We present ConceptACT, an extension of Action Chunking with Transformers that leverages episode-level semantic concept annotations during training to improve learning efficiency. Unlike language-conditioned approaches that require semantic input at deployment, ConceptACT uses human-provided concepts (object properties, spatial relationships, task constraints) exclusively during demonstration collection, adding minimal annotation burden. We integrate concepts using a modified transformer architecture in which the final encoder layer implements concept-aware cross-attention, supervised to align with human annotations. Through experiments on two robotic manipulation tasks with logical constraints, we demonstrate that ConceptACT converges faster and achieves superior sample efficiency compared to standard ACT. Crucially, we show that architectural integration through attention mechanisms significantly outperforms naive auxiliary prediction losses or language-conditioned models. These results demonstrate that properly integrated semantic supervision provides powerful inductive biases for more efficient robot learning.
Related papers
- ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation [55.467742403416175]
We introduce a physics-driven neural algorithm that translates large-scale motion capture to humanoid embodiments.<n>We learn a unified multimodal controller that supports both dense references and sparse task specifications.<n>Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception.
arXiv Detail & Related papers (2026-03-03T18:59:29Z) - Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation [90.90219129619344]
This paper presents a novel R-prior-S, Recurrent Geometric-priormodal Policy with Spiking features.<n>To ground high-level reasoning in physical reality, we leverage lightweight 2D geometric inductive biases.<n>For the data efficiency issue in robotic action generation, we introduce a Recursive Adaptive Spiking Network.
arXiv Detail & Related papers (2026-01-13T23:36:30Z) - Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation [70.8381970762877]
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in semantic reasoning and task planning.<n>We introduce GRACE, a novel framework that grounds VLM-based reasoning through executable analytic concepts.<n>G GRACE provides a unified and interpretable interface between high-level instruction understanding and low-level robot control.
arXiv Detail & Related papers (2025-10-09T09:08:33Z) - StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation [56.996371714721995]
We propose an unsupervised approach that learns a highly compressed two-token state representation.<n>Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models.<n>We name our method StaMo for its ability to learn generalizable robotic Motion from compact State representation.
arXiv Detail & Related papers (2025-10-06T17:37:24Z) - Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning [21.142247150423863]
We propose an object-centric encoder that performs semantic segmentation and visual representation generation in a coupled manner.<n>To achieve this, we leverage the Slot Attention mechanism and use the SOLV model, pretrained in large out-of-domain datasets.<n>We show that exploiting models pretrained on out-of-domain datasets can benefit this process, and that fine-tuning on datasets depicting human actions can significantly improve performance.
arXiv Detail & Related papers (2025-05-27T09:56:52Z) - A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning [67.72413262980272]
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear.<n>We develop SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck.<n>Our approach achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations.
arXiv Detail & Related papers (2025-03-10T06:18:31Z) - From Real World to Logic and Back: Learning Generalizable Relational Concepts For Long Horizon Robot Planning [16.115874470700113]
We present a method that enables robots to invent symbolic, relational concepts directly from a small number of raw, unsegmented, and unannotated demonstrations.<n>Our framework achieves performance on par with hand-engineered symbolic models, while scaling to execution horizons far beyond training.
arXiv Detail & Related papers (2024-02-19T06:28:21Z) - Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z) - GoferBot: A Visual Guided Human-Robot Collaborative Assembly System [33.649596318580215]
GoferBot is a novel vision-based semantic HRC system for a real-world assembly task.
GoferBot is a novel assembly system that seamlessly integrates all sub-modules by utilising implicit semantic information purely from visual perception.
arXiv Detail & Related papers (2023-04-18T09:09:01Z) - Learning Perceptual Concepts by Bootstrapping from Human Queries [41.07749131023931]
We propose a new approach whereby the robot learns a low-dimensional variant of the concept and uses it to generate a larger data set for learning the concept in the high-dimensional space.
This lets it take advantage of semantically meaningful privileged information only accessible at training time, like object poses and bounding boxes, that allows for richer human interaction to speed up learning.
arXiv Detail & Related papers (2021-11-09T16:43:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.