Learning Semantic-Geometric Task Graph-Representations from Human Demonstrations
- URL: http://arxiv.org/abs/2601.11460v1
- Date: Fri, 16 Jan 2026 17:35:00 GMT
- Title: Learning Semantic-Geometric Task Graph-Representations from Human Demonstrations
- Authors: Franziska Herbert, Vignesh Prasad, Han Liu, Dorothea Koert, Georgia Chalvatzaki,
- Abstract summary: We introduce a semantic-geometric task graph-representation that encodes object identities, inter-object relations, and their temporal geometric evolution from human demonstrations. We show that semantic-geometric task graph-representations are particularly beneficial for tasks with high action and object variability.
- Score: 16.68801520494275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning structured task representations from human demonstrations is essential for understanding long-horizon manipulation behaviors, particularly in bimanual settings where action ordering, object involvement, and interaction geometry can vary significantly. A key challenge lies in jointly capturing the discrete semantic structure of tasks and the temporal evolution of object-centric geometric relations in a form that supports reasoning over task progression. In this work, we introduce a semantic-geometric task graph-representation that encodes object identities, inter-object relations, and their temporal geometric evolution from human demonstrations. Building on this formulation, we propose a learning framework that combines a Message Passing Neural Network (MPNN) encoder with a Transformer-based decoder, decoupling scene representation learning from action-conditioned reasoning about task progression. The encoder operates solely on temporal scene graphs to learn structured representations, while the decoder conditions on action-context to predict future action sequences, associated objects, and object motions over extended time horizons. Through extensive evaluation on human demonstration datasets, we show that semantic-geometric task graph-representations are particularly beneficial for tasks with high action and object variability, where simpler sequence-based models struggle to capture task progression. Finally, we demonstrate that task graph representations can be transferred to a physical bimanual robot and used for online action selection, highlighting their potential as reusable task abstractions for downstream decision-making in manipulation systems.
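To make the described encoder/decoder split concrete, the sketch below pairs a hand-rolled message-passing encoder over per-frame scene graphs with an action-conditioned Transformer decoder. It is a minimal illustration of the idea in the abstract, not the authors' implementation: the module names, dimensions, dense-adjacency message passing, mean-pooling of nodes, and action vocabulary size are all assumptions.

```python
import torch
import torch.nn as nn

class MPNNLayer(nn.Module):
    """One round of message passing over a dense adjacency matrix (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, h, adj):                           # h: (N, dim), adj: (N, N) relation mask
        dst = h.unsqueeze(1).expand(-1, h.size(0), -1)   # receiver state for every node pair
        src = h.unsqueeze(0).expand(h.size(0), -1, -1)   # sender state for every node pair
        m = self.msg(torch.cat([dst, src], dim=-1)) * adj.unsqueeze(-1)
        return self.upd(m.sum(dim=1), h)                 # aggregate incoming messages, GRU update

class SceneGraphEncoder(nn.Module):
    """Encodes a temporal sequence of scene graphs into one embedding per frame."""
    def __init__(self, node_feat_dim, dim=128, rounds=3):
        super().__init__()
        self.embed = nn.Linear(node_feat_dim, dim)
        self.layers = nn.ModuleList(MPNNLayer(dim) for _ in range(rounds))

    def forward(self, node_feats, adjs):                 # lists of per-frame tensors
        frame_embs = []
        for x, adj in zip(node_feats, adjs):
            h = self.embed(x)
            for layer in self.layers:
                h = layer(h, adj)
            frame_embs.append(h.mean(dim=0))             # pool nodes into a graph embedding
        return torch.stack(frame_embs)                   # (T, dim)

class ActionConditionedDecoder(nn.Module):
    """Transformer decoder conditioned on action-context tokens; predicts next-action
    logits. Object and motion heads could be attached to the same decoder output."""
    def __init__(self, dim=128, num_actions=20, heads=4, layers=2):
        super().__init__()
        self.action_emb = nn.Embedding(num_actions, dim)
        block = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(block, num_layers=layers)
        self.action_head = nn.Linear(dim, num_actions)

    def forward(self, action_context, graph_embs):       # (B, L) action ids, (B, T, dim) memory
        out = self.decoder(tgt=self.action_emb(action_context), memory=graph_embs)
        return self.action_head(out)                     # (B, L, num_actions)
```

In this reading, a demonstration would be encoded frame by frame into a (T, dim) sequence that serves as decoder memory (with a batch dimension added), while the decoder consumes the action context and predicts the upcoming action sequence over the horizon.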
Related papers
- Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning [21.142247150423863]
We propose an object-centric encoder that performs semantic segmentation and visual representation generation in a coupled manner. To achieve this, we leverage the Slot Attention mechanism and use the SOLV model, pretrained on large out-of-domain datasets. We show that exploiting models pretrained on out-of-domain datasets can benefit this process, and that fine-tuning on datasets depicting human actions can significantly improve performance.
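The coupled segmentation-and-representation pipeline and the pretrained SOLV model are not reproduced here; as a rough reference, the following is a minimal Slot Attention module (after Locatello et al., 2020) of the kind such an object-centric encoder builds on, with illustrative dimensions and iteration counts.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Minimal Slot Attention sketch; dims, slot count, and iterations are illustrative."""
    def __init__(self, num_slots=7, dim=64, iters=3, eps=1e-8):
        super().__init__()
        self.num_slots, self.iters, self.eps = num_slots, iters, eps
        self.scale = dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm_in, self.norm_slots, self.norm_mlp = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, feats):                              # feats: (B, N, dim) visual features
        B, N, D = feats.shape
        feats = self.norm_in(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(B, self.num_slots, D, device=feats.device)
        for _ in range(self.iters):
            slots_prev = slots
            q = self.to_q(self.norm_slots(slots))
            attn = torch.softmax(torch.einsum('bnd,bkd->bnk', k, q) * self.scale, dim=-1)  # slots compete per input
            attn = attn / (attn.sum(dim=1, keepdim=True) + self.eps)                       # normalize over inputs
            updates = torch.einsum('bnk,bnd->bkd', attn, v)                                # per-slot weighted mean
            slots = self.gru(updates.reshape(-1, D), slots_prev.reshape(-1, D)).reshape(B, self.num_slots, D)
            slots = slots + self.mlp(self.norm_mlp(slots))
        return slots                                       # one latent per object candidate
```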
arXiv Detail & Related papers (2025-05-27T09:56:52Z)
- Temporal Representation Alignment: Successor Features Enable Emergent Compositionality in Robot Instruction Following [50.377287115281476]
We show that learning to associate the representations of current and future states with a temporal loss can improve compositional generalization. We evaluate our approach across diverse robotic manipulation tasks as well as in simulation, showing substantial improvements for tasks specified with either language or goal images.
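The exact successor-feature objective is not given in this summary; as a hedged stand-in, the sketch below shows one common way to associate current and future state representations with a temporal loss, using an InfoNCE-style alignment between paired state encodings.

```python
import torch
import torch.nn.functional as F

def temporal_alignment_loss(phi_s, phi_future, temperature=0.1):
    """Pull the representation of a state toward the representation of a state reached
    later in the same trajectory, treating other batch elements as negatives.
    phi_s, phi_future: (B, D) encoder outputs for paired (s_t, s_{t+k}).
    This is an illustrative stand-in, not the paper's exact objective."""
    phi_s = F.normalize(phi_s, dim=-1)
    phi_future = F.normalize(phi_future, dim=-1)
    logits = phi_s @ phi_future.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(phi_s.size(0), device=phi_s.device)
    return F.cross_entropy(logits, targets)                # match each state to its own future
```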
arXiv Detail & Related papers (2025-02-08T05:26:29Z)
- Semantic-Geometric-Physical-Driven Robot Manipulation Skill Transfer via Skill Library and Tactile Representation [6.324290412766366]
We propose a knowledge graph-based skill library construction method to organize manipulation knowledge. We also propose a novel hierarchical skill transfer framework based on the skill library and tactile representation. Experiments demonstrate the skill transfer and adaptability capabilities of the proposed methods.
arXiv Detail & Related papers (2024-11-18T16:42:07Z)
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms the representative models regarding objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- Unsupervised Task Graph Generation from Instructional Video Transcripts [53.54435048879365]
We consider a setting where text transcripts of instructional videos performing a real-world activity are provided.
The goal is to identify the key steps relevant to the task as well as the dependency relationship between these key steps.
We propose a novel task graph generation approach that combines the reasoning capabilities of instruction-tuned language models along with clustering and ranking components.
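The LM-based reasoning and ranking components are not reproduced here; the sketch below illustrates only a clustering-and-ordering pipeline over precomputed step embeddings, with the cluster count and edge rule as assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_task_graph(step_embeddings, transcript_ids, positions, n_key_steps=8):
    """Cluster sentence embeddings of transcript steps into key steps, then add a
    directed edge whenever one key step is observed immediately before another in a
    transcript. A hedged sketch, not the paper's instruction-tuned-LM method.
    step_embeddings: (N, D), transcript_ids: (N,), positions: (N,) step index."""
    labels = KMeans(n_clusters=n_key_steps, n_init=10).fit_predict(step_embeddings)
    edges = set()
    for t in np.unique(transcript_ids):
        mask = transcript_ids == t
        order = labels[mask][np.argsort(positions[mask])]   # key steps in observed order
        for i in range(len(order) - 1):
            if order[i] != order[i + 1]:
                edges.add((int(order[i]), int(order[i + 1])))
    return list(range(n_key_steps)), sorted(edges)          # nodes, directed dependency edges
```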
arXiv Detail & Related papers (2023-02-17T22:50:08Z)
- Sequential Manipulation Planning on Scene Graph [90.28117916077073]
We devise a 3D scene graph representation, contact graph+ (cg+), for efficient sequential task planning.
Goal configurations, naturally specified on contact graphs, can be produced by a genetic algorithm with an optimization method.
A task plan is then derived by computing the Graph Editing Distance (GED) between the initial contact graphs and the goal configurations, which yields graph edit operations corresponding to possible robot actions.
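As a small illustration of the GED step (not the cg+ representation or the genetic-algorithm goal generation), the networkx snippet below computes a minimum-cost edit path between a toy initial contact graph and a toy goal configuration; the objects, labels, and the reading of edge edits as pick-and-place candidates are assumptions.

```python
import networkx as nx

# Toy contact graphs: nodes are objects with a semantic label, edges are contacts.
initial = nx.Graph()
initial.add_nodes_from([("cup", {"label": "cup"}), ("table", {"label": "table"})])
initial.add_edge("cup", "table")                       # cup resting on the table

goal = nx.Graph()
goal.add_nodes_from([("cup", {"label": "cup"}), ("tray", {"label": "tray"}),
                     ("table", {"label": "table"})])
goal.add_edge("cup", "tray")                           # cup should end up on the tray
goal.add_edge("tray", "table")

same_label = lambda a, b: a.get("label") == b.get("label")
paths, cost = nx.optimal_edit_paths(initial, goal, node_match=same_label)
node_ops, edge_ops = paths[0]                          # one minimum-cost edit path
print("GED cost:", cost)
print("edge edits (candidate pick-and-place actions):", edge_ops)
```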
arXiv Detail & Related papers (2022-07-10T02:01:33Z)
- Learning Sensorimotor Primitives of Sequential Manipulation Tasks from Visual Demonstrations [13.864448233719598]
This paper describes a new neural network-based framework for simultaneously learning low-level and high-level policies.
A key feature of the proposed approach is that the policies are learned directly from raw videos of task demonstrations.
Empirical results on object manipulation tasks with a robotic arm show that the proposed network can efficiently learn from real visual demonstrations to perform the tasks.
arXiv Detail & Related papers (2022-03-08T01:36:48Z)
- Generalizable task representation learning from human demonstration videos: a geometric approach [4.640835690336654]
We study the problem of generalizable task learning from human demonstration videos without extra training on the robot or pre-recorded robot motions.
We propose CoVGS-IL, which uses a graph-structured task function to learn task representations under structural constraints.
arXiv Detail & Related papers (2022-02-28T08:25:57Z)
- Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
- Modeling Long-horizon Tasks as Sequential Interaction Landscapes [75.5824586200507]
We present a deep learning network that learns dependencies and transitions across subtasks solely from a set of demonstration videos.
We show that subtask symbols can be learned and predicted directly from image observations.
We evaluate our framework on two long horizon tasks: (1) block stacking of puzzle pieces being executed by humans, and (2) a robot manipulation task involving pick and place of objects and sliding a cabinet door with a 7-DoF robot arm.
arXiv Detail & Related papers (2020-06-08T18:07:18Z)
- Taskology: Utilizing Task Relations at Scale [28.09712466727001]
We show that we can leverage the inherent relationships among collections of tasks, as they are trained jointly.
Explicitly utilizing the relationships between tasks improves their performance while dramatically reducing the need for labeled data.
We demonstrate our framework on subsets of the following collection of tasks: depth and normal prediction, semantic segmentation, 3D motion and ego-motion estimation, and object tracking and 3D detection in point clouds.
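The framework's specific consistency constraints are not spelled out in this summary; as one hedged example of the kind of cross-task relation it could exploit, the snippet below penalizes disagreement between a predicted normal map and the normals implied by a predicted depth map (simplified, ignoring camera intrinsics).

```python
import torch
import torch.nn.functional as F

def depth_normal_consistency(depth, pred_normals):
    """Cross-task consistency between a depth head and a normal head (illustrative).
    depth: (B, 1, H, W) predicted depth, pred_normals: (B, 3, H, W) unit normals."""
    dz_dx = depth[:, :, :, 1:] - depth[:, :, :, :-1]      # horizontal finite difference
    dz_dy = depth[:, :, 1:, :] - depth[:, :, :-1, :]      # vertical finite difference
    dz_dx = F.pad(dz_dx, (0, 1, 0, 0))                    # pad back to (H, W)
    dz_dy = F.pad(dz_dy, (0, 0, 0, 1))
    n = torch.cat([-dz_dx, -dz_dy, torch.ones_like(depth)], dim=1)
    n = F.normalize(n, dim=1)                             # normals implied by the depth map
    return (1.0 - (n * pred_normals).sum(dim=1)).mean()   # mean cosine mismatch
```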
arXiv Detail & Related papers (2020-05-14T22:53:46Z)