Related papers: Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds

Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds

URL: http://arxiv.org/abs/2505.14366v1
Date: Tue, 20 May 2025 13:49:09 GMT
Title: Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds
Authors: Joel Currie, Gioele Migno, Enrico Piacenti, Maria Elena Giannaccini, Patric Bach, Davide De Tommaso, Agnieszka Wykowska,
Abstract summary: We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT)<n>We introduce a synthetic dataset, generated in NVIDIA Omniverse, that enables supervised learning for spatial reasoning tasks.<n>This work serves as a foundational step toward embodied AI systems capable of spatial understanding in interactive human-robot scenarios.
Score: 1.696186398088554
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability for embodied cognition essential for Human-Robot Interaction (HRI). As a first step toward this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse, that enables supervised learning for spatial reasoning tasks. Each instance includes an RGB image, a natural language description, and a ground-truth 4X4 transformation matrix representing object pose. We focus on inferring Z-axis distance as a foundational skill, with future extensions targeting full 6 Degrees Of Freedom (DOFs) reasoning. The dataset is publicly available to support further research. This work serves as a foundational step toward embodied AI systems capable of spatial understanding in interactive human-robot scenarios.

Related papers

SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes [105.8644620467576]
We introduce Stextscurprise3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes.<n>Stextscurprise3D consists of more than 200k vision language pairs across 900+ detailed indoor scenes from ScanNet++ v2.<n>The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object name.
arXiv Detail & Related papers (2025-07-10T14:01:24Z)
Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models [60.675955082094944]
We present IVE, an agentic exploration framework inspired by human curiosity.<n>We evaluate IVE in both simulated and real-world tabletop environments.
arXiv Detail & Related papers (2025-05-12T17:59:11Z)
EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation [55.26713167507132]
We introduce a generative robotics foundation model that constructs and interprets embodied spaces.<n>EnerVerse employs an autoregressive video diffusion framework to predict future embodied spaces from instructions, enhanced by a sparse context memory for long-term reasoning.<n>We present EnerVerse-D, a data engine pipeline combining the generative model with 4D Gaussian Splatting, forming a self-reinforcing data loop to reduce the sim-to-real gap.
arXiv Detail & Related papers (2025-01-03T17:00:33Z)
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics [26.42651735582044]
We introduce RoboSpatial, a large-scale dataset for spatial understanding in robotics.<n>It consists of real indoor and tabletop scenes, captured as 3D scans and egocentric images, and annotated with rich spatial information relevant to robotics.<n>Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robot manipulation.
arXiv Detail & Related papers (2024-11-25T16:21:34Z)
Structured Spatial Reasoning with Open Vocabulary Object Detectors [2.089191490381739]
Reasoning about spatial relationships between objects is essential for many real-world robotic tasks. We introduce a structured probabilistic approach that integrates rich 3D geometric features with state-of-the-art open-vocabulary object detectors. The approach is evaluated and compared against zero-shot performance of the state-of-the-art Vision and Language Models (VLMs) on spatial reasoning tasks.
arXiv Detail & Related papers (2024-10-09T19:37:01Z)
Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models [53.22792173053473]
We introduce an interactive robotic manipulation framework called Polaris. Polaris integrates perception and interaction by utilizing GPT-4 alongside grounded vision models. We propose a novel Synthetic-to-Real (Syn2Real) pose estimation pipeline.
arXiv Detail & Related papers (2024-08-15T06:40:38Z)
Space-LLaVA: a Vision-Language Model Adapted to Extraterrestrial Applications [14.89043819048682]
We see three core challenges in the future of space robotics that motivate building FM for space robotics.<n>As a firststep towards a space foundation model model, we augment three extraterrestrial databases with fine-grained annotations.<n>We fine-tune a Vision-Language Model to adapt to the semantic features in an extraterrestrial environment.
arXiv Detail & Related papers (2024-08-12T05:07:24Z)
MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions. Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world. We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
Teaching Unknown Objects by Leveraging Human Gaze and Augmented Reality in Human-Robot Interaction [3.1473798197405953]
This dissertation aims to teach a robot unknown objects in the context of Human-Robot Interaction (HRI) The combination of eye tracking and Augmented Reality created a powerful synergy that empowered the human teacher to communicate with the robot. The robot's object detection capabilities exhibited comparable performance to state-of-the-art object detectors trained on extensive datasets.
arXiv Detail & Related papers (2023-12-12T11:34:43Z)
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation. We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects. We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
Visual Affordance Prediction for Guiding Robot Exploration [56.17795036091848]
We develop an approach for learning visual affordances for guiding robot exploration. We use a Transformer-based model to learn a conditional distribution in the latent embedding space of a VQ-VAE. We show how the trained affordance model can be used for guiding exploration by acting as a goal-sampling distribution, during visual goal-conditioned policy learning in robotic manipulation.
arXiv Detail & Related papers (2023-05-28T17:53:09Z)
INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with human through natural language and grasps a specified object in clutter. We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping. We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.