SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
- URL: http://arxiv.org/abs/2502.13143v1
- Date: Tue, 18 Feb 2025 18:59:02 GMT
- Title: SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
- Authors: Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi
- Abstract summary: We introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner.
By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints.
- Score: 49.858348469657784
- Abstract: Spatial intelligence is a critical component of embodied AI, enabling robots to understand and interact with their environments. While recent advances have enhanced the ability of VLMs to perceive object locations and positional relationships, they still lack the capability to precisely understand object orientations, a key requirement for tasks involving fine-grained manipulation. Addressing this limitation requires not only geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints. Extensive experiments in simulation and the real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.
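To make the interface concrete, the sketch below treats a semantic orientation as a function from an object point cloud and a language query to a unit direction, then composes it with a positional constraint into a 6-DoF target pose. It is a minimal illustration: the function names are hypothetical, and the principal-axis computation merely stands in for the model trained on OrienText300K.

```python
# Illustrative sketch only: the function names, the PCA stand-in for the learned
# orientation model, and the pose-composition step are assumptions, not SoFar's API.
import numpy as np

def semantic_orientation(points: np.ndarray, query: str) -> np.ndarray:
    """Return a unit vector for a language-described direction of an object.

    In SoFar this mapping would come from a model trained on OrienText300K;
    here the object's principal axis is a toy stand-in (the query is unused).
    """
    centered = points - points.mean(axis=0)
    # Principal component = dominant geometric axis (placeholder for the real model).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    return direction / np.linalg.norm(direction)

def align_rotation(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Smallest rotation matrix taking unit vector `src` onto unit vector `dst`."""
    v = np.cross(src, dst)
    c = float(np.dot(src, dst))
    if np.isclose(c, -1.0):  # opposite vectors: rotate 180 deg about an orthogonal axis
        axis = np.eye(3)[np.argmin(np.abs(src))]
        v = np.cross(src, axis)
        v /= np.linalg.norm(v)
        K = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
        return np.eye(3) + 2 * K @ K
    K = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + K + K @ K / (1.0 + c)

# Toy usage: a grasp target with both a positional and an orientational constraint.
object_points = np.random.rand(500, 3)               # stand-in segmented point cloud
handle_dir = semantic_orientation(object_points, "handle direction of the knife")
grasp_position = object_points.mean(axis=0)          # positional constraint (object center)
gripper_axis = np.array([0.0, 0.0, 1.0])             # approach axis in the gripper frame
rotation = align_rotation(gripper_axis, handle_dir)  # orientational constraint
target_pose = np.eye(4)
target_pose[:3, :3] = rotation
target_pose[:3, 3] = grasp_position
```

In the actual system the direction would be produced by the VLM pipeline rather than by geometry alone; the point of the sketch is that a language-named direction plugs directly into standard pose composition.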
Related papers
- Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models [79.96917782423219]
Orient Anything is the first expert and foundational model designed to estimate object orientation in a single image.
By developing a pipeline to annotate the front face of 3D objects, we collect 2M images with precise orientation annotations.
Our model achieves state-of-the-art orientation estimation accuracy in both rendered and real images.
arXiv Detail & Related papers (2024-12-24T18:58:43Z)
- GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding [53.42728468191711]
Open-Vocabulary 3D object affordance grounding aims to anticipate "action possibilities" regions on 3D objects with arbitrary instructions.
We propose GREAT (GeometRy-intEntion collAboraTive inference) for Open-Vocabulary 3D Object Affordance Grounding.
arXiv Detail & Related papers (2024-11-29T11:23:15Z)
- Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Language Models through Egocentric Instruction Tuning [7.911608620021529]
Multimodal large language models (MLLMs) act as essential interfaces, connecting humans with AI technologies in multimodal applications.
Current MLLMs face challenges in accurately interpreting object orientation in images due to inconsistent orientation annotations in training data.
We propose egocentric instruction tuning, which aligns MLLMs' orientation understanding with the user's perspective.
arXiv Detail & Related papers (2024-11-24T15:07:47Z)
- HACMan++: Spatially-Grounded Motion Primitives for Manipulation [28.411361363637006]
We introduce spatially-grounded parameterized motion primitives in our method HACMan++.
By grounding the primitives on a spatial location in the environment, our method is able to effectively generalize across object shape and pose variations.
Our approach significantly outperforms existing methods, particularly in complex scenarios demanding both high-level sequential reasoning and object generalization.
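As a rough illustration of what a spatially grounded, parameterized primitive might look like in code, the sketch below pairs a discrete primitive type with a 3D contact point and continuous motion parameters; the field names and primitive set are assumptions inferred from the summary, not the HACMan++ interface.

```python
# Minimal sketch of a "spatially-grounded parameterized motion primitive".
from dataclasses import dataclass
import numpy as np

@dataclass
class SpatialPrimitive:
    kind: str                  # e.g. "poke" or "grasp" (hypothetical primitive set)
    contact_point: np.ndarray  # 3D scene location the primitive is grounded on
    motion_params: np.ndarray  # continuous parameters (e.g. push direction + distance)

def execute(primitive: SpatialPrimitive) -> np.ndarray:
    """Return a toy end-effector target derived from the grounded primitive."""
    if primitive.kind == "poke":
        direction = primitive.motion_params[:3]
        distance = primitive.motion_params[3]
        return primitive.contact_point + distance * direction / np.linalg.norm(direction)
    return primitive.contact_point  # e.g. "grasp": move to the contact point itself

# Grounding the primitive on a scene point is what lets it transfer across object shapes and poses.
poke = SpatialPrimitive("poke", np.array([0.4, 0.0, 0.1]), np.array([1.0, 0.0, 0.0, 0.05]))
target = execute(poke)
```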
arXiv Detail & Related papers (2024-07-11T15:10:14Z)
- GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane [53.388937705785025]
3D open-vocabulary scene understanding is crucial for advancing augmented reality and robotic applications.
We introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS).
Our method treats the feature selection process as a hyperplane division within the feature space, retaining only features that are highly relevant to the query.
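The hyperplane view of feature selection can be sketched as follows: a normal vector derived from the text query splits the feature space, and only Gaussians on its positive side are kept. The shapes, the quantile-based offset, and all names here are illustrative assumptions rather than GOI's published procedure.

```python
# Sketch of selecting "Gaussians of interest" with a feature-space hyperplane.
import numpy as np

def select_gaussians(features: np.ndarray, query_embedding: np.ndarray,
                     keep_fraction: float = 0.05) -> np.ndarray:
    """Return a boolean mask over Gaussians using a hyperplane w.f + b > 0."""
    w = query_embedding / np.linalg.norm(query_embedding)  # hyperplane normal from the text query
    scores = features @ w
    b = -np.quantile(scores, 1.0 - keep_fraction)          # offset so ~keep_fraction lies on the positive side
    return scores + b > 0

# Toy usage: 10k Gaussians with 512-dim semantic features distilled from a 2D VLM.
features = np.random.randn(10_000, 512)
query = np.random.randn(512)   # stand-in for a CLIP-style text embedding
mask = select_gaussians(features, query)
```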
arXiv Detail & Related papers (2024-05-27T18:57:18Z)
- Object-Centric Instruction Augmentation for Robotic Manipulation [29.491990994901666]
We introduce the Object-Centric Instruction Augmentation (OCI) framework to augment highly semantic and information-dense language instructions with position cues.
We utilize a Multi-modal Large Language Model (MLLM) to weave knowledge of object locations into natural language instructions.
We demonstrate that robotic manipulator imitation policies trained with our enhanced instructions outperform those relying solely on traditional language instructions.
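A toy version of such instruction augmentation is sketched below; the prompt template and detection format are hypothetical stand-ins, not the prompts used in the paper.

```python
# Toy sketch of weaving object position cues into a language instruction.
from typing import Dict, Tuple

def augment_instruction(instruction: str,
                        detections: Dict[str, Tuple[float, float, float]]) -> str:
    """Append grounded position cues (here, 3D coordinates in meters) to the instruction."""
    cues = "; ".join(f"{name} is at ({x:.2f}, {y:.2f}, {z:.2f})"
                     for name, (x, y, z) in detections.items())
    return f"{instruction} [Object cues: {cues}]"

print(augment_instruction(
    "put the red mug on the shelf",
    {"red mug": (0.42, -0.10, 0.05), "shelf": (0.60, 0.30, 0.45)},
))
```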
arXiv Detail & Related papers (2024-01-05T13:54:45Z)
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
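A heavily simplified sketch of the value-map idea follows: an affordance map and an obstacle-avoidance map are composed on a voxel grid, and waypoints are extracted greedily. The grid resolution, the additive composition, and the greedy planner are assumptions made for brevity, not VoxPoser's actual LLM-driven map synthesis or motion planner.

```python
# Minimal sketch of composing 3D value maps and greedily extracting waypoints.
import numpy as np

def make_value_map(shape, target_idx, obstacle_idx):
    """Affordance map pulls toward the target; avoidance map penalizes obstacles."""
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in shape], indexing="ij"), axis=-1)
    affordance = -np.linalg.norm(grid - np.asarray(target_idx), axis=-1)
    avoidance = -5.0 * (np.linalg.norm(grid - np.asarray(obstacle_idx), axis=-1) < 3)
    return affordance + avoidance  # composed value map

def greedy_path(value, start_idx, steps=50):
    """Walk to the best-valued neighboring voxel at each step."""
    pos, path = np.array(start_idx), [tuple(start_idx)]
    moves = np.array([[1,0,0],[-1,0,0],[0,1,0],[0,-1,0],[0,0,1],[0,0,-1]])
    for _ in range(steps):
        neighbors = [pos + d for d in moves
                     if np.all((pos + d >= 0) & (pos + d < value.shape))]
        best = max(neighbors, key=lambda n: value[tuple(n)])
        if value[tuple(best)] <= value[tuple(pos)]:
            break  # no neighbor improves the composed value
        pos = best
        path.append(tuple(pos))
    return path

value_map = make_value_map((20, 20, 20), target_idx=(15, 15, 5), obstacle_idx=(10, 10, 5))
waypoints = greedy_path(value_map, start_idx=(2, 2, 5))
```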
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
- CLIPort: What and Where Pathways for Robotic Manipulation [35.505615833638124]
We present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding of CLIP with the spatial precision of Transporter.
Our framework is capable of solving a variety of language-specified tabletop tasks without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures.
arXiv Detail & Related papers (2021-09-24T17:44:28Z)
- Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language-conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.