Learning Generalizable Feature Fields for Mobile Manipulation
- URL: http://arxiv.org/abs/2403.07563v2
- Date: Tue, 26 Nov 2024 10:31:15 GMT
- Title: Learning Generalizable Feature Fields for Mobile Manipulation
- Authors: Ri-Zhao Qiu, Yafei Hu, Yuchen Song, Ge Yang, Yang Fu, Jianglong Ye, Jiteng Mu, Ruihan Yang, Nikolay Atanasov, Sebastian Scherer, Xiaolong Wang,
- Abstract summary: We present GeFF, a scene-level generalizable neural feature field that acts as a unified representation for both navigation and manipulation that performs in real-time.
We quantitatively evaluate GeFF's ability for open-vocabulary object-/part-level manipulation and show that GeFF outperforms point-based baselines in runtime and storage-accuracy trade-offs.
- Score: 25.155275186849558
- License:
- Abstract: An open problem in mobile manipulation is how to represent objects and scenes in a unified manner so that robots can use both for navigation and manipulation. The latter requires capturing intricate geometry while understanding fine-grained semantics, whereas the former involves capturing the complexity inherent at an expansive physical scale. In this work, we present GeFF (Generalizable Feature Fields), a scene-level generalizable neural feature field that acts as a unified representation for both navigation and manipulation that performs in real-time. To do so, we treat generative novel view synthesis as a pre-training task, and then align the resulting rich scene priors with natural language via CLIP feature distillation. We demonstrate the effectiveness of this approach by deploying GeFF on a quadrupedal robot equipped with a manipulator. We quantitatively evaluate GeFF's ability for open-vocabulary object-/part-level manipulation and show that GeFF outperforms point-based baselines in runtime and storage-accuracy trade-offs, with qualitative examples of semantics-aware navigation and articulated object manipulation.
Related papers
- SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation [49.858348469657784]
We introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner.
By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints.
arXiv Detail & Related papers (2025-02-18T18:59:02Z) - ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation? [17.356760351203715]
This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects.
We created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap.
We significantly improved part-level affordance segmentation, adapting the model's in-context segmentation capabilities to robot manipulation scenarios.
arXiv Detail & Related papers (2024-12-13T11:22:01Z) - Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models [53.22792173053473]
We introduce an interactive robotic manipulation framework called Polaris.
Polaris integrates perception and interaction by utilizing GPT-4 alongside grounded vision models.
We propose a novel Synthetic-to-Real (Syn2Real) pose estimation pipeline.
arXiv Detail & Related papers (2024-08-15T06:40:38Z) - Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
The experimental results demonstrate that MPI exhibits remarkable improvement by 10% to 64% compared with previous state-of-the-art in real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z) - Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors [30.579707929061026]
Our work explores the grounding of fine-grained part descriptors for precise manipulation in a zero-shot setting.
We tackle the problem by framing it as a dense semantic part correspondence task.
Our model returns a gripper pose for manipulating a specific part, using as reference a user-defined click from a source image of a visually different instance of the same object.
arXiv Detail & Related papers (2024-03-21T16:26:19Z) - N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields [112.02885337510716]
Nested Neural Feature Fields (N2F2) is a novel approach that employs hierarchical supervision to learn a single feature field.
We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space.
Our approach outperforms the state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization.
arXiv Detail & Related papers (2024-03-16T18:50:44Z) - What Makes Pre-Trained Visual Representations Successful for Robust
Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z) - VoxPoser: Composable 3D Value Maps for Robotic Manipulation with
Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z) - Programmatically Grounded, Compositionally Generalizable Robotic
Manipulation [35.12811184353626]
We show that the conventional pretraining-finetuning pipeline for integrating semantic representations entangles the learning of domain-specific action information.
We propose a modular approach to better leverage pretrained models by exploiting the syntactic and semantic structures of language instructions.
Our model successfully disentangles action and perception, translating to improved zero-shot and compositional generalization in a variety of manipulation behaviors.
arXiv Detail & Related papers (2023-04-26T20:56:40Z) - ALSO: Automotive Lidar Self-supervision by Occupancy estimation [70.70557577874155]
We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds.
The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled.
The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information.
arXiv Detail & Related papers (2022-12-12T13:10:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.