Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors
- URL: http://arxiv.org/abs/2403.14526v1
- Date: Thu, 21 Mar 2024 16:26:19 GMT
- Title: Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors
- Authors: Nikolaos Tsagkas, Jack Rome, Subramanian Ramamoorthy, Oisin Mac Aodha, Chris Xiaoxuan Lu
- Abstract summary: Our work explores the grounding of fine-grained part descriptors for precise manipulation in a zero-shot setting.
We tackle the problem by framing it as a dense semantic part correspondence task.
Our model returns a gripper pose for manipulating a specific part, using as reference a user-defined click from a source image of a visually different instance of the same object.
- Score: 30.579707929061026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Precise manipulation that is generalizable across scenes and objects remains a persistent challenge in robotics. Current approaches for this task heavily depend on having a significant number of training instances to handle objects with pronounced visual and/or geometric part ambiguities. Our work explores the grounding of fine-grained part descriptors for precise manipulation in a zero-shot setting by utilizing web-trained text-to-image diffusion-based generative models. We tackle the problem by framing it as a dense semantic part correspondence task. Our model returns a gripper pose for manipulating a specific part, using as reference a user-defined click from a source image of a visually different instance of the same object. We require no manual grasping demonstrations as we leverage the intrinsic object geometry and features. Practical experiments in a real-world tabletop scenario validate the efficacy of our approach, demonstrating its potential for advancing semantic-aware robotics manipulation. Web page: https://tsagkas.github.io/click2grasp
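The paper grounds the clicked part by comparing dense visual diffusion descriptors across object instances. As a rough, hedged illustration of that dense-correspondence step (not the authors' released code), the sketch below matches a clicked source pixel to a target image by cosine similarity over per-pixel descriptors, then back-projects the match to a 3D point using a depth map and pinhole intrinsics; `src_feats` and `tgt_feats` are placeholders for whatever dense descriptor maps (diffusion-based or otherwise) are extracted upstream.

```python
import numpy as np

def match_click(src_feats, tgt_feats, click_uv):
    """Match a clicked source pixel to the target image by cosine
    similarity over dense per-pixel descriptors.
    src_feats, tgt_feats: (H, W, D) descriptor maps.
    click_uv: (u, v) pixel clicked in the source image.
    Returns the best-matching (u, v) in the target image."""
    u, v = click_uv
    q = src_feats[v, u]
    q = q / (np.linalg.norm(q) + 1e-8)
    t = tgt_feats / (np.linalg.norm(tgt_feats, axis=-1, keepdims=True) + 1e-8)
    sim = t @ q                                     # (H, W) cosine similarity map
    v_t, u_t = np.unravel_index(np.argmax(sim), sim.shape)
    return u_t, v_t

def backproject(u, v, depth, K):
    """Lift a target pixel to a 3D point in the camera frame using the
    depth map and pinhole intrinsics K (3x3)."""
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])
```

In the full system, multiple clicks and views would be fused and a complete gripper pose derived from the local object geometry; this sketch only recovers a single 3D contact point.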
Related papers
- Articulated Object Manipulation using Online Axis Estimation with SAM2-Based Tracking [59.87033229815062]
Articulated object manipulation requires precise object interaction, where the object's axis must be carefully considered.
Previous research employed interactive perception for manipulating articulated objects, but such open-loop approaches often overlook the interaction dynamics.
We present a closed-loop pipeline integrating interactive perception with online axis estimation from segmented 3D point clouds.
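As a hedged illustration of the axis-estimation idea (not the paper's SAM2-based pipeline), the sketch below fits a rigid transform between two snapshots of a segmented part's 3D points with the Kabsch algorithm and reads the revolute axis off the rotation matrix; known point correspondences are assumed.

```python
import numpy as np

def estimate_rotation_axis(points_t0, points_t1):
    """Minimal revolute-axis estimate from two snapshots (N, 3) of the same
    segmented part's 3D points, assuming known correspondences.
    Fits a rigid transform (Kabsch) and returns the rotation axis."""
    c0, c1 = points_t0.mean(0), points_t1.mean(0)
    H = (points_t0 - c0).T @ (points_t1 - c1)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # fix a possible reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    # The rotation axis is the eigenvector of R with eigenvalue 1.
    w, V = np.linalg.eig(R)
    axis = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return axis / np.linalg.norm(axis)
```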
arXiv Detail & Related papers (2024-09-24T17:59:56Z)
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation.
Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps, given a goal.
We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
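A minimal sketch of the "track prediction plus residual policy" composition described above, using placeholder callables rather than the Track2Act interface:

```python
import numpy as np

class ResidualController:
    """Sketch of combining an open-loop plan derived from predicted point
    tracks with a learned residual correction. Names are placeholders,
    not the paper's API."""

    def __init__(self, track_planner, residual_policy):
        self.track_planner = track_planner      # (image, goal) -> base actions
        self.residual_policy = residual_policy  # observation -> small correction

    def act(self, image, goal, observation, step):
        base_actions = self.track_planner(image, goal)     # e.g. (T, action_dim)
        base = base_actions[min(step, len(base_actions) - 1)]
        delta = self.residual_policy(observation)          # closed-loop correction
        return base + delta
```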
arXiv Detail & Related papers (2024-05-02T17:56:55Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
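One simple way to quantify "emergent segmentation ability" (an illustrative metric choice, not necessarily the paper's exact protocol) is to threshold a ViT attention or saliency map and score it against a ground-truth object mask with the Jaccard index:

```python
import numpy as np

def segmentation_score(attn_map, gt_mask, threshold=0.5):
    """Jaccard index between a thresholded attention/saliency map and a
    ground-truth object mask, a simple proxy for emergent segmentation.
    attn_map: (H, W) values in [0, 1]; gt_mask: (H, W) boolean."""
    pred = attn_map >= threshold
    inter = np.logical_and(pred, gt_mask).sum()
    union = np.logical_or(pred, gt_mask).sum()
    return inter / union if union > 0 else 0.0
```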
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- AnyDoor: Zero-shot Object-level Image Customization [63.44307304097742]
This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations.
Our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage.
arXiv Detail & Related papers (2023-07-18T17:59:02Z)
- One-shot Imitation Learning via Interaction Warping [32.5466340846254]
We propose a new method, Interaction Warping, for learning SE(3) robotic manipulation policies from a single demonstration.
We infer the 3D mesh of each object in the environment using shape warping, a technique for aligning point clouds across object instances.
We show successful one-shot imitation learning on three simulated and real-world object re-arrangement tasks.
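Shape warping itself is more involved; as a hedged stand-in, the sketch below transfers an annotated point (e.g. a grasp location) from a source instance to a target instance by nearest neighbour after centring and scale-normalising both point clouds:

```python
import numpy as np

def transfer_point(src_points, tgt_points, src_anchor):
    """Crude stand-in for shape warping: centre and scale-normalise both
    clouds, then map an annotated source point to its nearest neighbour
    in the target instance's cloud.
    src_points, tgt_points: (N, 3); src_anchor: (3,)."""
    src_c = src_points.mean(0)
    src_s = np.linalg.norm(src_points - src_c, axis=1).max()
    tgt_c = tgt_points.mean(0)
    tgt_s = np.linalg.norm(tgt_points - tgt_c, axis=1).max()
    tgt_n = (tgt_points - tgt_c) / tgt_s
    anchor_n = (src_anchor - src_c) / src_s
    idx = np.argmin(np.linalg.norm(tgt_n - anchor_n, axis=1))
    return tgt_points[idx]          # corresponding point on the target instance
```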
arXiv Detail & Related papers (2023-06-21T17:26:11Z)
- Affordance Diffusion: Synthesizing Hand-Object Interactions [81.98499943996394]
Given an RGB image of an object, we aim to hallucinate plausible images of a human hand interacting with it.
We propose a two-step generative approach: a LayoutNet that samples an articulation-agnostic hand-object-interaction layout, and a ContentNet that synthesizes images of a hand grasping the object.
arXiv Detail & Related papers (2023-03-21T17:59:10Z)
- DisPositioNet: Disentangled Pose and Identity in Semantic Image Manipulation [83.51882381294357]
DisPositioNet is a model that learns a disentangled representation for each object for the task of image manipulation using scene graphs.
Our framework enables the disentanglement of the variational latent embeddings as well as the feature representation in the graph.
arXiv Detail & Related papers (2022-11-10T11:47:37Z)
- Task-Focused Few-Shot Object Detection for Robot Manipulation [1.8275108630751844]
We develop a manipulation method based solely on detection, then introduce task-focused few-shot object detection to learn new objects and settings.
In experiments with our interactive approach to few-shot learning, we train a robot to manipulate objects directly from detection (ClickBot).
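A minimal sketch of "manipulation directly from detection" under simple assumptions (calibrated pinhole camera, aligned depth map), not the ClickBot system itself: convert a detected bounding box into a 3D grasp point.

```python
import numpy as np

def bbox_to_grasp_point(bbox, depth, K):
    """Turn a detected box (x_min, y_min, x_max, y_max) into a 3D grasp
    point: take the box centre, use the median valid depth inside the box,
    and back-project with pinhole intrinsics K (3x3)."""
    x0, y0, x1, y1 = [int(v) for v in bbox]
    u, v = (x0 + x1) // 2, (y0 + y1) // 2
    patch = depth[y0:y1, x0:x1]
    z = np.median(patch[patch > 0])        # robust to missing depth pixels
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])
```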
arXiv Detail & Related papers (2022-01-28T21:52:05Z)
- Ab Initio Particle-based Object Manipulation [22.78939235155233]
Particle-based Object Manipulation (Prompt) is a new approach to robot manipulation of novel objects ab initio.
Prompt combines the benefits of both model-based reasoning and data-driven learning.
Prompt successfully handles a variety of everyday objects, some of which are transparent.
arXiv Detail & Related papers (2021-07-19T13:27:00Z)
- "What's This?" -- Learning to Segment Unknown Objects from Manipulation Sequences [27.915309216800125]
We present a novel framework for self-supervised grasped object segmentation with a robotic manipulator.
We propose a single, end-to-end trainable architecture which jointly incorporates motion cues and semantic knowledge.
Our method depends neither on any visual registration of a kinematic robot or 3D object models, nor on precise hand-eye calibration or any additional sensor data.
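The motion cue alone can be sketched with off-the-shelf dense optical flow (an illustrative simplification, not the paper's jointly trained architecture that also incorporates semantic knowledge):

```python
import cv2
import numpy as np

def motion_cue_mask(frame_prev, frame_next, mag_thresh=1.0):
    """Motion cue only: pixels whose dense optical flow magnitude exceeds
    a threshold. Frames are BGR images from the manipulation sequence."""
    g0 = cv2.cvtColor(frame_prev, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame_next, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=-1)
    return mag > mag_thresh            # boolean mask of moving pixels
```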
arXiv Detail & Related papers (2020-11-06T10:55:28Z)
- Self-Supervised Object-in-Gripper Segmentation from Robotic Motions [27.915309216800125]
We propose a robust solution for learning to segment unknown objects grasped by a robot.
We exploit motion and temporal cues in RGB video sequences.
Our approach is fully self-supervised and independent of precise camera calibration, 3D models or potentially imperfect depth data.
arXiv Detail & Related papers (2020-02-11T15:44:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.