ImageManip: Image-based Robotic Manipulation with Affordance-guided Next View Selection
- URL: http://arxiv.org/abs/2310.09069v1
- Date: Fri, 13 Oct 2023 12:42:54 GMT
- Title: ImageManip: Image-based Robotic Manipulation with Affordance-guided Next View Selection
- Authors: Xiaoqi Li, Yanzi Wang, Yan Shen, Ponomarenko Iaroslav, Haoran Lu,
Qianxu Wang, Boshi An, Jiaming Liu, Hao Dong
- Abstract summary: 3D articulated object manipulation is essential for enabling robots to interact with their environment.
Many existing studies make use of 3D point clouds as the primary input for manipulation policies.
RGB images offer high-resolution observations using cost-effective devices but lack spatial 3D geometric information.
This framework is designed to capture multiple perspectives of the target object and infer depth information to complement its geometry.
- Score: 10.162882793554191
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the realm of future home-assistant robots, 3D articulated object
manipulation is essential for enabling robots to interact with their
environment. Many existing studies make use of 3D point clouds as the primary
input for manipulation policies. However, this approach encounters challenges
due to data sparsity and the significant cost associated with acquiring point
cloud data, which can limit its practicality. In contrast, RGB images offer
high-resolution observations using cost-effective devices but lack spatial 3D
geometric information. To overcome these limitations, we present a novel
image-based robotic manipulation framework. This framework is designed to
capture multiple perspectives of the target object and infer depth information
to complement its geometry. Initially, the system employs an eye-on-hand RGB
camera to capture an overall view of the target object. It predicts an initial
depth map and a coarse affordance map. The affordance map indicates actionable
areas on the object and serves as a constraint for selecting subsequent
viewpoints. Based on the global visual prior, we adaptively identify the
optimal next viewpoint for a detailed observation of the potential manipulation
success area. We leverage geometric consistency to fuse the views, resulting in
a refined depth map and a more precise affordance map for robot manipulation
decisions. By comparing with prior works that adopt point clouds or RGB images
as inputs, we demonstrate the effectiveness and practicality of our method. In
the project webpage (https://sites.google.com/view/imagemanip), real world
experiments further highlight the potential of our method for practical
deployment.
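The pipeline described above (global view, coarse depth and affordance prediction, affordance-constrained next-view selection, geometric fusion) can be summarized in a short sketch. The Python below is a minimal illustration only, not the authors' implementation: the networks, the viewpoint-scoring heuristic, and the fusion rule are placeholder assumptions standing in for the learned components described in the abstract.

```python
import numpy as np

# Hypothetical stand-ins for the learned components in the abstract: a
# depth-prediction network and an affordance-prediction network. They are
# random placeholders here so the sketch runs end to end.
def predict_depth(rgb):
    return np.random.rand(*rgb.shape[:2])      # H x W depth map

def predict_affordance(rgb, depth):
    return np.random.rand(*rgb.shape[:2])      # H x W actionable-area scores in [0, 1]

def render_view(viewpoint):
    """Placeholder for moving the eye-on-hand camera and grabbing an RGB frame."""
    return np.random.rand(128, 128, 3)

def expected_affordance_coverage(viewpoint, affordance, depth):
    """Toy score for a candidate viewpoint: how much high-affordance area it is
    expected to observe. A real system would reproject the coarse maps into the
    candidate view using camera geometry."""
    weight = np.exp(-np.linalg.norm(viewpoint))
    return weight * affordance[affordance > 0.5].sum()

# 1) Global view: coarse depth and affordance from a single RGB observation.
rgb0 = render_view(np.zeros(3))
depth0 = predict_depth(rgb0)
aff0 = predict_affordance(rgb0, depth0)

# 2) Affordance-guided next view: pick the candidate viewpoint that best covers
#    the actionable region predicted in the global view.
candidates = [np.random.uniform(-1.0, 1.0, size=3) for _ in range(16)]
best_view = max(candidates, key=lambda v: expected_affordance_coverage(v, aff0, depth0))

# 3) Detailed view from the selected viewpoint, then fuse the two views
#    (a simple average / max stands in for geometry-consistent fusion).
rgb1 = render_view(best_view)
depth1 = predict_depth(rgb1)
aff1 = predict_affordance(rgb1, depth1)
depth_fused = 0.5 * (depth0 + depth1)
aff_fused = np.maximum(aff0, aff1)

print("selected viewpoint:", best_view, "refined affordance peak:", aff_fused.max())
```

The key design point the sketch tries to convey is that the coarse affordance map acts as a constraint on view selection: candidate viewpoints are ranked by how well they observe the predicted actionable region, rather than by generic coverage of the object.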
Related papers
- FusionSense: Bridging Common Sense, Vision, and Touch for Robust Sparse-View Reconstruction [17.367277970910813]
Humans effortlessly integrate common-sense knowledge with sensory input from vision and touch to understand their surroundings.
We introduce FusionSense, a novel 3D reconstruction framework that enables robots to fuse priors from foundation models with highly sparse observations from vision and tactile sensors.
arXiv Detail & Related papers (2024-10-10T18:07:07Z)
- 3D Foundation Models Enable Simultaneous Geometry and Pose Estimation of Grasped Objects [13.58353565350936]
We contribute methodology to jointly estimate the geometry and pose of objects grasped by a robot.
Our method transforms the estimated geometry into the robot's coordinate frame.
We empirically evaluate our approach on a robot manipulator holding a diverse set of real-world objects.
arXiv Detail & Related papers (2024-07-14T21:02:55Z)
- 3D Feature Distillation with Object-Centric Priors [9.626027459292926]
2D vision-language models such as CLIP have become widely popular due to their impressive open-vocabulary grounding capabilities in 2D images.
Recent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific or focus on indoor room scan data.
We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency.
arXiv Detail & Related papers (2024-06-26T20:16:49Z)
- Towards Generalizable Multi-Camera 3D Object Detection via Perspective Debiasing [28.874014617259935]
Multi-Camera 3D Object Detection (MC3D-Det) has gained prominence with the advent of bird's-eye view (BEV) approaches.
We propose a novel method that aligns 3D detection with 2D camera plane results, ensuring consistent and accurate detections.
arXiv Detail & Related papers (2023-10-17T15:31:28Z)
- Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z)
- Leveraging Single-View Images for Unsupervised 3D Point Cloud Completion [53.93172686610741]
Cross-PCC is an unsupervised point cloud completion method without requiring any 3D complete point clouds.
To take advantage of the complementary information from 2D images, we use a single-view RGB image to extract 2D features.
Our method even achieves comparable performance to some supervised methods.
arXiv Detail & Related papers (2022-12-01T15:11:21Z)
- SimIPU: Simple 2D Image and 3D Point Cloud Unsupervised Pre-Training for Spatial-Aware Visual Representations [85.38562724999898]
We propose a 2D Image and 3D Point cloud Unsupervised pre-training strategy, called SimIPU.
Specifically, we develop a multi-modal contrastive learning framework that consists of an intra-modal spatial perception module and an inter-modal feature interaction module.
To the best of our knowledge, this is the first study to explore contrastive learning pre-training strategies for outdoor multi-modal datasets.
arXiv Detail & Related papers (2021-12-09T03:27:00Z)
- Supervised Training of Dense Object Nets using Optimal Descriptors for Industrial Robotic Applications [57.87136703404356]
Dense Object Nets (DONs) by Florence, Manuelli and Tedrake introduced dense object descriptors as a novel visual object representation for the robotics community.
In this paper we show that given a 3D model of an object, we can generate its descriptor space image, which allows for supervised training of DONs.
We compare the training methods on generating 6D grasps for industrial objects and show that our novel supervised training approach improves the pick-and-place performance in industry-relevant tasks.
arXiv Detail & Related papers (2021-02-16T11:40:12Z)
- Nothing But Geometric Constraints: A Model-Free Method for Articulated Object Pose Estimation [89.82169646672872]
We propose an unsupervised vision-based system to estimate the joint configurations of the robot arm from a sequence of RGB or RGB-D images without knowing the model a priori.
We combine a classical geometric formulation with deep learning and extend the use of epipolar multi-rigid-body constraints to solve this task.
arXiv Detail & Related papers (2020-11-30T20:46:48Z)
- A Long Horizon Planning Framework for Manipulating Rigid Pointcloud Objects [25.428781562909606]
We present a framework for solving long-horizon planning problems involving manipulation of rigid objects.
Our method plans in the space of object subgoals and frees the planner from reasoning about robot-object interaction dynamics.
arXiv Detail & Related papers (2020-11-16T18:59:33Z)
- Single View Metrology in the Wild [94.7005246862618]
We present a novel approach to single view metrology that can recover the absolute scale of a scene represented by 3D heights of objects or camera height above the ground.
Our method relies on data-driven priors learned by a deep network specifically designed to imbibe weakly supervised constraints from the interplay of the unknown camera with 3D entities such as object heights.
We demonstrate state-of-the-art qualitative and quantitative results on several datasets as well as applications including virtual object insertion.
arXiv Detail & Related papers (2020-07-18T22:31:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.