Related papers: A Training-Free Framework for Precise Mobile Manipulation of Small Everyday Objects

URL: http://arxiv.org/abs/2502.13964v1
Date: Wed, 19 Feb 2025 18:59:17 GMT
Title: A Training-Free Framework for Precise Mobile Manipulation of Small Everyday Objects
Authors: Arjun Gupta, Rishik Sathua, Saurabh Gupta
Abstract summary: We develop a closed-loop training-free framework that enables a mobile manipulator to tackle precise tasks involving the manipulation of small objects. Servoing with Vision Models (SVM) employs an RGB-D wrist camera and uses visual servoing for control. We demonstrate that open-vocabulary object detectors can serve as a drop-in module to identify semantic targets.
Score: 16.018172627950857
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Many everyday mobile manipulation tasks require precise interaction with small objects, such as grasping a knob to open a cabinet or pressing a light switch. In this paper, we develop Servoing with Vision Models (SVM), a closed-loop training-free framework that enables a mobile manipulator to tackle such precise tasks involving the manipulation of small objects. SVM employs an RGB-D wrist camera and uses visual servoing for control. Our novelty lies in the use of state-of-the-art vision models to reliably compute 3D targets from the wrist image for diverse tasks and under occlusion due to the end-effector. To mitigate occlusion artifacts, we employ vision models to out-paint the end-effector thereby significantly enhancing target localization. We demonstrate that aided by out-painting methods, open-vocabulary object detectors can serve as a drop-in module to identify semantic targets (e.g. knobs) and point tracking methods can reliably track interaction sites indicated by user clicks. This training-free method obtains an 85% zero-shot success rate on manipulating unseen objects in novel environments in the real world, outperforming an open-loop control method and an imitation learning baseline trained on 1000+ demonstrations by an absolute success rate of 50%.
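The closed-loop idea in the abstract (compute a 3D target from the RGB-D wrist image, then servo toward it) can be illustrated with a minimal sketch. This is not the authors' implementation: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the gain, the speed limit, and the function names `backproject`, `servo_step`, and `simulate` are all hypothetical. The vision-model step that produces the target pixel is omitted, and the camera is assumed to translate without rotating.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Pinhole back-projection of pixel (u, v) with depth in metres
    to a 3D point in the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def servo_step(target: Vec3, desired: Vec3,
               gain: float = 0.5, max_speed: float = 0.05) -> Vec3:
    """Proportional camera-frame velocity (m/s) that drives the observed
    target toward the desired offset, clipped to max_speed."""
    cmd = tuple(gain * (t - d) for t, d in zip(target, desired))
    norm = sum(c * c for c in cmd) ** 0.5
    if norm > max_speed:
        cmd = tuple(c * max_speed / norm for c in cmd)
    return cmd


def simulate(target: Vec3, desired: Vec3, dt: float = 0.1,
             max_steps: int = 500, tol: float = 0.002) -> Tuple[Vec3, int]:
    """Closed loop with a static target: each step re-observes the target,
    commands a velocity, and moves the camera for dt seconds."""
    for step in range(max_steps):
        err = sum((t - d) ** 2 for t, d in zip(target, desired)) ** 0.5
        if err < tol:
            return target, step
        cmd = servo_step(target, desired)
        # Camera translation by cmd * dt shifts the target the opposite
        # way in the camera frame (assumed sign convention).
        target = tuple(t - c * dt for t, c in zip(target, cmd))
    return target, max_steps


if __name__ == "__main__":
    # Hypothetical intrinsics and a hypothetical detected pixel with depth.
    start = backproject(400, 300, 0.5, fx=600, fy=600, cx=320, cy=240)
    final, steps = simulate(start, desired=(0.0, 0.0, 0.1))
    print("final target in camera frame:", final, "after", steps, "steps")
```

Because the target is re-observed at every step, errors in any single motion are corrected on the next iteration, which is the property that distinguishes this kind of loop from the open-loop baseline mentioned in the abstract.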

Related papers

3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks [19.026406684039006]
Recent work has demonstrated the capabilities of fine-tuning large Vision-Language Models to learn the mapping between RGB images, language instructions, and joint space control. In this work, we explore methods to improve the scene context awareness of a popular recent Vision-Language-Action model. Our proposed model, 3D-CAVLA, improves the success rate across various LIBERO task suites, achieving an average success rate of 98.1%.
arXiv Detail & Related papers (2025-05-09T05:32:40Z)
PickScan: Object discovery and reconstruction from handheld interactions [99.99566882133179]
We develop an interaction-guided and class-agnostic method to reconstruct 3D representations of scenes. Our main contribution is a novel approach to detecting user-object interactions and extracting the masks of manipulated objects. Compared to Co-Fusion, the only comparable interaction-based and class-agnostic baseline, this corresponds to a reduction in chamfer distance of 73%.
arXiv Detail & Related papers (2024-11-17T23:09:08Z)
Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors [30.579707929061026]
Our work explores the grounding of fine-grained part descriptors for precise manipulation in a zero-shot setting. We tackle the problem by framing it as a dense semantic part correspondence task. Our model returns a gripper pose for manipulating a specific part, using as reference a user-defined click from a source image of a visually different instance of the same object.
arXiv Detail & Related papers (2024-03-21T16:26:19Z)
Modular Neural Network Policies for Learning In-Flight Object Catching with a Robot Hand-Arm System [55.94648383147838]
We present a modular framework designed to enable a robot hand-arm system to learn how to catch flying objects. Our framework consists of five core modules: (i) an object state estimator that learns object trajectory prediction, (ii) a catching pose quality network that learns to score and rank object poses for catching, (iii) a reaching control policy trained to move the robot hand to pre-catch poses, and (iv) a grasping control policy trained to perform soft catching motions. We conduct extensive evaluations of our framework in simulation for each module and the integrated system, to demonstrate high success rates of in-flight
arXiv Detail & Related papers (2023-12-21T16:20:12Z)
One-shot Imitation Learning via Interaction Warping [32.5466340846254]
We propose a new method, Interaction Warping, for learning SE(3) robotic manipulation policies from a single demonstration. We infer the 3D mesh of each object in the environment using shape warping, a technique for aligning point clouds across object instances. We show successful one-shot imitation learning on three simulated and real-world object re-arrangement tasks.
arXiv Detail & Related papers (2023-06-21T17:26:11Z)
Decoupling Skill Learning from Robotic Control for Generalizable Object Manipulation [35.34044822433743]
Recent works in robotic manipulation have shown potential for tackling a range of tasks. We conjecture that this is due to the high-dimensional action space for joint control. In this paper, we take an alternative approach and separate the task of learning 'what to do' from 'how to do it'. The whole-body robotic kinematic control is optimized to execute the high-dimensional joint motion to reach the goals in the workspace.
arXiv Detail & Related papers (2023-03-07T16:31:13Z)
Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation [97.17517060585875]
We present a unified approach to visual navigation using a novel modular transfer learning model. Our model can effectively leverage its experience from one source task and apply it to multiple target tasks. Our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
arXiv Detail & Related papers (2022-02-05T00:07:21Z)
Task-Focused Few-Shot Object Detection for Robot Manipulation [1.8275108630751844]
We develop a manipulation method based solely on detection, then introduce task-focused few-shot object detection to learn new objects and settings. In experiments for our interactive approach to few-shot learning, we train a robot to manipulate objects directly from detection (ClickBot).
arXiv Detail & Related papers (2022-01-28T21:52:05Z)
V-MAO: Generative Modeling for Multi-Arm Manipulation of Articulated Objects [51.79035249464852]
We present a framework for learning multi-arm manipulation of articulated objects. Our framework includes a variational generative model that learns contact point distribution over object rigid parts for each robot arm.
arXiv Detail & Related papers (2021-11-07T02:31:09Z)
Towards unconstrained joint hand-object reconstruction from RGB videos [81.97694449736414]
Reconstructing hand-object manipulations holds a great potential for robotics and learning from human demonstrations. We first propose a learning-free fitting approach for hand-object reconstruction which can seamlessly handle two-hand object interactions.
arXiv Detail & Related papers (2021-08-16T12:26:34Z)
Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching. Our approach learns entirely using offline, unlabeled data. We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z)
"What's This?" -- Learning to Segment Unknown Objects from Manipulation Sequences [27.915309216800125]
We present a novel framework for self-supervised grasped object segmentation with a robotic manipulator. We propose a single, end-to-end trainable architecture which jointly incorporates motion cues and semantic knowledge. Our method neither depends on any visual registration of a kinematic robot or 3D object models, nor on precise hand-eye calibration or any additional sensor data.
arXiv Detail & Related papers (2020-11-06T10:55:28Z)
Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots. We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector. We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.