Multi-step manipulation task and motion planning guided by video demonstration
- URL: http://arxiv.org/abs/2505.08949v1
- Date: Tue, 13 May 2025 20:27:16 GMT
- Title: Multi-step manipulation task and motion planning guided by video demonstration
- Authors: Kateryna Zorina, David Kovar, Mederic Fourmy, Florent Lamiraux, Nicolas Mansard, Justin Carpentier, Josef Sivic, Vladimir Petrik
- Abstract summary: This work aims to leverage instructional video to solve complex multi-step task-and-motion planning tasks in robotics. We propose an extension of the well-established Rapidly-Exploring Random Tree (RRT) planner, which simultaneously grows multiple trees around grasp and release states extracted from the guiding video. We demonstrate the effectiveness of our planning algorithm on several robots, including the Franka Emika Panda and the KUKA KMR iiwa.
- Score: 33.01481150518225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work aims to leverage instructional video to solve complex multi-step task-and-motion planning tasks in robotics. Towards this goal, we propose an extension of the well-established Rapidly-Exploring Random Tree (RRT) planner, which simultaneously grows multiple trees around grasp and release states extracted from the guiding video. Our key novelty lies in combining contact states and 3D object poses extracted from the guiding video with a traditional planning algorithm, which allows us to solve tasks with sequential dependencies, for example, when an object needs to be placed at a specific location before it can be grasped later. We also investigate the generalization capabilities of our approach to go beyond the scene depicted in the instructional video. To demonstrate the benefits of the proposed video-guided planning approach, we design a new benchmark with three challenging tasks: (i) 3D re-arrangement of multiple objects between a table and a shelf, (ii) multi-step transfer of an object through a tunnel, and (iii) transferring objects using a tray, similar to how a waiter transfers dishes. We demonstrate the effectiveness of our planning algorithm on several robots, including the Franka Emika Panda and the KUKA KMR iiwa. For a seamless transfer of the obtained plans to the real robot, we develop a trajectory refinement approach formulated as an optimal control problem (OCP).
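As a rough illustration of the planning idea, the following is a minimal Python sketch of an RRT variant that grows one tree per grasp/release state taken from the video and tries to bridge consecutive trees. The helpers `sample_free`, `nearest`, `steer`, and `collision_free` are assumed placeholders; the actual planner additionally reasons about contact states and 3D object poses.

```python
# Minimal sketch of video-guided multi-tree RRT (illustrative, not the
# authors' implementation). `grasp_release_states` are configurations
# extracted from the demonstration video; all helper callables are
# assumed placeholders supplied by the caller.
def multi_tree_rrt(grasp_release_states, sample_free, nearest, steer,
                   collision_free, iterations=5000):
    # One tree rooted at every grasp/release state from the video,
    # stored as a node -> parent mapping over configuration tuples.
    trees = [{tuple(q): None} for q in grasp_release_states]
    bridges = {}  # (i, i + 1) -> (node in tree i, node in tree i + 1)
    for _ in range(iterations):
        q_rand = sample_free()
        for i, tree in enumerate(trees):
            q_near = nearest(tree.keys(), q_rand)
            q_new = tuple(steer(q_near, q_rand))
            if not collision_free(q_near, q_new):
                continue
            tree[q_new] = q_near  # extend tree i toward the sample
            # A new node may bridge this tree to the next one in the
            # demonstrated grasp/release sequence.
            if i + 1 < len(trees) and (i, i + 1) not in bridges:
                q_close = nearest(trees[i + 1].keys(), q_new)
                if collision_free(q_new, q_close):
                    bridges[(i, i + 1)] = (q_new, q_close)
        if len(bridges) == len(trees) - 1:  # all consecutive trees linked
            return trees, bridges
    return None  # no full multi-step plan found within the budget
```

Chaining the bridged trees yields a configuration-space path that visits the grasp and release states in the demonstrated order; per the abstract, the resulting plan is then refined by solving an OCP before execution on the real robot.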
Related papers
- Language-Grounded Hierarchical Planning and Execution with Multi-Robot 3D Scene Graphs [44.52978937479273]
We introduce a multi-robot system that integrates mapping, localization, and task and motion planning (TAMP). Our system builds a shared 3D scene graph incorporating an open-set object-based map, which is leveraged for multi-robot 3D scene graph fusion. We provide an experimental assessment of the performance of our system on real-world tasks in large-scale, outdoor environments.
arXiv Detail & Related papers (2025-06-09T06:02:34Z)
- Exploring 3D Activity Reasoning and Planning: From Implicit Human Intentions to Route-Aware Planning [103.24305074625106]
We propose 3D activity reasoning and planning, a novel 3D task that reasons about the intended activities from implicit instructions and decomposes them into steps with inter-step routes and planning. First, we construct ReasonPlan3D, a large-scale benchmark that covers diverse 3D scenes with rich implicit instructions. Second, we design a novel framework that introduces progressive plan generation with contextual consistency across multiple steps.
arXiv Detail & Related papers (2025-03-17T09:33:58Z)
- A Task and Motion Planning Framework Using Iteratively Deepened AND/OR Graph Networks [3.635602838654497]
We present an approach for integrated task and motion planning based on an AND/OR graph network. We leverage it to implement different classes of task and motion planning (TAMP) problems. The approach is evaluated and validated both in simulation and with a real dual-arm robot manipulator, that is, Baxter from Rethink Robotics.
arXiv Detail & Related papers (2025-03-10T17:28:22Z)
- SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation [62.58480650443393]
SAM-E leverages Segment Anything (SAM), a vision foundation model, for generalizable scene understanding and sequence imitation.
We develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass.
arXiv Detail & Related papers (2024-05-30T00:32:51Z)
- OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning [68.45848423501927]
We propose a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. Our approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions.
arXiv Detail & Related papers (2024-05-02T17:59:24Z)
- Optimal Integrated Task and Path Planning and Its Application to Multi-Robot Pickup and Delivery [10.530860023128406]
We propose a generic multi-robot planning mechanism that combines an optimal task planner and an optimal path planner.
The integrated planner, through the interaction of the task planner and the path planner, produces optimal collision-free trajectories for the robots.
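As a generic illustration of this interaction (an assumed pattern, not necessarily the paper's exact mechanism), the task planner proposes an action sequence, the path planner attempts each motion, and failed motions are fed back as constraints:

```python
# Sketch of an integrated task-and-path planning loop (assumed generic
# pattern). `task_planner` maps a constraint set to an optimal action
# sequence (or None); `path_planner` maps an action to a collision-free
# trajectory (or None). Both are placeholders.
def integrated_plan(task_planner, path_planner, max_rounds=50):
    constraints = set()
    for _ in range(max_rounds):
        actions = task_planner(constraints)
        if actions is None:
            return None  # task level infeasible under current constraints
        trajectories = []
        for action in actions:
            trajectory = path_planner(action)
            if trajectory is None:
                constraints.add(action)  # forbid the failed action, replan
                break
            trajectories.append(trajectory)
        else:
            return list(zip(actions, trajectories))  # every motion feasible
    return None
```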
arXiv Detail & Related papers (2024-03-02T17:48:40Z)
- SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning [15.346150968195015]
We introduce SayPlan, a scalable approach to large-scale task planning for robotics using 3D scene graph (3DSG) representations.
We evaluate our approach on two large-scale environments spanning up to 3 floors and 36 rooms with 140 assets and objects.
arXiv Detail & Related papers (2023-07-12T12:37:55Z)
- Embodied Task Planning with Large Language Models [86.63533340293361]
We propose a TAsk Planing Agent (TaPA) for grounded planning in embodied tasks under physical scene constraints.
During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected at different achievable locations.
Experimental results show that the generated plan from our TaPA framework achieves a higher success rate than LLaVA and GPT-3.5 by a sizable margin.
arXiv Detail & Related papers (2023-07-04T17:58:25Z)
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- Enabling Visual Action Planning for Object Manipulation through Latent Space Roadmap [72.01609575400498]
We present a framework for visual action planning of complex manipulation tasks with high-dimensional state spaces.
We propose a Latent Space Roadmap (LSR) for task planning: a graph-based structure that globally captures the system dynamics in a low-dimensional latent space.
We present a thorough investigation of our framework on two simulated box stacking tasks and a folding task executed on a real robot.
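For intuition, here is a minimal sketch of the roadmap idea under stated assumptions (latent encodings from a pre-trained encoder and recorded action-linked state pairs; not the authors' implementation):

```python
# Sketch of a Latent Space Roadmap-style graph. Assumptions: `latent_states`
# is an (N, d) array of encoder outputs, `transitions` lists index pairs of
# states connected by an observed action.
import networkx as nx
from sklearn.cluster import KMeans

def build_roadmap(latent_states, transitions, n_regions=20):
    # Cluster latent points into regions that serve as graph nodes.
    labels = KMeans(n_clusters=n_regions, n_init=10).fit_predict(latent_states)
    graph = nx.Graph()
    graph.add_nodes_from(range(n_regions))
    for i, j in transitions:
        if labels[i] != labels[j]:
            # An action observed between two regions becomes an edge.
            graph.add_edge(labels[i], labels[j])
    return graph, labels

# Task planning then reduces to a shortest path between the regions holding
# the encoded start and goal observations, e.g.:
# plan = nx.shortest_path(graph, labels[start_idx], labels[goal_idx])
```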
arXiv Detail & Related papers (2021-03-03T17:48:26Z)
- A Long Horizon Planning Framework for Manipulating Rigid Pointcloud Objects [25.428781562909606]
We present a framework for solving long-horizon planning problems involving manipulation of rigid objects.
Our method plans in the space of object subgoals and frees the planner from reasoning about robot-object interaction dynamics.
arXiv Detail & Related papers (2020-11-16T18:59:33Z)
- Modeling Long-horizon Tasks as Sequential Interaction Landscapes [75.5824586200507]
We present a deep learning network that learns dependencies and transitions across subtasks solely from a set of demonstration videos.
We show that these symbols can be learned and predicted directly from image observations.
We evaluate our framework on two long-horizon tasks: (1) block stacking of puzzle pieces executed by humans, and (2) a robot manipulation task involving pick and place of objects and sliding a cabinet door with a 7-DoF robot arm.
arXiv Detail & Related papers (2020-06-08T18:07:18Z)