Unlocking Generalization for Robotics via Modularity and Scale
- URL: http://arxiv.org/abs/2503.06814v1
- Date: Mon, 10 Mar 2025 00:38:31 GMT
- Title: Unlocking Generalization for Robotics via Modularity and Scale
- Authors: Murtaza Dalal
- Abstract summary: This thesis seeks to tackle the task of building generalist robot agents by integrating modularity with large-scale learning for general purpose robot control. Our key insight is that rather than having the agent learn hierarchy and low-level control end-to-end, we can enforce modularity via planning. To scale, neural networks require vast amounts of diverse data, expressive architectures to fit the data and a source of supervision to generate the data.
- Score: 7.650888732318727
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How can we build generalist robot systems? Scale may not be enough due to the significant multimodality of robotics tasks, lack of easily accessible data and the challenges of deploying on physical hardware. Meanwhile, most deployed robotic systems today are inherently modular and can leverage the independent generalization capabilities of each module to perform well. Therefore, this thesis seeks to tackle the task of building generalist robot agents by integrating these components into one: combining modularity with large-scale learning for general purpose robot control. The first question we consider is: how can we build modularity and hierarchy into learning systems? Our key insight is that rather than having the agent learn hierarchy and low-level control end-to-end, we can enforce modularity via planning to enable more efficient and capable robot learners. Next, we come to the role of scale in building generalist robot systems. To scale, neural networks require vast amounts of diverse data, expressive architectures to fit the data and a source of supervision to generate the data. We leverage a powerful supervision source: classical planning, which can generalize, but is expensive to run and requires access to privileged information to perform well in practice. We use these planners to supervise large-scale policy learning in simulation to produce generalist agents. Finally, we consider how to unify modularity with large-scale policy learning to build real-world robot systems capable of performing zero-shot manipulation. We do so by tightly integrating key ingredients of modular high and mid-level planning, learned local control, procedural scene generation and large-scale policy learning for sim2real transfer. We demonstrate that this recipe can produce a single, generalist agent that can solve challenging long-horizon manipulation tasks in the real world.
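As a concrete illustration of the recipe in the abstract, a classical planner with access to privileged simulator state can generate demonstrations that a sensor-driven policy is then distilled from. The sketch below is a minimal stand-in under assumed interfaces; `env`, `planner`, and the network are hypothetical names, not the thesis's actual pipeline.

```python
# Minimal sketch of planner-supervised policy learning: a classical
# planner with access to privileged simulator state generates
# demonstrations, and a neural policy that sees only raw observations
# is distilled from them by behavior cloning. `env`, `planner`, and
# all interfaces here are hypothetical stand-ins, not the thesis code.
import torch
import torch.nn as nn

def collect_demos(env, planner, num_episodes):
    """Roll out the privileged planner in simulation to build a dataset."""
    demos = []
    for _ in range(num_episodes):
        obs, priv_state = env.reset()        # planner sees privileged state
        for action in planner(priv_state):   # e.g. a motion plan to the goal
            demos.append((obs, action))      # the policy will see only obs
            (obs, priv_state), done = env.step(action)
            if done:
                break
    return demos

class Policy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

def distill(policy, demos, epochs=10, lr=3e-4):
    """Behavior cloning: regress the planner's actions from observations."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, act in demos:
            loss = nn.functional.mse_loss(policy(obs), act)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```

The division of labor mirrors the thesis's framing: the planner generalizes but needs privileged information and is expensive, so it is only used offline in simulation; the distilled policy runs from raw observations at deployment time.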
Related papers
- $π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge.
We evaluate the model on its ability to perform tasks zero-shot after pre-training, follow language instructions from people, and acquire new skills via fine-tuning. A generic flow-matching sketch appears after this list.
arXiv Detail & Related papers (2024-10-31T17:22:30Z)
- Grounding Robot Policies with Visuomotor Language Guidance [15.774237279917594]
We propose an agent-based framework for grounding robot policies to the current context.
The proposed framework is composed of a set of conversational agents designed for specific roles.
We demonstrate that our approach can effectively guide manipulation policies to achieve significantly higher success rates.
arXiv Detail & Related papers (2024-10-09T02:00:37Z)
- Grounding Language Models in Autonomous Loco-manipulation Tasks [3.8363685417355557]
We propose a novel framework that learns, selects, and plans behaviors based on tasks in different scenarios.
We leverage the planning and reasoning features of the large language model (LLM), constructing a hierarchical task graph.
Experiments in simulation and the real world using the CENTAURO robot show that the language-model-based planner can efficiently adapt to new loco-manipulation tasks.
arXiv Detail & Related papers (2024-09-02T15:27:48Z)
- MeMo: Meaningful, Modular Controllers via Noise Injection [25.541496793132183]
We show that when a new robot is built from the same parts, its control can be quickly learned by reusing the modular controllers.
We achieve this with a framework called MeMo which learns (Me)aningful, (Mo)dular controllers.
We benchmark our framework in locomotion and grasping environments on simple to complex robot morphology transfer.
arXiv Detail & Related papers (2024-05-24T18:39:20Z)
- Octo: An Open-Source Generalist Robot Policy [88.14295917143188]
We introduce Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset.
It can be effectively finetuned to robot setups with new sensory inputs and action spaces within a few hours on standard consumer GPUs. A minimal sketch of this trunk-plus-swappable-head idea appears after this list.
We also perform detailed ablations of design decisions for the Octo model, from architecture to training data, to guide future research on building generalist robot models.
arXiv Detail & Related papers (2024-05-20T17:57:01Z)
- RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis [102.1876259853457]
We propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX.
RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units consisting of physical preferences such as affordance and safety constraints.
To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training and an iterative self-updating methodology is introduced for supervised fine-tuning.
arXiv Detail & Related papers (2024-02-25T15:31:43Z)
- RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation [77.41969287400977]
This paper presents RobotScript, a platform for a deployable robot manipulation pipeline powered by code generation.
We also present a benchmark for code generation for robot manipulation tasks specified in free-form natural language.
We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z)
- RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties.
We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks.
arXiv Detail & Related papers (2022-12-13T18:55:15Z)
- MetaMorph: Learning Universal Controllers with Transformers [45.478223199658785]
In robotics, we primarily train a single robot for a single task.
Modular robot systems now allow for the flexible combination of general-purpose building blocks into task-optimized morphologies.
We propose MetaMorph, a Transformer-based approach to learn a universal controller over a modular robot design space. A minimal per-joint tokenization sketch appears after this list.
arXiv Detail & Related papers (2022-03-22T17:58:31Z)
- Deep Imitation Learning for Bimanual Robotic Manipulation [70.56142804957187]
We present a deep imitation learning framework for robotic bimanual manipulation.
A core challenge is to generalize the manipulation skills to objects in different locations.
We propose to (i) decompose the multi-modal dynamics into elemental movement primitives, (ii) parameterize each primitive using a recurrent graph neural network to capture interactions, and (iii) integrate a high-level planner that composes primitives sequentially and a low-level controller to combine primitive dynamics and inverse kinematics control.
arXiv Detail & Related papers (2020-10-11T01:40:03Z)
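For the $π_0$ entry above, the core training objective of a flow-matching action head can be illustrated generically: interpolate between Gaussian noise and an expert action chunk and regress the constant velocity between them. This is a minimal sketch of generic conditional flow matching, not the paper's architecture; `VelocityNet`, the context features, and all dimensions are assumptions.

```python
# Generic conditional flow matching in the spirit of the pi_0 entry:
# a network v(a_t, t, context) is trained to predict the straight-line
# velocity from noise to expert actions. Illustrative only.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, act_dim, ctx_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + ctx_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, a_t, t, ctx):
        return self.net(torch.cat([a_t, ctx, t], dim=-1))

def flow_matching_loss(v_net, actions, ctx):
    """actions: expert actions (B, act_dim); ctx: e.g. VLM features (B, ctx_dim)."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)      # sample time uniformly in [0, 1]
    a_t = (1 - t) * noise + t * actions      # linear interpolation path
    target_v = actions - noise               # constant velocity along the path
    return nn.functional.mse_loss(v_net(a_t, t, ctx), target_v)

@torch.no_grad()
def sample_actions(v_net, ctx, act_dim, steps=10):
    """Integrate the learned velocity field from noise to an action."""
    a = torch.randn(ctx.shape[0], act_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((ctx.shape[0], 1), i * dt)
        a = a + dt * v_net(a, t, ctx)
    return a
```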
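For the Octo entry, the idea of finetuning one large policy to new setups can be sketched as a shared transformer trunk with a small readout head that is reinitialized per robot, so only the head must change when the action space changes. This is a hypothetical miniature under assumed sizes, not Octo's actual code.

```python
# Minimal sketch of a generalist policy with a swappable action head:
# a shared transformer trunk over observation tokens, plus a small
# per-robot head. Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class GeneralistPolicy(nn.Module):
    def __init__(self, token_dim=256, n_layers=4, n_heads=8, act_dim=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=n_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.readout = nn.Parameter(torch.zeros(1, 1, token_dim))
        self.action_head = nn.Linear(token_dim, act_dim)  # swapped per robot

    def forward(self, obs_tokens):
        # obs_tokens: (B, T, token_dim) from image/language tokenizers
        readout = self.readout.expand(obs_tokens.shape[0], -1, -1)
        x = self.trunk(torch.cat([obs_tokens, readout], dim=1))
        return self.action_head(x[:, -1])     # predict from the readout token

def adapt_to_new_robot(policy, new_act_dim):
    """Finetuning recipe: keep the pretrained trunk, reinit only the head."""
    policy.action_head = nn.Linear(policy.readout.shape[-1], new_act_dim)
    return policy
```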
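For the MetaMorph entry, a universal controller over a modular design space can be sketched by treating each joint as a token (its static hardware description plus its local observation) and letting a shared transformer emit one action per token, so a single policy covers many morphologies. Again a hedged sketch under assumed shapes, not the paper's implementation.

```python
# Minimal sketch of a morphology-agnostic controller: one token per
# joint, a shared transformer, one action per token. Illustrative only.
import torch
import torch.nn as nn

class UniversalController(nn.Module):
    def __init__(self, joint_desc_dim, obs_dim, d_model=128, n_layers=3):
        super().__init__()
        self.embed = nn.Linear(joint_desc_dim + obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.act = nn.Linear(d_model, 1)   # one torque per joint token

    def forward(self, joint_desc, joint_obs):
        # joint_desc: (B, J, joint_desc_dim)  static morphology features
        # joint_obs:  (B, J, obs_dim)         per-joint proprioception
        tokens = self.embed(torch.cat([joint_desc, joint_obs], dim=-1))
        return self.act(self.encoder(tokens)).squeeze(-1)  # (B, J) actions
```

Because the sequence length J is just the number of joints, the same weights apply to any robot assembled from the shared part library.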