Maestro: Orchestrating Robotics Modules with Vision-Language Models for Zero-Shot Generalist Robots
- URL: http://arxiv.org/abs/2511.00917v1
- Date: Sun, 02 Nov 2025 12:34:37 GMT
- Title: Maestro: Orchestrating Robotics Modules with Vision-Language Models for Zero-Shot Generalist Robots
- Authors: Junyao Shi, Rujia Yang, Kaitian Chao, Selina Bingqing Wan, Yifei Shao, Jiahui Lei, Jianing Qian, Long Le, Pratik Chaudhari, Kostas Daniilidis, Chuan Wen, Dinesh Jayaraman
- Abstract summary: We build policies around vision-language models (VLMs) by augmenting their general capabilities with specific robot capabilities encapsulated in a curated set of perception, planning, and control modules. In Maestro, a VLM coding agent dynamically composes these modules into a programmatic policy for the current task and scenario.
- Score: 54.62646284378409
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Today's best-explored routes towards generalist robots center on collecting ever larger "observations-in actions-out" robotics datasets to train large end-to-end models, copying a recipe that has worked for vision-language models (VLMs). We pursue a road less traveled: building generalist policies directly around VLMs by augmenting their general capabilities with specific robot capabilities encapsulated in a carefully curated set of perception, planning, and control modules. In Maestro, a VLM coding agent dynamically composes these modules into a programmatic policy for the current task and scenario. Maestro's architecture benefits from a streamlined closed-loop interface without many manually imposed structural constraints, and a comprehensive and diverse tool repertoire. As a result, it largely surpasses today's VLA models for zero-shot performance on challenging manipulation skills. Further, Maestro is easily extensible to incorporate new modules, easily editable to suit new embodiments such as a quadruped-mounted arm, and even easily adapts from minimal real-world experiences through local code edits.
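To make the abstract's core idea concrete, here is a minimal sketch of how a VLM coding agent might compose perception, planning, and control modules into a programmatic closed-loop policy. All module names, signatures, and the hard-coded "generated" program below are illustrative assumptions, not Maestro's actual interface or tool repertoire.

```python
"""Hypothetical sketch (not the paper's code) of a VLM coding agent composing
perception/planning/control modules into a programmatic policy."""
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Module:
    name: str
    description: str          # what the coding agent reads when choosing tools
    fn: Callable[..., object]


def detect_object(image, prompt: str):
    # Placeholder perception module: would return the pose of the named object.
    return {"object": prompt, "pose": (0.4, 0.1, 0.02)}


def plan_grasp(detection):
    # Placeholder planning module: would compute a grasp for the detected pose.
    return {"grasp_pose": detection["pose"], "approach": "top_down"}


def execute_motion(grasp):
    # Placeholder control module: would command the arm; here it only logs.
    print(f"executing {grasp['approach']} grasp at {grasp['grasp_pose']}")
    return True


# A toy "tool repertoire" the agent can draw from.
REPERTOIRE: Dict[str, Module] = {
    m.name: m for m in [
        Module("detect_object", "locate a named object in the camera image", detect_object),
        Module("plan_grasp", "compute a grasp from an object detection", plan_grasp),
        Module("execute_motion", "run the planned motion on the robot", execute_motion),
    ]
}

# In the real system, a VLM would write this program after reading the task
# and the module descriptions; it is hard-coded here purely for illustration.
POLICY_PROGRAM = """
detection = modules['detect_object'].fn(image, 'red mug')
grasp = modules['plan_grasp'].fn(detection)
done = modules['execute_motion'].fn(grasp)
"""


def run_policy(program: str, image) -> None:
    # Execute the generated program against the module repertoire.
    exec(program, {"modules": REPERTOIRE, "image": image})


if __name__ == "__main__":
    run_policy(POLICY_PROGRAM, image=None)  # a real system would pass camera frames
```

The point of the sketch is the division of labor: the fixed repertoire supplies robot-specific skills, while the coding agent supplies task-specific composition, so extending the system amounts to registering new modules or editing the generated program rather than retraining an end-to-end model.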
Related papers
- HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies [83.41714103649751]
Development of embodied intelligence models depends on access to high-quality robot demonstration data. We present HiMoE-VLA, a novel vision-language-action framework tailored to handle diverse robotic data with heterogeneity. HiMoE-VLA demonstrates a consistent performance boost over existing VLA baselines, achieving higher accuracy and robust generalization.
arXiv Detail & Related papers (2025-12-05T13:21:05Z) - Ctrl-World: A Controllable Generative World Model for Robot Manipulation [53.71061464925014]
Generalist robot policies can perform a wide range of manipulation skills. Evaluating and improving their ability with unfamiliar objects and instructions remains a significant challenge. World models offer a promising, scalable alternative by enabling policies to roll out within imagination space.
arXiv Detail & Related papers (2025-10-11T09:13:10Z) - Latent Action Pretraining Through World Modeling [1.988007188564225]
We propose LAWM, a model-agnostic framework to pretrain imitation learning models in a self-supervised way. Our framework is designed to be effective for transferring across tasks, environments, and embodiments.
arXiv Detail & Related papers (2025-09-22T21:19:10Z) - OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis [70.39500621448383]
Open-world mobile manipulation remains a challenge due to the need for generalization to open-ended instructions and environments. We propose a novel multi-modal agent architecture that maintains multi-view scene frames and agent states for decision-making and controls the robot by function calling. We highlight our fine-tuned OWMM-VLM as the first dedicated foundation model for mobile manipulators with global scene understanding, robot state tracking, and multi-modal action generation in a unified model.
arXiv Detail & Related papers (2025-06-04T17:57:44Z) - Unlocking Generalization for Robotics via Modularity and Scale [7.650888732318727]
This thesis seeks to tackle the task of building generalist robot agents by integrating modularity with large-scale learning for general-purpose robot control. Our key insight is that rather than having the agent learn hierarchy and low-level control end-to-end, we can enforce modularity via planning. To scale, neural networks require vast amounts of diverse data, expressive architectures to fit the data, and a source of supervision to generate the data.
arXiv Detail & Related papers (2025-03-10T00:38:31Z) - MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation [62.854649499866774]
Large Language Models (LLMs) have demonstrated remarkable planning abilities across various domains, including robotics manipulation and navigation. We propose a novel multi-agent LLM framework that distributes high-level planning and low-level control code generation across specialized LLM agents. We evaluate our approach on nine RLBench tasks, including long-horizon tasks, and demonstrate its ability to solve robotics manipulation in a zero-shot setting.
arXiv Detail & Related papers (2024-11-26T17:53:44Z) - LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z) - RoboLLM: Robotic Vision Tasks Grounded on Multimodal Large Language Models [4.4173427917548524]
Multimodal Large Language Models (MLLMs) have emerged as novel backbones for various downstream tasks.
We introduce the RoboLLM framework, equipped with a BEiT-3 backbone, to address all visual perception tasks in the ARMBench challenge.
arXiv Detail & Related papers (2023-10-16T09:30:45Z) - Programmatically Grounded, Compositionally Generalizable Robotic Manipulation [35.12811184353626]
We show that the conventional pretraining-finetuning pipeline for integrating semantic representations entangles the learning of domain-specific action information.
We propose a modular approach to better leverage pretrained models by exploiting the syntactic and semantic structures of language instructions.
Our model successfully disentangles action and perception, translating to improved zero-shot and compositional generalization in a variety of manipulation behaviors.
arXiv Detail & Related papers (2023-04-26T20:56:40Z)