RoboBench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain
- URL: http://arxiv.org/abs/2510.17801v1
- Date: Mon, 20 Oct 2025 17:59:03 GMT
- Title: RoboBench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain
- Authors: Yulin Luo, Chun-Kai Fan, Menghang Dong, Jiayu Shi, Mengdi Zhao, Bo-Wen Zhang, Cheng Chi, Jiaming Liu, Gaole Dai, Rongyu Zhang, Ruichuan An, Kun Wu, Zhengping Che, Shaoxuan Xie, Guocai Yao, Zhongxia Zhao, Pengwei Wang, Guang Liu, Zhongyuan Wang, Tiejun Huang, Shanghang Zhang
- Abstract summary: Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. We introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains.
- Score: 62.01012517796797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. In this work, we refer to System 2 as the embodied brain, emphasizing its role as the cognitive core for reasoning and decision-making in manipulation tasks. Given this role, systematic evaluation of the embodied brain is essential. Yet existing benchmarks either emphasize execution success or, when targeting high-level reasoning, suffer from incomplete dimensions and limited task realism, offering only a partial picture of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. Motivated by the critical roles the embodied brain plays across the full manipulation pipeline, RoboBench defines five dimensions: instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis. These span 14 capabilities, 25 tasks, and 6092 QA pairs. To ensure realism, we curate datasets across diverse embodiments, attribute-rich objects, and multi-view scenes, drawing from large-scale real robotic data. For planning, RoboBench introduces an evaluation framework, MLLM-as-world-simulator, which assesses embodied feasibility by simulating whether predicted plans can achieve the critical object-state changes a task requires. Experiments on 14 MLLMs reveal fundamental limitations: difficulties with implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution-failure diagnosis. RoboBench provides a comprehensive scaffold to quantify high-level cognition and to guide the development of next-generation embodied MLLMs. The project page is at https://robo-bench.github.io.
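To make the MLLM-as-world-simulator idea concrete, here is a minimal sketch in Python, assuming a symbolic object-state representation and a generic `mllm.complete` text API; all names (`WorldState`, `simulate_step`, `plan_is_feasible`) are illustrative stand-ins, not the paper's implementation.

```python
import json
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Symbolic object states, e.g. {"drawer": "closed", "cup": "on_table"}."""
    objects: dict = field(default_factory=dict)


def simulate_step(state: WorldState, step: str, mllm) -> WorldState:
    """Ask the MLLM, acting as a world simulator, to predict the object
    states after executing one plan step. `mllm.complete` is a stand-in
    for any text-completion API."""
    prompt = (
        f"Current object states: {json.dumps(state.objects)}\n"
        f"Plan step: {step}\n"
        "Return the updated object states as a JSON object."
    )
    return WorldState(objects=json.loads(mllm.complete(prompt)))


def plan_is_feasible(plan, initial: WorldState, goal_changes: dict, mllm) -> bool:
    """A predicted plan passes if, after simulating every step, all the
    critical object-state changes required by the task have occurred."""
    state = initial
    for step in plan:
        state = simulate_step(state, step, mllm)
    return all(state.objects.get(obj) == val
               for obj, val in goal_changes.items())
```

Scoring plans by simulated state changes, rather than by string overlap with a reference plan, is what lets the framework credit any plan that reaches the goal, not just the annotated one.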
Related papers
- RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation [80.20970723577818]
We introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. The dataset is constructed via a top-down pipeline, where GPT generates task instructions and decomposes them into subtask sequences. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser annotations.
arXiv Detail & Related papers (2025-06-07T06:15:49Z)
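As a rough sketch of such a top-down pipeline, the snippet below has an LLM write a long-horizon instruction and then decompose it into subtasks; `llm.complete` and the prompts are assumptions for illustration, not RoboCerebra's actual generation code.

```python
def generate_annotation(llm, scene_description: str) -> dict:
    """Top-down data generation: the LLM first writes a long-horizon task
    instruction for a scene, then decomposes it into an ordered subtask
    sequence. `llm.complete` is a hypothetical text-completion API."""
    instruction = llm.complete(
        f"Scene: {scene_description}\n"
        "Write one long-horizon manipulation instruction for this scene."
    ).strip()
    subtasks = llm.complete(
        f"Instruction: {instruction}\n"
        "Decompose it into an ordered list of atomic subtasks, one per line."
    ).strip().splitlines()
    return {"instruction": instruction, "subtasks": subtasks}
```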
- Think, Reflect, Create: Metacognitive Learning for Zero-Shot Robotic Planning with LLMs [3.0067862210362284]
Large language models (LLMs) have shown great potential across various domains. We present a framework that integrates metacognitive learning into LLM-powered multi-robot collaboration. We propose a more challenging robotic benchmark task and evaluate our framework on the existing benchmark and the novel task.
arXiv Detail & Related papers (2025-05-20T20:51:18Z)
- REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation [57.628771707989166]
We propose an adaptive multi-agent planning framework, termed REMAC, that enables efficient, scene-agnostic multi-robot long-horizon task planning and execution. REMAC incorporates two key modules: a self-reflection module performing pre-condition and post-condition checks in the loop to evaluate progress and refine plans, and a self-evolvement module dynamically adapting plans based on scene-specific reasoning.
arXiv Detail & Related papers (2025-03-28T03:51:40Z)
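A self-reflective check loop of this kind might look like the following sketch, assuming each step exposes hypothetical `pre_condition`, `post_condition`, and `act` methods and that a `replan` callback returns a refined step list; this illustrates the idea, not REMAC's actual design.

```python
from collections import deque


def execute_with_reflection(plan, replan, max_iters: int = 50) -> bool:
    """Run plan steps while checking each step's pre-condition before acting
    and its post-condition afterwards; on a failed check, ask the planner
    for a refined plan and continue from there."""
    queue = deque(plan)
    iters = 0
    while queue and iters < max_iters:
        iters += 1
        step = queue[0]
        if not step.pre_condition():          # e.g. "object is reachable"
            queue = deque(replan(f"pre-condition failed: {step.name}"))
            continue
        step.act()
        if step.post_condition():             # e.g. "drawer is open"
            queue.popleft()                   # step succeeded, advance
        else:
            queue = deque(replan(f"post-condition failed: {step.name}"))
    return not queue                          # True once every step is done
```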
- EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents [63.43699771428243]
EmbodiedBench is an extensive benchmark designed to evaluate vision-driven embodied agents. We evaluated 24 leading proprietary and open-source MLLMs within EmbodiedBench. MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average.
arXiv Detail & Related papers (2025-02-13T18:11:34Z)
- EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios [53.26658545922884]
We introduce EgoPlan-Bench2, a benchmark designed to assess the planning capabilities of MLLMs across a wide range of real-world scenarios. We evaluate 21 competitive MLLMs and provide an in-depth analysis of their limitations, revealing that they face significant challenges in real-world planning. Our approach enhances the performance of GPT-4V by 10.24 on EgoPlan-Bench2 without additional training.
arXiv Detail & Related papers (2024-12-05T18:57:23Z)
- RoboMatrix: A Skill-centric Hierarchical Framework for Scalable Robot Task Planning and Execution in Open-World [18.44952696124717]
RoboMatrix is a skill-centric hierarchical framework designed for scalable robot task planning and execution in open-world environments. The key innovation of our work is the introduction of the first unified vision-language-action (VLA) model capable of seamlessly integrating both movement and manipulation within one model.
arXiv Detail & Related papers (2024-11-29T17:36:03Z)
- REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction [28.015693808520496]
REFLECT is a framework which queries Large Language Models for failure reasoning based on a hierarchical summary of robot past experiences.
We show that REFLECT is able to generate informative failure explanations that assist successful correction planning.
arXiv Detail & Related papers (2023-06-27T18:03:15Z)
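In the spirit of that summary, a minimal failure-explanation query might be assembled as below, assuming a simple per-subgoal event log and a generic `llm.complete` API; REFLECT's actual hierarchical summary is multimodal and considerably richer.

```python
def summarize_experience(events) -> str:
    """Build a coarse-to-fine text summary: one line per subgoal, with the
    observations of any failing step expanded in detail. Each event is
    assumed to be a dict with `time`, `subgoal`, `success`, `observations`."""
    lines = []
    for e in events:
        status = "OK" if e["success"] else "FAILED"
        lines.append(f"[{e['time']:.1f}s] {e['subgoal']}: {status}")
        if not e["success"]:
            lines.append(f"    observations: {e['observations']}")
    return "\n".join(lines)


def explain_failure(events, llm) -> str:
    """Query the LLM for a failure explanation grounded in the summary;
    `llm.complete` is a stand-in for any chat-completion API."""
    prompt = (
        "The robot failed its task. Execution summary:\n"
        f"{summarize_experience(events)}\n"
        "Explain the most likely cause of failure and propose a correction."
    )
    return llm.complete(prompt)
```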
- LLM as A Robotic Brain: Unifying Egocentric Memory and Control [77.0899374628474]
Embodied AI focuses on the study and development of intelligent systems that possess a physical or virtual embodiment (i.e., robots).
Memory and control are the two essential parts of an embodied system and usually require separate frameworks to model each of them.
We propose a novel framework called LLM-Brain: using a large-scale language model as a robotic brain to unify egocentric memory and control.
arXiv Detail & Related papers (2023-04-19T00:08:48Z)