RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation
- URL: http://arxiv.org/abs/2506.06677v1
- Date: Sat, 07 Jun 2025 06:15:49 GMT
- Title: RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation
- Authors: Songhao Han, Boxiang Qiu, Yue Liao, Siyuan Huang, Chen Gao, Shuicheng Yan, Si Liu
- Abstract summary: We introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. The dataset is constructed via a top-down pipeline, where GPT generates task instructions and decomposes them into subtask sequences. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser annotations.
- Score: 80.20970723577818
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in vision-language models (VLMs) have enabled instruction-conditioned robotic systems with improved generalization. However, most existing work focuses on reactive System 1 policies, underutilizing VLMs' strengths in semantic reasoning and long-horizon planning. These System 2 capabilities, characterized by deliberative, goal-directed thinking, remain underexplored due to the limited temporal scale and structural complexity of current benchmarks. To address this gap, we introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. RoboCerebra includes: (1) a large-scale simulation dataset with extended task horizons and diverse subtask sequences in household environments; (2) a hierarchical framework combining a high-level VLM planner with a low-level vision-language-action (VLA) controller; and (3) an evaluation protocol targeting planning, reflection, and memory through structured System 1-System 2 interaction. The dataset is constructed via a top-down pipeline, where GPT generates task instructions and decomposes them into subtask sequences. Human operators execute the subtasks in simulation, yielding high-quality trajectories with dynamic object variations. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser annotations. We further benchmark state-of-the-art VLMs as System 2 modules and analyze their performance across key cognitive dimensions, advancing the development of more capable and generalizable robotic planners.
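The abstract describes a hierarchical pairing of a deliberative System 2 planner (a VLM) with a reactive System 1 controller (a VLA policy), evaluated on planning, reflection, and memory. The sketch below is one hedged reading of that interaction loop; the class names, the environment interface, and the reflection/memory hooks are illustrative assumptions, not the benchmark's actual API.

```python
# Minimal sketch of a System 2 (VLM planner) / System 1 (VLA controller) loop.
# All names here (VLMPlanner, VLAController, run_episode, the env interface)
# are hypothetical placeholders, not taken from the RoboCerebra codebase.
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Memory:
    """Rolling record of completed subtasks the planner can condition on."""
    completed: List[str] = field(default_factory=list)


class VLMPlanner:
    """System 2: deliberative planner backed by a vision-language model."""

    def plan(self, instruction: str, observation: Any, memory: Memory) -> List[str]:
        # Decompose the long-horizon instruction into an ordered subtask sequence.
        raise NotImplementedError

    def reflect(self, subtask: str, observation: Any, memory: Memory) -> bool:
        # Inspect the observation and decide whether the current subtask is done.
        raise NotImplementedError


class VLAController:
    """System 1: reactive vision-language-action policy for low-level control."""

    def act(self, subtask: str, observation: Any) -> Any:
        # Map the subtask instruction and current observation to a motor action.
        raise NotImplementedError


def run_episode(instruction: str, env: Any, planner: VLMPlanner,
                controller: VLAController, max_steps: int = 200) -> Memory:
    """Closed-loop rollout: the planner proposes subtasks, the controller
    executes them step by step, and the planner reflects on progress,
    updating memory as subtasks complete."""
    memory = Memory()
    observation = env.reset()  # assumed gym-like environment interface
    for subtask in planner.plan(instruction, observation, memory):
        for _ in range(max_steps):
            observation = env.step(controller.act(subtask, observation))
            if planner.reflect(subtask, observation, memory):
                break
        memory.completed.append(subtask)
    return memory
```

The split mirrors the stated evaluation protocol: the slow planner handles decomposition and reflection at subtask granularity, while the fast controller acts at every environment step.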
Related papers
- VerifyLLM: LLM-Based Pre-Execution Task Plan Verification for Robots [44.99833362998488]
We propose an architecture for automatically verifying high-level task plans before their execution in simulated or real-world environments. The module uses the reasoning capabilities of Large Language Models to evaluate logical coherence and identify potential gaps in the plan. We contribute to improving the reliability and efficiency of task planning and address the critical need for robust pre-execution verification in autonomous systems.
arXiv Detail & Related papers (2025-07-07T15:31:36Z) - LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks [31.3295171851909]
Real-world embodied agents face high-level goals demanding multi-step solutions. Long-horizon tasks require high-level task planning and low-level motion control. We introduce a new unified vision-language-action framework for long-horizon tasks, dubbed LoHoVLA.
arXiv Detail & Related papers (2025-05-31T06:01:03Z) - RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics [22.007302996282085]
This paper presents a temporal-decoupling fine-tuning strategy based on the Contrastive Language-Image Pretraining (CLIP) architecture. Results in simulated environments demonstrate that the RoboAct-CLIP pretrained model achieves a 12% higher success rate than baseline Visual Language Models.
arXiv Detail & Related papers (2025-04-02T19:02:08Z) - REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation [57.628771707989166]
We propose an adaptive multi-agent planning framework, termed REMAC, that enables efficient, scene-agnostic multi-robot long-horizon task planning and execution. REMAC incorporates two key modules: a self-reflection module that performs pre-condition and post-condition checks in the loop to evaluate progress and refine plans, and a self-evolvement module that dynamically adapts plans based on scene-specific reasoning.
arXiv Detail & Related papers (2025-03-28T03:51:40Z) - HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation [54.03004125910057]
We show that hierarchical vision-language-action models can be more effective in utilizing off-domain data than standard monolithic VLA models. With the hierarchical design, the high-level VLM can transfer across significant domain gaps between the off-domain fine-tuning data and real-robot testing scenarios.
arXiv Detail & Related papers (2025-02-08T07:50:22Z) - DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR reduces the LLM's computational cost by 5.2-6.5x and its GPU memory usage by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z) - A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM [0.26334346517416873]
Vision-Language-Action (VLA) models enable robots to perform complex tasks by integrating visual context with linguistic commands.
To overcome this, we propose Dual Process VLA (DP-VLA), a hierarchical framework inspired by dual-process theory.
Experimental results on the RoboCasa dataset demonstrate that DP-VLA achieves faster inference and higher task success rates.
arXiv Detail & Related papers (2024-10-21T00:36:02Z) - SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation [62.58480650443393]
SAM-E leverages Segment Anything (SAM), a vision foundation model, for generalizable scene understanding, combined with sequence imitation.
We develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass.
arXiv Detail & Related papers (2024-05-30T00:32:51Z) - Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation [38.66406497318709]
This work focuses on the tabletop manipulation task and releases a simulation benchmark, LoHoRavens, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetic, and reference.
We investigate two methods of bridging the modality gap: caption generation and learnable interface for incorporating explicit and implicit observation feedback to the LLM.
arXiv Detail & Related papers (2023-10-18T14:53:14Z)