Related papers: FrankenBot: Brain-Morphic Modular Orchestration for Robotic Manipulation with Vision-Language Models

URL: http://arxiv.org/abs/2506.21627v1
Date: Tue, 24 Jun 2025 14:11:22 GMT
Title: FrankenBot: Brain-Morphic Modular Orchestration for Robotic Manipulation with Vision-Language Models
Authors: Shiyi Wang, Wenbo Li, Yiteng Chen, Qingyao Wu, Huiping Zhuang
Abstract summary: Vision-Language Models (VLMs) have acquired rich world knowledge, exhibiting exceptional scene understanding and multimodal reasoning capabilities. We propose FrankenBot, a VLM-driven, brain-morphic robotic manipulation framework that achieves both comprehensive functionality and high operational efficiency.
Score: 35.83717913117858
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Developing a general robot manipulation system capable of performing a wide range of tasks in complex, dynamic, and unstructured real-world environments has long been a challenge. It is widely recognized that achieving human-like efficiency and robustness in manipulation requires the robotic brain to integrate a comprehensive set of functions, such as task planning, policy generation, anomaly monitoring and handling, and long-term memory, and to operate efficiently across all of them. Vision-Language Models (VLMs), pretrained on massive multimodal data, have acquired rich world knowledge, exhibiting exceptional scene understanding and multimodal reasoning capabilities. However, existing methods typically focus on realizing only a single function or a subset of functions within the robotic brain, without integrating them into a unified cognitive architecture. Inspired by a divide-and-conquer strategy and the architecture of the human brain, we propose FrankenBot, a VLM-driven, brain-morphic robotic manipulation framework that achieves both comprehensive functionality and high operational efficiency. Our framework includes a suite of components that decouple some key functions from frequent VLM calls, balancing functional completeness against system efficiency. Specifically, we map task planning, policy generation, memory management, and low-level interfacing to the cortex, cerebellum, temporal lobe-hippocampus complex, and brainstem, respectively, and design efficient coordination mechanisms for the modules. We conducted comprehensive experiments in both simulation and real-world robotic environments, demonstrating that our method offers significant advantages in anomaly detection and handling, long-term memory, operational efficiency, and stability, all without requiring any fine-tuning or retraining.
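
The function-to-module mapping described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the Orchestrator class, its fields, and the demo callables are all hypothetical names introduced here, and the planner is a plain function standing in for a VLM call. The sketch only shows one way a single planning call could be kept separate from the per-step policy, execution, and memory work.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Mapping stated in the abstract: function -> brain-inspired module.
MODULE_MAP: Dict[str, str] = {
    "task planning": "cortex",
    "policy generation": "cerebellum",
    "memory management": "temporal lobe-hippocampus complex",
    "low-level interfacing": "brainstem",
}


@dataclass
class Orchestrator:
    """Hypothetical coordinator (an assumption, not the paper's code)."""

    plan: Callable[[str], List[str]]   # cortex role: task -> subtasks (stand-in for a VLM call)
    make_policy: Callable[[str], str]  # cerebellum role: subtask -> motion command
    execute: Callable[[str], bool]     # brainstem role: command -> success flag
    memory: List[str] = field(default_factory=list)  # memory role: episode log
    planner_calls: int = 0             # counts calls to the (expensive) planner

    def run(self, task: str) -> bool:
        # The planner is called once per task; the loop below never calls it.
        self.planner_calls += 1
        for subtask in self.plan(task):
            command = self.make_policy(subtask)
            ok = self.execute(command)
            status = "ok" if ok else "failed"
            self.memory.append(f"{subtask} -> {command}: {status}")
            if not ok:
                return False  # stop at the first failed step
        return True


def demo() -> Orchestrator:
    # Toy callables, invented purely for illustration.
    bot = Orchestrator(
        plan=lambda task: [f"reach {task}", f"grasp {task}"],
        make_policy=lambda subtask: subtask.upper(),
        execute=lambda command: True,
    )
    bot.run("cup")
    return bot
```

Calling demo() runs a two-step toy task with a single planner call and leaves one log entry per step in memory. How the actual framework coordinates its modules or handles anomalies is not specified in the abstract, so none of that is modeled here.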

Related papers

RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation [90.81956345363355]
RoBridge is a hierarchical intelligent architecture for general robotic manipulation. It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM). It unleashes the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution.
arXiv Detail & Related papers (2025-05-03T06:17:18Z)
Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills [31.788094786664324]
Building autonomous robotic agents capable of achieving human-level performance in real-world embodied tasks is an ultimate goal in humanoid robot research. Recent advances have made significant progress in high-level cognition with Foundation Models (FMs) and low-level skill development for humanoid robots. We introduce Being-0, a hierarchical agent framework that integrates an FM with a modular skill library. Being-0 achieves efficient, real-time performance on a full-sized humanoid robot equipped with dexterous hands and active vision.
arXiv Detail & Related papers (2025-03-16T14:53:53Z)
RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete [27.814422322892522]
Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various multimodal contexts. They lack three essential robotic brain capabilities: Planning Capability, Affordance Perception, and Trajectory Prediction. We introduce ShareRobot, a dataset that labels multi-dimensional information such as task planning, object affordance, and end-effector trajectory. We develop RoboBrain, an MLLM-based model that combines robotic and general multi-modal data, utilizing a multi-stage training strategy.
arXiv Detail & Related papers (2025-02-28T17:30:39Z)
Redefining Robot Generalization Through Interactive Intelligence [0.0]
We argue that robot foundation models must evolve to an interactive multi-agent perspective in order to handle the complexities of real-time human-robot co-adaptation. By moving beyond single-agent designs, our position emphasizes how foundation models in robotics can achieve a more robust, personalized, and anticipatory level of performance.
arXiv Detail & Related papers (2025-02-09T17:13:27Z)
Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics [50.191655141020505]
This work advances model-based reinforcement learning by addressing the challenges of long-horizon prediction, error accumulation, and sim-to-real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real-world applications.
arXiv Detail & Related papers (2025-01-17T10:39:09Z)
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis [102.1876259853457]
We propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX. RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units consisting of physical preferences such as affordance and safety constraints. To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training and an iterative self-updating methodology is introduced for supervised fine-tuning.
arXiv Detail & Related papers (2024-02-25T15:31:43Z)
LLM as A Robotic Brain: Unifying Egocentric Memory and Control [77.0899374628474]
Embodied AI focuses on the study and development of intelligent systems that possess a physical or virtual embodiment (i.e., robots). Memory and control are the two essential parts of an embodied system and usually require separate frameworks to model each of them. We propose a novel framework called LLM-Brain: using a large-scale language model as a robotic brain to unify egocentric memory and control.
arXiv Detail & Related papers (2023-04-19T00:08:48Z)
Dexterous Manipulation from Images: Autonomous Real-World RL via Substep Guidance [71.36749876465618]
We describe a system for vision-based dexterous manipulation that provides a "programming-free" approach for users to define new tasks. Our system includes a framework for users to define a final task and intermediate sub-tasks with image examples. We report experimental results with a four-finger robotic hand learning multi-stage object manipulation tasks directly in the real world.
arXiv Detail & Related papers (2022-12-19T22:50:40Z)
Cognitive architecture aided by working-memory for self-supervised multi-modal humans recognition [54.749127627191655]
The ability to recognize human partners is an important social skill for building personalized and long-term human-robot interactions. Deep learning networks have achieved state-of-the-art results and have proven to be suitable tools for this task. One solution is to make robots learn from their first-hand sensory data with self-supervision.
arXiv Detail & Related papers (2021-03-16T13:50:24Z)
Learning compositional models of robot skills for task and motion planning [39.36562555272779]
We learn to use sensorimotor primitives to solve complex long-horizon manipulation problems. We use state-of-the-art methods for active learning and sampling. We evaluate our approach both in simulation and in the real world through measuring the quality of the selected primitive actions.
arXiv Detail & Related papers (2020-06-08T20:45:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.