MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
- URL: http://arxiv.org/abs/2312.07472v4
- Date: Tue, 26 Mar 2024 18:08:05 GMT
- Title: MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
- Authors: Yiran Qin, Enshen Zhou, Qichang Liu, Zhenfei Yin, Lu Sheng, Ruimao Zhang, Yu Qiao, Jing Shao
- Abstract summary: MP5 is an open-ended multimodal embodied system built upon the challenging Minecraft simulator.
It can decompose feasible sub-objectives, design sophisticated situation-aware plans, and perform embodied action control.
- Score: 53.20509532671891
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Designing an embodied system that can solve long-horizon open-world tasks in human-like ways is a long-standing goal. However, existing approaches usually struggle with the compound difficulties caused by the logic-aware decomposition and context-aware execution of these tasks. To this end, we introduce MP5, an open-ended multimodal embodied system built upon the challenging Minecraft simulator, which can decompose feasible sub-objectives, design sophisticated situation-aware plans, and perform embodied action control, with frequent communication with a goal-conditioned active perception scheme. Specifically, MP5 is developed on top of recent advances in Multimodal Large Language Models (MLLMs), and the system is modularized into functional modules that can be scheduled and coordinated to ultimately solve pre-defined context- and process-dependent tasks. Extensive experiments show that MP5 achieves a 22% success rate on difficult process-dependent tasks and a 91% success rate on tasks that heavily depend on context. Moreover, MP5 exhibits a remarkable ability to address many entirely novel open-ended tasks.
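The abstract describes a control loop in which task decomposition, situation-aware planning, and action execution repeatedly consult a goal-conditioned active perception scheme. A minimal Python sketch of that loop follows; every name in it (ActivePerception, EmbodiedSystem, query, decompose, plan_and_act) is a hypothetical illustration, not the paper's actual interface.

```python
# Minimal sketch of the decompose / perceive / plan / act loop from the
# abstract. All names and stub behaviors are hypothetical, not MP5's API.
from dataclasses import dataclass, field

class ActivePerception:
    """Goal-conditioned perception: answers queries about the current scene."""
    def query(self, question: str) -> str:
        # In MP5 this would invoke an MLLM over the agent's observation;
        # here it returns a stub string.
        return f"scene info relevant to: {question}"

@dataclass
class EmbodiedSystem:
    perceiver: ActivePerception = field(default_factory=ActivePerception)

    def decompose(self, task: str) -> list[str]:
        # Stand-in for LLM-driven decomposition into feasible sub-objectives.
        return [f"{task} / step {i}" for i in (1, 2, 3)]

    def plan_and_act(self, sub_objective: str) -> bool:
        # Situation-aware planning: consult perception before each action.
        context = self.perceiver.query(sub_objective)
        print(f"acting on {sub_objective!r} given {context!r}")
        return True  # assume success; a real run would report failures

    def run(self, task: str) -> None:
        for sub in self.decompose(task):
            while not self.plan_and_act(sub):
                pass  # on failure, re-plan with fresh perception

EmbodiedSystem().run("obtain cooked beef")
```

The actual system routes these calls through MLLM-backed modules and the Minecraft simulator; the stubs only show how such modules would be scheduled.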
Related papers
- COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models [49.24666980374751]
COHERENT is a novel LLM-based task planning framework for the collaboration of heterogeneous multi-robot systems.
A Proposal-Execution-Feedback-Adjustment mechanism is designed to decompose and assign actions to individual robots; a toy version of this loop is sketched after this entry.
The experimental results show that our work surpasses previous methods by a large margin in terms of success rate and execution efficiency.
arXiv Detail & Related papers (2024-09-23T15:53:41Z)
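The Proposal-Execution-Feedback-Adjustment mechanism named above can be pictured as a retry loop over LLM-proposed robot-action assignments. A hedged sketch, assuming hypothetical propose and execute callables (COHERENT's real interfaces are not given in the summary):

```python
# Toy Proposal-Execution-Feedback-Adjustment loop; propose/execute stand in
# for COHERENT's LLM planner and robot executors and are not its real API.
from typing import Callable

def pefa(task: str,
         propose: Callable[[str], dict[str, str]],  # task -> {robot: action}
         execute: Callable[[str, str], bool],       # (robot, action) -> success
         max_rounds: int = 3) -> bool:
    for _ in range(max_rounds):
        assignment = propose(task)
        feedback = {r: execute(r, a) for r, a in assignment.items()}
        if all(feedback.values()):
            return True
        # Adjustment: fold failures back into the next proposal round.
        failed = sorted(r for r, ok in feedback.items() if not ok)
        task = f"{task} [retry; failed: {', '.join(failed)}]"
    return False

print(pefa("move the crate",
           propose=lambda t: {"arm": "lift(crate)", "base": "drive(dock)"},
           execute=lambda robot, action: True))
```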
- QuadrupedGPT: Towards a Versatile Quadruped Agent in Open-ended Worlds [51.05639500325598]
QuadrupedGPT is a versatile agent designed to master a broad range of complex tasks with agility comparable to that of a pet.
Our agent processes human commands and environmental contexts using a large multimodal model (LMM).
It is equipped with problem-solving capabilities that enable it to decompose long-term goals into a sequence of executable subgoals.
arXiv Detail & Related papers (2024-06-24T12:14:24Z)
- MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments [82.67236400004826]
We introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.
The MEM module enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities; a hypothetical sketch of this mapping follows this entry.
arXiv Detail & Related papers (2024-02-01T02:43:20Z)
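The MEM-style mapping from a natural-language task to an executable, capability-filtered action sequence might look like the toy sketch below, where a lookup table stands in for MEIA's multimodal model and the capability set is invented for the example:

```python
# Illustrative only: PLANS stands in for MEIA's multimodal model and
# CAPABILITIES for the robot's real skill set; neither is from the paper.
CAPABILITIES = {"navigate", "grasp", "place"}

PLANS = {  # hypothetical decompositions of high-level tasks
    "serve water": ["navigate(kitchen)", "grasp(cup)",
                    "navigate(table)", "place(cup)"],
}

def to_action_plan(task: str) -> list[str]:
    plan = PLANS.get(task.lower(), [])
    # Keep only the actions the robot can actually perform.
    return [a for a in plan if a.split("(")[0] in CAPABILITIES]

print(to_action_plan("Serve water"))
```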
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even solving tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Projected Task-Specific Layers for Multi-Task Reinforcement Learning [0.0]
Multi-task reinforcement learning could enable robots to scale across a wide variety of manipulation tasks in homes and workplaces.
Meeting this challenge by sharing information across tasks depends on how well the structure underlying the tasks is captured.
arXiv Detail & Related papers (2023-09-15T21:42:06Z)
- Chat with the Environment: Interactive Multimodal Perception Using Large Language Models [19.623070762485494]
Large Language Models (LLMs) have shown remarkable reasoning ability in few-shot robotic planning.
Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behavior in a multimodal environment.
arXiv Detail & Related papers (2023-03-14T23:01:27Z)
- Multi-Task Learning with Sequence-Conditioned Transporter Networks [67.57293592529517]
We aim to solve multi-task learning through the lens of sequence-conditioning and weighted sampling.
First, we propose MultiRavens, a new benchmark suite aimed at compositional tasks, which allows defining custom task combinations.
Second, we propose a vision-based end-to-end system architecture, Sequence-Conditioned Transporter Networks, which augments Goal-Conditioned Transporter Networks with sequence-conditioning and weighted sampling; a toy weighted-sampling sketch follows this entry.
arXiv Detail & Related papers (2021-09-15T21:19:11Z)
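Weighted sampling, as used in the Sequence-Conditioned Transporter Networks entry above, simply biases which task a training step draws. A toy sketch with invented task names and weights:

```python
# Toy weighted task sampling for multi-task training; the task names and
# weights are made up for illustration.
import random

TASK_WEIGHTS = {"stacking": 0.5, "sorting": 0.3, "routing": 0.2}

def sample_task(rng: random.Random) -> str:
    tasks, weights = zip(*TASK_WEIGHTS.items())
    return rng.choices(tasks, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_task(rng) for _ in range(5)])
```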
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.