Related papers: LEGENT: Open Platform for Embodied Agents

LEGENT: Open Platform for Embodied Agents

URL: http://arxiv.org/abs/2404.18243v2
Date: Sun, 11 Aug 2024 17:18:30 GMT
Title: LEGENT: Open Platform for Embodied Agents
Authors: Zhili Cheng, Zhitong Wang, Jinyi Hu, Shengding Hu, An Liu, Yuge Tu, Pengkai Li, Lei Shi, Zhiyuan Liu, Maosong Sun,
Abstract summary: We introduce LEGENT, an open, scalable platform for developing embodied agents using Large Language Models (LLMs) and Large Multimodal Models (LMMs) LEGENT offers a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface. In experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks.
Score: 60.71847900126832
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), their integration into language-grounded, human-like embodied agents remains incomplete, hindering complex real-life task performance in physical environments. Existing integrations often feature limited open sourcing, challenging collective progress in this field. We introduce LEGENT, an open, scalable platform for developing embodied agents using LLMs and LMMs. LEGENT offers a dual approach: a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface, and a sophisticated data generation pipeline utilizing advanced algorithms to exploit supervision from simulated worlds at scale. In our experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks, showcasing promising generalization capabilities.

Related papers

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models [139.19991097260115]
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
arXiv Detail & Related papers (2025-04-14T17:59:25Z)
Cooperative Multi-Agent Planning with Adaptive Skill Synthesis [16.228784877899976]
Multi-agent systems with reinforcement learning face challenges in sample efficiency, interpretability, and transferability. We present a novel multi-agent architecture that integrates vision-language models (VLMs) with a dynamic skill library and structured communication for decentralized closed-loop decision-making.
arXiv Detail & Related papers (2025-02-14T13:23:18Z)
Large Action Models: From Inception to Implementation [51.81485642442344]
Large Action Models (LAMs) are designed for action generation and execution within dynamic environments. LAMs hold the potential to transform AI from passive language understanding to active task completion. We present a comprehensive framework for developing LAMs, offering a systematic approach to their creation, from inception to deployment.
arXiv Detail & Related papers (2024-12-13T11:19:56Z)
MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation [52.739500459903724]
Large Language Models (LLMs) have demonstrated remarkable planning abilities across various domains, including robotics manipulation and navigation. We propose a novel multi-agent LLM framework that distributes high-level planning and low-level control code generation across specialized LLM agents. We evaluate our approach on nine RLBench tasks, including long-horizon tasks, and demonstrate its ability to solve robotics manipulation in a zero-shot setting.
arXiv Detail & Related papers (2024-11-26T17:53:44Z)
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents [50.12414817737912]
Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents. Existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs in complex, real-world environments. VisualAgentBench (VAB) is a pioneering benchmark specifically designed to train and evaluate LMMs as visual foundation agents.
arXiv Detail & Related papers (2024-08-12T17:44:17Z)
Autonomous Workflow for Multimodal Fine-Grained Training Assistants Towards Mixed Reality [28.27036270001756]
This work designs an autonomous workflow tailored for integrating AI agents seamlessly into extended reality (XR) applications for fine-grained training. We present a demonstration of a multimodal fine-grained training assistant for LEGO brick assembly in a pilot XR environment.
arXiv Detail & Related papers (2024-05-16T14:20:30Z)
Agent AI: Surveying the Horizons of Multimodal Interaction [83.18367129924997]
"Agent AI" is a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data. We envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
arXiv Detail & Related papers (2024-01-07T19:11:18Z)
Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds [37.22688246779871]
Large language models (LLMs) can equip embodied agents with the self-driven capability to interact with the world. LLMs tend to overlook the visual richness of open worlds, rendering the entire interactive process akin to "a blindfolded text-based game" We propose Steve-Eye, an end-to-end trained large multimodal model designed to address this limitation.
arXiv Detail & Related papers (2023-10-20T03:22:05Z)
NExT-GPT: Any-to-Any Multimodal LLM [75.5656492989924]
We present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. We introduce a modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation.
arXiv Detail & Related papers (2023-09-11T15:02:25Z)
Gentopia: A Collaborative Platform for Tool-Augmented LLMs [21.09079715807735]
We present gentopia, an Augmented Language Models (ALMs) framework enabling flexible customization of agents through simple configurations. We also establish gentpool, a public platform enabling the registration and sharing of user-customized agents. gentbench, an integral component of gentpool, is designed to thoroughly evaluate user-customized agents across diverse aspects such as safety, robustness, efficiency, etc.
arXiv Detail & Related papers (2023-08-08T04:12:29Z)
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark. Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs. We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.