Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in
Open Worlds
- URL: http://arxiv.org/abs/2310.13255v2
- Date: Thu, 7 Dec 2023 05:36:26 GMT
- Title: Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in
Open Worlds
- Authors: Sipeng Zheng, Jiazheng Liu, Yicheng Feng, Zongqing Lu
- Abstract summary: Large language models (LLMs) can equip embodied agents with the self-driven capability to interact with the world.
However, these LLM-based agents tend to overlook the visual richness of open worlds, rendering the interactive process akin to "a blindfolded text-based game."
We propose Steve-Eye, an end-to-end trained large multimodal model designed to address this limitation.
- Score: 37.22688246779871
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent studies have presented compelling evidence that large language models
(LLMs) can equip embodied agents with the self-driven capability to interact
with the world, which marks an initial step toward versatile robotics. However,
these efforts tend to overlook the visual richness of open worlds, rendering
the entire interactive process akin to "a blindfolded text-based game."
Consequently, LLM-based agents frequently encounter challenges in intuitively
comprehending their surroundings and producing responses that are easy to
understand. In this paper, we propose Steve-Eye, an end-to-end trained large
multimodal model designed to address this limitation. Steve-Eye integrates the
LLM with a visual encoder which enables it to process visual-text inputs and
generate multimodal feedback. In addition, we use a semi-automatic strategy to
collect an extensive dataset comprising 850K open-world instruction pairs,
empowering our model to encompass three essential functions for an agent:
multimodal perception, foundational knowledge base, and skill prediction and
planning. Lastly, we develop three open-world evaluation benchmarks, then carry
out extensive experiments from a wide range of perspectives to validate our
model's capability to strategically act and plan. Codes and datasets will be
released.
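
The architecture described above (a visual encoder integrated with the LLM so the agent can process visual-text inputs and generate multimodal feedback) follows the common pattern of projecting image features into the language model's token space. The sketch below is an illustrative assumption of that wiring in PyTorch, not the authors' released implementation: the toy convolutional encoder, the non-causal transformer backbone, and all dimensions are placeholders.

```python
# Minimal sketch (an assumption, not the Steve-Eye codebase): wire a visual
# encoder into a language backbone so the model consumes image + text tokens.
import torch
import torch.nn as nn

class ToyVisualEncoder(nn.Module):
    """Stand-in for a pretrained image encoder; returns a grid of patch features."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=56, stride=56)  # 224x224 -> 4x4 patches

    def forward(self, images):                       # images: (B, 3, 224, 224)
        feats = self.proj(images)                    # (B, D, 4, 4)
        return feats.flatten(2).transpose(1, 2)      # (B, 16, D)

class ToyMultimodalLM(nn.Module):
    """Prepend projected visual tokens to text-token embeddings, then decode."""
    def __init__(self, vocab_size=32000, embed_dim=256, n_layers=2, n_heads=4):
        super().__init__()
        self.visual_encoder = ToyVisualEncoder(embed_dim)
        self.vis_proj = nn.Linear(embed_dim, embed_dim)     # align vision features to LM space
        self.tok_embed = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # toy, non-causal backbone
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, images, input_ids):
        vis_tokens = self.vis_proj(self.visual_encoder(images))  # (B, 16, D)
        txt_tokens = self.tok_embed(input_ids)                   # (B, T, D)
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)         # image tokens first
        return self.lm_head(self.backbone(seq))                  # per-token vocabulary logits

model = ToyMultimodalLM()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 32000])
```

Prepending the projected visual tokens to the text-token embeddings lets a single language backbone attend over both modalities, which is the usual recipe for end-to-end trained multimodal models of this kind.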
Related papers
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework [47.58359136198136]
We introduce VisionGPT to consolidate and automate the integration of state-of-the-art foundation models.
VisionGPT builds upon a generalized multimodal framework that distinguishes itself through three key features.
This paper outlines the architecture and capabilities of VisionGPT, demonstrating its potential to revolutionize the field of computer vision.
arXiv Detail & Related papers (2024-03-14T01:39:40Z)
- MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
- Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions [11.786387517781328]
Vision-Language Models (VLMs) combine visual and textual inputs to tackle intricate tasks such as image captioning and visual question answering.
Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs, and models that both accept and produce multimodal inputs and outputs.
We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible.
arXiv Detail & Related papers (2024-02-20T18:57:34Z)
- VCoder: Versatile Vision Encoders for Multimodal Large Language Models [46.95488342139727]
Multimodal Large Language Models (MLLM) have recently achieved impressive performance on vision-language tasks.
However, when prompted to identify or count (perceive) the entities in a given image, existing MLLM systems fail.
We propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs.
arXiv Detail & Related papers (2023-12-21T18:49:47Z)
- TouchStone: Evaluating Vision-Language Models by Language Models [91.69776377214814]
We propose an evaluation method that uses strong large language models as judges to comprehensively evaluate the various abilities of LVLMs.
We construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks.
We demonstrate that powerful LLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone (a minimal judge sketch is given after this related-papers list).
arXiv Detail & Related papers (2023-08-31T17:52:04Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present LAMM, a Language-Assisted Multi-Modal instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
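
The TouchStone entry above reports that a text-only judge LLM can score an LVLM's dialogue quality from textual signals alone. The snippet below is a minimal sketch of that judge loop under stated assumptions: the prompt template, the 1-10 scale, and the query_llm callable are hypothetical placeholders for whichever judge model and API are actually used, not TouchStone's released code.

```python
# Minimal sketch (an assumption, not TouchStone's implementation) of LLM-as-judge
# scoring: a text-only LLM grades an answer given a textual image description.
import re
from typing import Callable

JUDGE_TEMPLATE = """You are grading a vision-language assistant.
Image description: {description}
Question: {question}
Assistant answer: {answer}
Rate the answer's correctness and helpfulness from 1 to 10.
Reply with the number only."""

def judge_answer(description: str, question: str, answer: str,
                 query_llm: Callable[[str], str]) -> int:
    """Ask a text-only judge LLM for a 1-10 score; `query_llm` is any
    user-supplied function that sends a prompt to an LLM and returns text."""
    prompt = JUDGE_TEMPLATE.format(description=description,
                                   question=question, answer=answer)
    reply = query_llm(prompt)
    match = re.search(r"\d+", reply)   # tolerate extra words around the number
    return int(match.group()) if match else 0

# Usage with a dummy judge that always replies "7":
score = judge_answer("a creeper next to a crafting table",
                     "What hostile mob is visible?",
                     "A creeper is standing near the table.",
                     lambda prompt: "7")
print(score)  # 7
```

Swapping the lambda for a real LLM client turns this into a working grader; keeping the judge blind to the raw image (it only sees a textual description) is what lets a text-only model do the scoring.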
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.