VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems
- URL: http://arxiv.org/abs/2507.00079v1
- Date: Sun, 29 Jun 2025 14:16:11 GMT
- Title: VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems
- Authors: Ethan Smyth, Alessandro Suglia
- Abstract summary: This paper proposes VoyagerVision, a model capable of creating structures within Minecraft using screenshots as a form of visual feedback. VoyagerVision was successful in half of all attempts in flat worlds, with most failures arising in more complex structures.
- Score: 50.97354139604596
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-endedness is an active field of research in the pursuit of capable Artificial General Intelligence (AGI), allowing models to pursue tasks of their own choosing. Simultaneously, recent advancements in Large Language Models (LLMs) such as GPT-4o [9] have made such models capable of interpreting image inputs. Implementations such as OMNI-EPIC [4] have made use of such features, providing an LLM with pixel data of an agent's POV so it can parse the environment and solve tasks. This paper proposes that providing these visual inputs to a model gives it a greater ability to interpret spatial environments and can therefore increase the number of tasks it can successfully perform, extending its open-ended potential. To this end, this paper proposes VoyagerVision -- a multi-modal model capable of creating structures within Minecraft using screenshots as a form of visual feedback, building on the foundation of Voyager. VoyagerVision created an average of 2.75 unique structures within fifty iterations of the system; since Voyager was incapable of this, VoyagerVision extends it in an entirely new direction. Additionally, in a set of building unit tests, VoyagerVision was successful in half of all attempts in flat worlds, with most failures arising in more complex structures. Project website is available at https://esmyth-dev.github.io/VoyagerVision.github.io/
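The screenshot-as-feedback idea described in the abstract can be sketched as a simple iterate-until-accepted loop: the agent executes generated build code, the environment returns a screenshot, and a vision-capable critic judges whether the structure matches the goal. The sketch below is a hedged illustration only; `MockEnv`, `vision_critic`, and `build_with_visual_feedback` are hypothetical names, not the actual VoyagerVision API, and the environment is a stand-in rather than a real Minecraft instance.

```python
from dataclasses import dataclass

@dataclass
class MockEnv:
    """Stand-in for Minecraft: 'executes' build code and returns a 'screenshot'."""
    blocks_placed: int = 0

    def execute(self, build_code: str) -> dict:
        # Pretend each iteration places one more block correctly.
        self.blocks_placed += 1
        return {"screenshot": f"<pixels: {self.blocks_placed} blocks>"}

def vision_critic(feedback: dict, target_blocks: int) -> bool:
    """Stand-in for the multimodal LLM judging the screenshot against the goal."""
    return f"{target_blocks} blocks" in feedback["screenshot"]

def build_with_visual_feedback(env: MockEnv, target_blocks: int, max_iters: int = 50):
    """Run the agent loop: act, observe a screenshot, retry until accepted."""
    for i in range(1, max_iters + 1):
        feedback = env.execute("place_block(...)")  # the agent's generated code
        if vision_critic(feedback, target_blocks):
            return i  # iterations needed to succeed
    return None  # failed within the iteration budget (cf. the fifty-iteration runs)
```

Under these mock dynamics, `build_with_visual_feedback(MockEnv(), target_blocks=3)` succeeds on the third iteration, while a goal beyond the budget returns `None`; the real system replaces both stand-ins with Minecraft execution and a vision LLM.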
Related papers
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments yield strong performance, demonstrating that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z) - InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks.
It has powerful visual capabilities and can be a good alternative to the ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z) - SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model [59.04877271899894]
This paper explores adapting the zero-shot ability of SAM to 3D object detection.
We propose a SAM-powered BEV processing pipeline to detect objects and get promising results on the large-scale open dataset.
arXiv Detail & Related papers (2023-06-04T03:09:21Z) - Voyager: An Open-Ended Embodied Agent with Large Language Models [103.76509266014165]
Voyager is the first embodied lifelong learning agent in Minecraft.
It continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention.
Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch.
arXiv Detail & Related papers (2023-05-25T17:46:38Z) - VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [81.32968995346775]
VisionLLM is a framework for vision-centric tasks that can be flexibly defined and managed using language instructions.
Our model can achieve over 60% mAP on COCO, on par with detection-specific models.
arXiv Detail & Related papers (2023-05-18T17:59:42Z) - Out of the Box: Embodied Navigation in the Real World [45.97756658635314]
We show how to transfer knowledge acquired in simulation into the real world.
We deploy our models on a LoCoBot equipped with a single Intel RealSense camera.
Our experiments indicate that it is possible to achieve satisfactory results when deploying the obtained model in the real world.
arXiv Detail & Related papers (2021-05-12T18:00:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.