How Far Can We Go with Pixels Alone? A Pilot Study on Screen-Only Navigation in Commercial 3D ARPGs
- URL: http://arxiv.org/abs/2602.18981v1
- Date: Sat, 21 Feb 2026 23:15:18 GMT
- Title: How Far Can We Go with Pixels Alone? A Pilot Study on Screen-Only Navigation in Commercial 3D ARPGs
- Authors: Kaijie Xu, Mustafa Bugti, Clark Verbrugge
- Abstract summary: We build on an existing open-source visual affordance detector and instantiate a screen-only exploration and navigation agent. Our agent consumes live game frames, identifies salient interest points, and drives a simple finite-state controller over a minimal action space. Pilot experiments show that the agent can traverse most required segments and exhibits meaningful visual navigation behavior.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern 3D game levels rely heavily on visual guidance, yet the navigability of level layouts remains difficult to quantify. Prior work either simulates play in simplified environments or analyzes static screenshots for visual affordances, but neither setting faithfully captures how players explore complex, real-world game levels. In this paper, we build on an existing open-source visual affordance detector and instantiate a screen-only exploration and navigation agent that operates purely from visual affordances. Our agent consumes live game frames, identifies salient interest points, and drives a simple finite-state controller over a minimal action space to explore Dark Souls-style linear levels and attempt to reach expected goal regions. Pilot experiments show that the agent can traverse most required segments and exhibits meaningful visual navigation behavior, but also highlight that limitations of the underlying visual model prevent truly comprehensive and reliable auto-navigation. We argue that this system provides a concrete, shared baseline and evaluation protocol for visual navigation in complex games, and we call for more attention to this necessary task. Our results suggest that purely vision-based sense-making models, with discrete single-modality inputs and without explicit reasoning, can effectively support navigation and environment understanding in idealized settings, but are unlikely to be a general solution on their own.
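The pipeline described in the abstract (live frames → salient interest points → finite-state controller → minimal action space) can be sketched as below. This is a minimal illustration, not the authors' implementation: the detector stub, state names, thresholds, and action strings are all assumptions for the sake of the example.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    SCAN = auto()      # rotate in place looking for affordances
    APPROACH = auto()  # steer toward the best interest point
    RECOVER = auto()   # lost the target; fall back to scanning


@dataclass
class InterestPoint:
    x: float      # normalized screen x in [0, 1]
    y: float      # normalized screen y in [0, 1]
    score: float  # detector saliency


def detect_interest_points(frame):
    """Stand-in for the visual affordance detector (hypothetical output)."""
    return [InterestPoint(0.6, 0.5, 0.9), InterestPoint(0.2, 0.4, 0.3)]


def step(state, points):
    """One tick of the finite-state controller over a minimal action space."""
    if state is State.SCAN:
        if points:
            return State.APPROACH, "turn_toward"
        return State.SCAN, "rotate"
    if state is State.APPROACH:
        if not points:
            return State.RECOVER, "stop"
        best = max(points, key=lambda p: p.score)
        if abs(best.x - 0.5) > 0.1:          # target off-center: keep turning
            return State.APPROACH, "turn_toward"
        return State.APPROACH, "move_forward"  # centered: walk toward it
    return State.SCAN, "rotate"              # RECOVER: re-scan


state = State.SCAN
actions = []
for _ in range(3):
    points = detect_interest_points(frame=None)  # a real agent would grab a screen frame here
    state, action = step(state, points)
    actions.append(action)
```

Because the stub always returns a well-centered high-score point, the controller settles into `APPROACH` and emits `move_forward` after the first turn; a real run would cycle through `RECOVER` whenever the detector loses its target.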
Related papers
- 3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting [12.057873540714098]
3DGSNav is a novel framework that embeds 3D Gaussian Splatting (3DGS) as persistent memory for vision-language models (VLMs) to enhance spatial reasoning. 3DGSNav incrementally constructs a 3DGS representation of the environment, enabling trajectory-guided free-viewpoint rendering of frontier-aware first-person views. During navigation, a real-time object detector filters potential targets, while VLM-driven active viewpoint switching performs target re-verification.
arXiv Detail & Related papers (2026-02-12T16:41:26Z) - FeudalNav: A Simple Framework for Visual Navigation [7.136542835931238]
We develop a hierarchical framework that decomposes the navigation decision-making process into multiple levels. Our method learns to select subgoals through a simple, transferable waypoint selection network. We show competitive results with a suite of SOTA methods in Habitat AI environments without using any odometry in training or inference.
arXiv Detail & Related papers (2026-01-15T22:10:29Z) - Visuospatial navigation without distance, prediction, integration, or maps [1.3812010983144802]
Navigation is controlled by at least two partially dissociable, concurrently developed systems in the brain. Here we demonstrate the sufficiency of visual response-based decision-making in a classic open field navigation task often assumed to require a cognitive map. Three distinct strategies emerge to robustly navigate to a hidden goal, each conferring contextual tradeoffs, as well as aligning with behavior observed in rodents, insects, fish, and sperm cells.
arXiv Detail & Related papers (2024-07-18T14:07:44Z) - Aligning Knowledge Graph with Visual Perception for Object-goal Navigation [16.32780793344835]
We propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation.
Our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural language description with visual perception.
The integration of a continuous knowledge graph architecture and multimodal feature alignment empowers the navigator with a remarkable zero-shot navigation capability.
arXiv Detail & Related papers (2024-02-29T06:31:18Z) - Learning Navigational Visual Representations with Semantic Map Supervision [85.91625020847358]
We propose a navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps.
Ego$2$-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation.
arXiv Detail & Related papers (2023-07-23T14:01:05Z) - CCPT: Automatic Gameplay Testing and Validation with Curiosity-Conditioned Proximal Trajectories [65.35714948506032]
The Curiosity-Conditioned Proximal Trajectories (CCPT) method combines curiosity and imitation learning to train agents to explore.
We show how CCPT can explore complex environments, discover gameplay issues and design oversights in the process, and recognize and highlight them directly to game designers.
arXiv Detail & Related papers (2022-02-21T09:08:33Z) - Augmented reality navigation system for visual prosthesis [67.09251544230744]
We propose an augmented reality navigation system for visual prosthesis that incorporates a software of reactive navigation and path planning.
It consists of four steps: locating the subject on a map, planning the subject's trajectory, showing the trajectory to the subject, and re-planning to avoid obstacles.
Results show that our augmented navigation system improves navigation performance by reducing the time and distance needed to reach goals, and significantly reduces the number of obstacle collisions.
arXiv Detail & Related papers (2021-09-30T09:41:40Z) - Deep Learning for Embodied Vision Navigation: A Survey [108.13766213265069]
The "embodied visual navigation" problem requires an agent to navigate in a 3D environment, relying mainly on its first-person observations.
This paper attempts to establish an outline of the current works in the field of embodied visual navigation by providing a comprehensive literature survey.
arXiv Detail & Related papers (2021-07-07T12:09:04Z) - Pathdreamer: A World Model for Indoor Navigation [62.78410447776939]
We introduce Pathdreamer, a visual world model for agents navigating in novel indoor environments.
Given one or more previous visual observations, Pathdreamer generates plausible high-resolution 360 visual observations.
In regions of high uncertainty, Pathdreamer can predict diverse scenes, allowing an agent to sample multiple realistic outcomes.
arXiv Detail & Related papers (2021-05-18T18:13:53Z) - Diagnosing Vision-and-Language Navigation: What Really Matters [61.72935815656582]
Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments.
Recent studies have observed a slow-down in performance improvements on both indoor and outdoor VLN tasks.
In this work, we conduct a series of diagnostic experiments to unveil agents' focus during navigation.
arXiv Detail & Related papers (2021-03-30T17:59:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.