Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR
- URL: http://arxiv.org/abs/2512.00294v1
- Date: Sat, 29 Nov 2025 03:29:15 GMT
- Title: Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR
- Authors: Lixing Guo, Tobias Höllerer
- Abstract summary: We present a modular augmented reality (AR) agent system that integrates multimodal large language models (MLLMs) with grounded vision models. Our adaptive task agent coordinates MLLMs and coordinate-aware perception tools to address varying query complexities. The system guides human attention to information-dense areas while supporting human-in-the-loop refinement.
- Score: 8.295391485284298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional augmented reality (AR) systems predominantly rely on fixed class detectors or fiducial markers, limiting their ability to interpret complex, open-vocabulary natural language queries. We present a modular AR agent system that integrates multimodal large language models (MLLMs) with grounded vision models to enable relational reasoning in space and language-conditioned spatial retrieval in physical environments. Our adaptive task agent coordinates MLLMs and coordinate-aware perception tools to address varying query complexities, ranging from simple object identification to multi-object relational reasoning, while returning meter-accurate 3D anchors. It constructs dynamic AR scene graphs encoding nine typed relations (spatial, structural-semantic, causal-functional), enabling MLLMs to understand not just what objects exist, but how they relate and interact in 3D space. Through task-adaptive region-of-interest highlighting and contextual spatial retrieval, the system guides human attention to information-dense areas while supporting human-in-the-loop refinement. For complex queries, the agent dynamically invokes coordinate-aware tools (selection, measurement, comparison, and actuation), grounding language understanding in physical operations. The modular architecture supports plug-and-use vision-language models without retraining, establishing AR agents as intermediaries that augment MLLMs with real-world spatial intelligence for interactive scene understanding. We also introduce GroundedAR-Bench, an evaluation framework for language-driven real-world localization and relation grounding across diverse environments.
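The scene-graph idea in the abstract can be illustrated with a minimal Python sketch (Python 3.9+). This is not the authors' implementation: the class names, the `retrieve` method, the example predicates ("on_top_of", "illuminates"), and the coordinates are all hypothetical. Only the three relation categories and the notion of 3D anchors in meters come from the abstract, which does not list the nine relation types themselves.

```python
from dataclasses import dataclass, field
from enum import Enum


class RelationCategory(Enum):
    # The three categories named in the abstract.
    SPATIAL = "spatial"
    STRUCTURAL_SEMANTIC = "structural-semantic"
    CAUSAL_FUNCTIONAL = "causal-functional"


@dataclass(frozen=True)
class Anchor:
    # 3D position in meters; the world frame is an assumption.
    x: float
    y: float
    z: float


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str  # hypothetical label; the paper's nine types are not listed here
    obj: str
    category: RelationCategory


@dataclass
class SceneGraph:
    anchors: dict[str, Anchor] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)

    def add_object(self, name: str, anchor: Anchor) -> None:
        self.anchors[name] = anchor

    def relate(self, subject: str, predicate: str, obj: str,
               category: RelationCategory) -> None:
        # Both endpoints must already be anchored in the scene.
        for name in (subject, obj):
            if name not in self.anchors:
                raise KeyError(f"unknown object: {name}")
        self.relations.append(Relation(subject, predicate, obj, category))

    def retrieve(self, predicate: str, obj: str) -> list[tuple[str, Anchor]]:
        # Return every subject that stands in `predicate` to `obj`,
        # together with its 3D anchor.
        return [
            (r.subject, self.anchors[r.subject])
            for r in self.relations
            if r.predicate == predicate and r.obj == obj
        ]


graph = SceneGraph()
graph.add_object("desk", Anchor(0.5, 0.70, -1.2))
graph.add_object("mug", Anchor(0.4, 0.75, -1.2))
graph.add_object("lamp", Anchor(0.8, 1.10, -1.3))
graph.relate("mug", "on_top_of", "desk", RelationCategory.SPATIAL)
graph.relate("lamp", "illuminates", "desk", RelationCategory.CAUSAL_FUNCTIONAL)

# A query such as "what is on the desk?" resolves to a name plus a 3D anchor.
print(graph.retrieve("on_top_of", "desk"))
```

Storing relations as typed triples keeps the graph easy to serialize into a prompt for an MLLM, which is one plausible reason to prefer it over a dense adjacency structure in a sketch like this.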
Related papers
- OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding [53.33067495235966]
OnlineSI is a framework that can improve its spatial understanding of its surroundings given a video stream. Our core idea is to maintain a finite spatial memory to retain past observations. We further integrate 3D point cloud information with semantic information, helping MLLM to better locate and identify objects in the scene.
arXiv Detail & Related papers (2026-01-23T08:17:57Z)
- SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing [57.609801041296095]
Vision-language models (VLMs) are emerging as powerful tools for remote sensing. We enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism.
arXiv Detail & Related papers (2025-12-09T18:15:43Z)
- A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding [78.99798110890157]
Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries. Existing language field methods struggle to accurately localize instances using spatial relations in language queries. We propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning.
arXiv Detail & Related papers (2025-07-09T10:20:38Z)
- Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments. We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z)
- Spatial Reasoner: A 3D Inference Pipeline for XR Applications [0.0]
We present a spatial reasoning framework that bridges geometric facts with symbolic predicates and relations to handle key tasks. Its foundation relies on oriented 3D bounding box representations, enhanced by a comprehensive set of spatial predicates. The derived predicates form a spatial knowledge graph and, in combination with a pipeline-based inference model, enable spatial queries and dynamic rule evaluation.
arXiv Detail & Related papers (2025-04-25T14:27:27Z)
- IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z)
- Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI [10.335943413484815]
Seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically "understanding" the physical environment.
We introduce a multimodal 3D object representation that unifies both semantic and linguistic knowledge with the geometric representation.
We demonstrate the usefulness of the proposed system through two real-world AR applications on Magic Leap 2: a) spatial search in physical environments with natural language and b) an intelligent inventory system that tracks object changes over time.
arXiv Detail & Related papers (2024-10-06T23:25:21Z)
- Cognitive Planning for Object Goal Navigation using Generative AI Models [0.979851640406258]
We present a novel framework for solving the object goal navigation problem that generates efficient exploration strategies.
Our approach enables a robot to navigate unfamiliar environments by leveraging Large Language Models (LLMs) and Large Vision-Language Models (LVLMs).
arXiv Detail & Related papers (2024-03-30T10:54:59Z)
- Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models [55.20626448358655]
This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs).
Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image.
For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z)
- Sim-To-Real Transfer of Visual Grounding for Human-Aided Ambiguity Resolution [0.0]
We consider the task of visual grounding, where the agent segments an object from a crowded scene given a natural language description.
Modern holistic approaches to visual grounding usually ignore language structure and struggle to cover generic domains.
We introduce a fully decoupled modular framework for compositional visual grounding of entities, attributes, and spatial relations.
arXiv Detail & Related papers (2022-05-24T14:12:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.