Related papers: Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR

Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR

URL: http://arxiv.org/abs/2512.00294v1
Date: Sat, 29 Nov 2025 03:29:15 GMT
Title: Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR
Authors: Lixing Guo, Tobias Höllerer,
Abstract summary: We present a modular augmented reality (AR) agent system that integrates multimodal large language models (MLLMs) with grounded vision models.<n>Our adaptive task agent coordinates MLLMs and coordinate-aware perception tools to address varying query complexities.<n>The system guides human attention to information-dense areas while supporting human-in-the-loop refinement.
Score: 8.295391485284298
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Traditional augmented reality (AR) systems predominantly rely on fixed class detectors or fiducial markers, limiting their ability to interpret complex, open-vocabulary natural language queries. We present a modular AR agent system that integrates multimodal large language models (MLLMs) with grounded vision models to enable relational reasoning in space and language-conditioned spatial retrieval in physical environments. Our adaptive task agent coordinates MLLMs and coordinate-aware perception tools to address varying query complexities, ranging from simple object identification to multi-object relational reasoning, while returning meter-accurate 3D anchors. It constructs dynamic AR scene graphs encoding nine typed relations (spatial, structural-semantic, causal-functional), enabling MLLMs to understand not just what objects exist, but how they relate and interact in 3D space. Through task-adaptive region-of-interest highlighting and contextual spatial retrieval, the system guides human attention to information-dense areas while supporting human-in-the-loop refinement. The agent dynamically invokes coordinate-aware tools for complex queries-selection, measurement, comparison, and actuation-grounding language understanding in physical operations. The modular architecture supports plug-and-use vision-language models without retraining, establishing AR agents as intermediaries that augment MLLMs with real-world spatial intelligence for interactive scene understanding. We also introduce GroundedAR-Bench, an evaluation framework for language-driven real world localization and relation grounding across diverse environments.

Related papers

OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding [53.33067495235966]
OnlineSI is a framework that can improve its spatial understanding of its surroundings given a video stream.<n>Our core idea is to maintain a finite spatial memory to retain past observations.<n>We further integrate 3D point cloud information with semantic information, helping MLLM to better locate and identify objects in the scene.
arXiv Detail & Related papers (2026-01-23T08:17:57Z)
SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing [57.609801041296095]
Vision-language models (VLMs) are emerging as powerful tools for remote sensing.<n>We enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism.
arXiv Detail & Related papers (2025-12-09T18:15:43Z)
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding [78.99798110890157]
Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries.<n>Existing language field methods struggle to accurately localize instances using spatial relations in language queries.<n>We propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning.
arXiv Detail & Related papers (2025-07-09T10:20:38Z)
Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments.<n>We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context.<n>Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z)
Spatial Reasoner: A 3D Inference Pipeline for XR Applications [0.0]
We present a spatial reasoning framework that bridges geometric facts with symbolic predicates and relations to handle key tasks.<n>Its foundation relies on oriented 3D bounding box representations, enhanced by a comprehensive set of spatial predicates.<n>The derived predicates form a spatial knowledge graph and, in combination with a pipeline-based inference model, enable spatial queries and dynamic rule evaluation.
arXiv Detail & Related papers (2025-04-25T14:27:27Z)
IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction.<n>We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images.<n>We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z)
Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI [10.335943413484815]
seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically "understanding" the physical environment. We introduce a multimodal 3D object representation that unifies both semantic and linguistic knowledge with the geometric representation. We demonstrate the usefulness of the proposed system through two real-world AR applications on Magic Leap 2: a) spatial search in physical environments with natural language and b) an intelligent inventory system that tracks object changes over time.
arXiv Detail & Related papers (2024-10-06T23:25:21Z)
Cognitive Planning for Object Goal Navigation using Generative AI Models [0.979851640406258]
We present a novel framework for solving the object goal navigation problem that generates efficient exploration strategies. Our approach enables a robot to navigate unfamiliar environments by leveraging Large Language Models (LLMs) and Large Vision-Language Models (LVLMs)
arXiv Detail & Related papers (2024-03-30T10:54:59Z)
Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models [55.20626448358655]
This study explores the universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs) Our design includes an HO Prompt-guided Decoder (HOPD), facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image. For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z)
Sim-To-Real Transfer of Visual Grounding for Human-Aided Ambiguity Resolution [0.0]
We consider the task of visual grounding, where the agent segments an object from a crowded scene given a natural language description. Modern holistic approaches to visual grounding usually ignore language structure and struggle to cover generic domains. We introduce a fully decoupled modular framework for compositional visual grounding of entities, attributes, and spatial relations.
arXiv Detail & Related papers (2022-05-24T14:12:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.