Related papers: VL-KnG: Visual Scene Understanding for Navigation Goal Identification using Spatiotemporal Knowledge Graphs

URL: http://arxiv.org/abs/2510.01483v1
Date: Wed, 01 Oct 2025 21:53:44 GMT
Title: VL-KnG: Visual Scene Understanding for Navigation Goal Identification using Spatiotemporal Knowledge Graphs
Authors: Mohamad Al Mdfaa, Svetlana Lukina, Timur Akhtyamov, Arthur Nigmatzyanov, Dmitrii Nalberskii, Sergey Zagoruyko, Gonzalo Ferrer
Abstract summary: We present a Visual Scene Understanding system that uses spatiotemporal knowledge graph construction and efficient query processing for navigation goal identification. We also introduce WalkieKnowledge, a new benchmark with about 200 manually annotated questions across 8 diverse trajectories spanning approximately 100 minutes of video data.
Score: 2.779512031764865
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language models (VLMs) have shown potential for robot navigation but encounter fundamental limitations: they lack persistent scene memory, offer limited spatial reasoning, and do not scale effectively with video duration for real-time application. We present VL-KnG, a Visual Scene Understanding system that tackles these challenges using spatiotemporal knowledge graph construction and computationally efficient query processing for navigation goal identification. Our approach processes video sequences in chunks utilizing modern VLMs, creates persistent knowledge graphs that maintain object identity over time, and enables explainable spatial reasoning through queryable graph structures. We also introduce WalkieKnowledge, a new benchmark with about 200 manually annotated questions across 8 diverse trajectories spanning approximately 100 minutes of video data, enabling fair comparison between structured approaches and general-purpose VLMs. Real-world deployment on a differential drive robot demonstrates practical applicability, with our method achieving a 77.27% success rate and 76.92% answer accuracy, matching Gemini 2.5 Pro performance while providing explainable reasoning supported by the knowledge graph and the computational efficiency needed for real-time deployment across tasks such as localization, navigation, and planning. Code and dataset will be released after acceptance.
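Illustration: the abstract describes chunk-wise processing, persistent object identity, and a queryable graph, but gives no implementation details. The following is a minimal sketch of that general idea only, not the authors' method. Every name in it (ObjectNode, SceneGraph, add_chunk, find_goal) is hypothetical, the per-chunk observations are hand-written stand-ins for VLM output, and matching objects by a (label, attribute) pair is an assumed simplification of identity tracking.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """One persistent object in the graph (hypothetical schema)."""
    object_id: int
    label: str
    first_seen: int  # index of the first chunk containing the object
    last_seen: int   # index of the most recent chunk containing it


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)    # object_id -> ObjectNode
    edges: set = field(default_factory=set)      # (subject_id, relation, object_id)
    _by_key: dict = field(default_factory=dict)  # (label, attribute) -> object_id

    def _resolve(self, label, attribute, chunk_idx):
        """Reuse the id of a previously seen object, or create a new node."""
        key = (label, attribute)
        if key in self._by_key:
            node = self.nodes[self._by_key[key]]
            node.last_seen = chunk_idx
            return node.object_id
        object_id = len(self.nodes)
        self.nodes[object_id] = ObjectNode(object_id, label, chunk_idx, chunk_idx)
        self._by_key[key] = object_id
        return object_id

    def add_chunk(self, chunk_idx, observations):
        """Merge one chunk of (subject, relation, object) observations.

        Subject and object are (label, attribute) pairs; in a real system
        they would come from a VLM, here they are supplied by hand.
        """
        for subj, relation, obj in observations:
            s = self._resolve(*subj, chunk_idx)
            o = self._resolve(*obj, chunk_idx)
            self.edges.add((s, relation, o))

    def find_goal(self, label, relation, anchor_label):
        """Ids of `label` objects standing in `relation` to an `anchor_label` object."""
        return sorted(
            s for s, r, o in self.edges
            if r == relation
            and self.nodes[s].label == label
            and self.nodes[o].label == anchor_label
        )


graph = SceneGraph()
graph.add_chunk(0, [(("chair", "red"), "next_to", ("table", "wooden"))])
graph.add_chunk(1, [(("chair", "red"), "in_front_of", ("door", "glass"))])
goal_ids = graph.find_goal("chair", "in_front_of", "door")
```

In this toy run the red chair keeps a single id across both chunks, so a query phrased in terms of the door can still be traced back to the chunk where the chair was first observed, which is the kind of explainable lookup a graph-backed memory allows.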

Related papers

Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos [15.251897505310682]
We propose an event-centric knowledge enhancement strategy for automated process knowledge mining and feature fusion. We extract unstructured video streams into structured semantic-action-effect events to serve as explicit episodic memory. Experiments on REVERIE, R2R, and R2R-CE benchmarks demonstrate the efficiency of our strategy.
arXiv Detail & Related papers (2026-02-27T11:38:06Z)
AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognization and Retrieval [25.233263762328836]
We introduce a spatially-aware dataset AirSpatial, which comprises over 206K instructions. It is the first remote sensing grounding dataset to provide 3DBB. We develop an aerial agent, AirSpatialBot, which is capable of fine-grained vehicle attribute recognition and retrieval.
arXiv Detail & Related papers (2026-01-04T07:38:51Z)
SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding [64.86119288520419]
Multimodal language models struggle with spatial reasoning across time and space. We present SIMS-V, a systematic data-generation framework that leverages the privileged information of 3D simulators. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
arXiv Detail & Related papers (2025-11-06T18:53:31Z)
SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding [48.64661382961745]
We introduce Spatio-temporal Video Action Grounding (SVAG), a novel task that requires models to simultaneously detect, track, and temporally localize all referent objects in videos. To support this task, we construct SVAG-Bench, a large-scale benchmark comprising 688 videos, 19,590 annotated records, and 903 unique verbs. Empirical results show that existing models perform poorly on SVAG, particularly in dense or complex scenes.
arXiv Detail & Related papers (2025-10-14T22:10:49Z)
Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents [60.095739427926524]
Long videos, characterized by temporal and sparse task-relevant information, pose significant reasoning challenges for AI systems. Inspired by human progressive visual cognition, we propose CogniGPT for efficient and reliable long video understanding.
arXiv Detail & Related papers (2025-09-29T15:42:55Z)
Think With Videos For Agentic Long-Video Understanding [117.68219930263153]
Long-video understanding is a challenging problem in computer vision. We propose VideoExplorer, a framework grounded in the principle of "thinking with video". Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding.
arXiv Detail & Related papers (2025-06-12T15:39:10Z)
Understanding Long Videos via LLM-Powered Entity Relation Graphs [51.13422967711056]
GraphVideoAgent is a framework that maps and monitors the evolving relationships between visual entities throughout the video sequence. Our approach demonstrates remarkable effectiveness when tested against industry benchmarks.
arXiv Detail & Related papers (2025-01-27T10:57:24Z)
Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress. Without any robot or task specific training, GVL can in-context zero-shot and few-shot predict effective values for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z)
Aligning Knowledge Graph with Visual Perception for Object-goal Navigation [16.32780793344835]
We propose the Aligning Knowledge Graph with Visual Perception (AKGVP) method for object-goal navigation. Our approach introduces continuous modeling of the hierarchical scene architecture and leverages visual-language pre-training to align natural language description with visual perception. The integration of a continuous knowledge graph architecture and multimodal feature alignment empowers the navigator with a remarkable zero-shot navigation capability.
arXiv Detail & Related papers (2024-02-29T06:31:18Z)
Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video. We propose Latent Time Navigation (LTN) to capture fine-grained motions. Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z)
Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation. CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from human body. It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
Perception Framework through Real-Time Semantic Segmentation and Scene Recognition on a Wearable System for the Visually Impaired [27.04316520914628]
We present a multi-task efficient perception system for the scene parsing and recognition tasks. This system runs on a wearable belt with an Intel RealSense LiDAR camera and an Nvidia Jetson AGX Xavier processor.
arXiv Detail & Related papers (2021-03-06T15:07:17Z)
Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a crucial architecture for vision-language navigation (VLN). It is compartmentalized enough to accurately memorize the percepts during navigation. It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.