Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos
- URL: http://arxiv.org/abs/2602.23937v1
- Date: Fri, 27 Feb 2026 11:38:06 GMT
- Title: Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos
- Authors: Haoxuan Xu, Tianfu Li, Wenbo Chen, Yi Liu, Xingxing Zuo, Yaoxian Song, Haoang Li,
- Abstract summary: We propose an event-centric knowledge enhancement strategy for automated process knowledge mining and feature fusion. We distill unstructured video streams into structured semantic-action-effect events that serve as explicit episodic memory. Experiments on the REVERIE, R2R, and R2R-CE benchmarks demonstrate the effectiveness of our strategy.
- Score: 15.251897505310682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Navigation (VLN) agents often struggle with long-horizon reasoning in unseen environments, particularly when facing ambiguous, coarse-grained instructions. While recent advances use knowledge graphs to enhance reasoning, the potential of multimodal event knowledge inspired by human episodic memory remains underexplored. In this work, we propose an event-centric knowledge enhancement strategy for automated process knowledge mining and feature fusion to address coarse-grained instructions and long-horizon reasoning in the VLN task. First, we construct YE-KG, the first large-scale multimodal spatiotemporal knowledge graph, with over 86k nodes and 83k edges, derived from real-world indoor videos. By leveraging multimodal large language models (i.e., LLaVA, GPT-4), we distill unstructured video streams into structured semantic-action-effect events that serve as explicit episodic memory. Second, we introduce STE-VLN, which integrates this graph into VLN models via a Coarse-to-Fine Hierarchical Retrieval mechanism. This allows agents to retrieve causal event sequences and dynamically fuse them with egocentric visual observations. Experiments on the REVERIE, R2R, and R2R-CE benchmarks demonstrate the effectiveness of our event-centric strategy, which outperforms state-of-the-art approaches across diverse action spaces. Our data and code are available on the project website: https://sites.google.com/view/y-event-kg/.
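The abstract outlines two mechanisms worth making concrete: events stored as semantic-action-effect triples, and a two-stage coarse-to-fine retrieval that matches them against an instruction and the current observation. The Python sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the `Event` schema, the `coarse_to_fine_retrieve` function, and plain cosine-similarity ranking over precomputed embeddings are hypothetical stand-ins for whatever learned retriever STE-VLN actually uses.
```python
# Minimal sketch (NOT the authors' code): semantic-action-effect events as
# episodic memory, plus a two-stage coarse-to-fine retrieval over them.
# The schema, function names, and cosine ranking are illustrative assumptions.
from dataclasses import dataclass

import numpy as np


@dataclass
class Event:
    """One episodic-memory entry mined from an indoor tour video."""
    semantic: str          # scene context, e.g. "hallway with a red door"
    action: str            # action taken, e.g. "turn left"
    effect: str            # resulting state, e.g. "now facing the kitchen"
    embedding: np.ndarray  # precomputed text embedding of the event


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def coarse_to_fine_retrieve(
    instruction_emb: np.ndarray,
    observation_emb: np.ndarray,
    events: list[Event],
    coarse_k: int = 50,
    fine_k: int = 5,
) -> list[Event]:
    """Stage 1 (coarse): shortlist events by similarity to the instruction.
    Stage 2 (fine): re-rank the shortlist against the current egocentric
    observation and return the top candidates for feature fusion."""
    coarse = sorted(events, key=lambda e: cosine(instruction_emb, e.embedding),
                    reverse=True)[:coarse_k]
    return sorted(coarse, key=lambda e: cosine(observation_emb, e.embedding),
                  reverse=True)[:fine_k]


# Toy usage with random embeddings standing in for a real text/vision encoder.
rng = np.random.default_rng(0)
memory = [Event("hallway, red door", "turn left", "facing kitchen",
                rng.normal(size=64)) for _ in range(200)]
top_events = coarse_to_fine_retrieve(rng.normal(size=64),
                                     rng.normal(size=64), memory)
```
In the paper, the retrieved causal event sequences are then fused with egocentric visual features inside the VLN policy; the ranking above only stands in for that learned retrieval step.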
Related papers
- TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to an inherent architectural mismatch. We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structure into the VLM backbone. To enrich topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. With the embedded topological graph, the model is capable of global action reasoning, enabling robust path correction.
arXiv Detail & Related papers (2026-03-03T13:28:07Z) - Event Extraction in Large Language Model [99.94321497574805]
We argue that EE should be viewed as a system component that provides a cognitive scaffold for LLM-centered solutions. This survey covers EE in text and multimodal settings, organizing tasks and taxonomy, and tracing the evolution of methods from rule-based and neural models to instruction-driven and generative frameworks.
arXiv Detail & Related papers (2025-12-22T16:22:14Z) - Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval [53.54695034420311]
In practice, videos are typically untrimmed, long in duration, and contain far more complicated background content. We propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model. Experimental results demonstrate that our proposed model achieves state-of-the-art performance on the TVR, ActivityNet, and Charades-STA datasets.
arXiv Detail & Related papers (2025-10-14T08:38:20Z) - VL-KnG: Visual Scene Understanding for Navigation Goal Identification using Spatiotemporal Knowledge Graphs [2.779512031764865]
We present a Visual Scene Understanding system that tackles navigation goal identification using knowledge graph construction and efficient query processing (a generic sketch of this kind of spatiotemporal-graph query appears after this list). We also introduce WalkieKnowledge, a new benchmark with about 200 manually annotated questions across 8 diverse trajectories spanning approximately 100 minutes of video data.
arXiv Detail & Related papers (2025-10-01T21:53:44Z) - VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning [77.34267241692706]
Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLMs) to directly translate egocentric video streams into continuous navigation actions.
arXiv Detail & Related papers (2025-06-20T17:59:59Z) - EventVL: Understand Event Streams via Multimodal Large Language Model [29.23525787969373]
We propose EventVL, the first generative event-based MLLM framework for explicit semantic understanding. Specifically, to bridge the data gap in connecting the semantics of different modalities, we first annotate a large event-image/video-text dataset. To further promote a compact semantic space, Dynamic Semantic Alignment is introduced to improve and complete the sparse semantic space of events.
arXiv Detail & Related papers (2025-01-23T14:37:21Z) - DivScene: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes [76.24687327731031]
We first study the challenge of open-vocabulary object navigation by introducing DivScene. Our dataset provides a much greater diversity of target objects and scene types than existing datasets. We fine-tune LVLMs to predict the next action with CoT explanations.
arXiv Detail & Related papers (2024-10-03T17:49:28Z) - EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning [4.754556073011081]
Visual Commonsense Reasoning (VCR) is a cognitive task, challenging models to answer visual questions requiring human commonsense.
We propose EventLens, which leverages Event-Aware Pretraining and Cross-modal Linking to enhance VCR.
arXiv Detail & Related papers (2024-04-22T03:05:32Z) - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes. Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes. We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z) - Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs [112.39389727164594]
Text-to-video (T2V) synthesis has gained increasing attention in the community, where recently emerged diffusion models (DMs) have shown promisingly stronger performance than past approaches.
While existing state-of-the-art DMs can achieve high-resolution video generation, they may suffer from key limitations (e.g., action occurrence disorders, crude video motions) in temporal dynamics modeling, one of the cruxes of video synthesis.
In this work, we investigate strengthening the awareness of video dynamics in DMs for high-quality T2V generation.
arXiv Detail & Related papers (2023-08-26T08:31:48Z) - Action Recognition with Multi-stream Motion Modeling and Mutual Information Maximization [44.73161606369333]
Action recognition is a fundamental and intriguing problem in artificial intelligence.
We introduce a novel Stream-GCN network equipped with multi-stream components and channel attention.
Our approach sets the new state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2023-06-13T06:56:09Z) - History Aware Multimodal Transformer for Vision-and-Language Navigation [96.80655332881432]
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes.
We introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making.
arXiv Detail & Related papers (2021-10-25T22:54:41Z) - Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a crucial architecture for vision-language navigation (VLN): Structured Scene Memory (SSM).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)
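Several entries above (VL-KnG, DoraemonGPT) revolve around the same idea as the main paper: building a queryable symbolic or spatiotemporal memory from video. As a generic illustration, and not code from any of the papers listed, the sketch below stores timestamped relations in a networkx graph and answers a "where is X?" query; the edge schema and the `where_is` helper are assumptions.
```python
# Hedged sketch (not from any paper above): a toy spatiotemporal knowledge
# graph queried for a navigation goal. Edge schema is an assumption.
import networkx as nx

G = nx.MultiDiGraph()
# Nodes are places/objects observed in a tour video; edges carry the
# relation and the timestamp (seconds into the tour) of the observation.
G.add_edge("hallway", "kitchen", relation="leads_to", t=12.0)
G.add_edge("kitchen", "fridge", relation="contains", t=15.5)
G.add_edge("hallway", "bedroom", relation="leads_to", t=40.0)


def where_is(graph: nx.MultiDiGraph, target: str) -> list[str]:
    """Return places linked to `target` by a 'contains' edge,
    most recently observed first."""
    hits = [(u, d["t"]) for u, v, d in graph.edges(data=True)
            if v == target and d["relation"] == "contains"]
    return [u for u, _ in sorted(hits, key=lambda x: -x[1])]


print(where_is(G, "fridge"))  # -> ['kitchen']
```
A real system would populate such a graph from model-extracted events rather than hand-written edges, but the query pattern is the same.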
This list is automatically generated from the titles and abstracts of the papers on this site.