Related papers: WaterVideoQA: ASV-Centric Perception and Rule-Compliant Reasoning via Multi-Modal Agents

URL: http://arxiv.org/abs/2602.22923v1
Date: Thu, 26 Feb 2026 12:12:40 GMT
Title: WaterVideoQA: ASV-Centric Perception and Rule-Compliant Reasoning via Multi-Modal Agents
Authors: Runwei Guan, Shaofeng Liang, Ningwei Ouyang, Weichen Fei, Shanliang Yao, Wei Dai, Chenhao Ge, Penglei Sun, Xiaohui Zhu, Tao Huang, Ryan Wen Liu, Hui Xiong
Abstract summary: We present WaterVideoQA, the first large-scale, comprehensive Video Question Answering benchmark specifically engineered for all-waterway environments. We also introduce NaviMind, a pioneering multi-agent neuro-symbolic system designed for open-ended maritime reasoning.
Score: 23.828845891763617
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While autonomous navigation has achieved remarkable success in passive perception (e.g., object detection and segmentation), it remains fundamentally constrained by a void in knowledge-driven, interactive environmental cognition. In the high-stakes domain of maritime navigation, the ability to bridge the gap between raw visual perception and complex cognitive reasoning is not merely an enhancement but a critical prerequisite for Autonomous Surface Vessels to execute safe and precise maneuvers. To this end, we present WaterVideoQA, the first large-scale, comprehensive Video Question Answering benchmark specifically engineered for all-waterway environments. This benchmark encompasses 3,029 video clips across six distinct waterway categories, integrating multifaceted variables such as volatile lighting and dynamic weather to rigorously stress-test ASV capabilities across a five-tier hierarchical cognitive framework. Furthermore, we introduce NaviMind, a pioneering multi-agent neuro-symbolic system designed for open-ended maritime reasoning. By synergizing Adaptive Semantic Routing, Situation-Aware Hierarchical Reasoning, and Autonomous Self-Reflective Verification, NaviMind transitions ASVs from superficial pattern matching to regulation-compliant, interpretable decision-making. Experimental results demonstrate that our framework significantly transcends existing baselines, establishing a new paradigm for intelligent, trustworthy interaction in dynamic maritime environments.

Related papers

OpenFrontier: General Navigation with Visual-Language Grounded Frontiers [54.661157616245966]
Open-world navigation requires robots to make decisions in complex everyday environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language. We propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision-language prior models.
arXiv Detail & Related papers (2026-03-05T17:02:22Z)
OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions [66.84396313837765]
We introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We provide a set of 120 tasks to measure an agent's inductive efficiency and long-horizon discovery. We also introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons.
arXiv Detail & Related papers (2026-02-05T16:31:43Z)
AirHunt: Bridging VLM Semantics and Continuous Planning for Efficient Aerial Object Navigation [13.973823761671673]
AirHunt is an aerial object navigation system that efficiently locates open-set objects with zero-shot generalization in outdoor environments. AirHunt features a dual-pathway asynchronous architecture that establishes a synergistic interface between VLM semantic reasoning and path planning. We evaluate AirHunt across diverse object navigation tasks and environments, demonstrating a higher success rate with lower navigation error and reduced flight time.
arXiv Detail & Related papers (2026-01-19T05:50:03Z)
VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory [43.2995099083993]
VLA models have shown promising potential in embodied navigation by unifying perception and planning. Most existing VLA models rely on reactive mappings directly from observations to actions. We propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition.
arXiv Detail & Related papers (2026-01-13T15:43:43Z)
IndustryNav: Exploring Spatial Reasoning of Embodied Agents in Dynamic Industrial Navigation [56.43007596544299]
IndustryNav is the first dynamic industrial navigation benchmark for active spatial reasoning. A study of nine state-of-the-art Visual Large Language Models reveals that closed-source models maintain a consistent advantage.
arXiv Detail & Related papers (2025-11-21T16:48:49Z)
Unified Multimodal Vessel Trajectory Prediction with Explainable Navigation Intention [18.699213433572996]
Vessel trajectory prediction is fundamental to intelligent maritime systems. Existing vessel trajectory prediction methods suffer from limited scenario applicability and insufficient explainability. We propose a unified vessel trajectory prediction framework incorporating explainable navigation intentions.
arXiv Detail & Related papers (2025-11-18T08:56:30Z)
Neptune-X: Active X-to-Maritime Generation for Universal Maritime Object Detection [54.1960918379255]
Neptune-X is a data-centric generative-selection framework for maritime object detection. X-to-Maritime is a multi-modality-conditioned generative model that synthesizes diverse and realistic maritime scenes. Our approach sets a new benchmark in maritime scene synthesis, significantly improving detection accuracy.
arXiv Detail & Related papers (2025-09-25T04:59:02Z)
MetAdv: A Unified and Interactive Adversarial Testing Platform for Autonomous Driving [85.04826012938642]
MetAdv is a novel adversarial testing platform that enables realistic, dynamic, and interactive evaluation. It supports flexible 3D vehicle modeling and seamless transitions between simulated and physical environments. It enables real-time capture of physiological signals and behavioral feedback from drivers.
arXiv Detail & Related papers (2025-08-04T03:07:54Z)
PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications [0.0]
PhysNav-DG is a novel framework that integrates classical sensor fusion with the semantic power of vision-language models. Our dual-branch architecture predicts navigation actions from multi-sensor inputs while simultaneously generating detailed chain-of-thought explanations.
arXiv Detail & Related papers (2025-05-03T17:59:26Z)
NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants [24.689242976554482]
Navigating unfamiliar environments presents significant challenges for household robots. Existing reinforcement learning methods cannot be directly transferred to new environments. We try to transfer the logical knowledge and the generalization ability of pre-trained foundation models to zero-shot navigation.
arXiv Detail & Related papers (2025-02-19T17:27:47Z)
ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments. ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on the R2R-CE and RxR-CE datasets, respectively.
arXiv Detail & Related papers (2023-04-06T13:07:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.