WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar
- URL: http://arxiv.org/abs/2403.12686v3
- Date: Fri, 5 Apr 2024 02:34:01 GMT
- Title: WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar
- Authors: Runwei Guan, Liye Jia, Fengyufan Yang, Shanliang Yao, Erick Purwanto, Xiaohui Zhu, Eng Gee Lim, Jeremy Smith, Ka Lok Man, Xuming Hu, Yutao Yue
- Abstract summary: We introduce WaterVG, the first visual grounding dataset designed for USV-based waterway perception driven by human prompts.
WaterVG includes 11,568 samples with 34,987 referred targets, whose prompts integrate both visual and radar characteristics.
We propose a low-power visual grounding model, Potamoi, which is a multi-task model with a well-designed Phased Heterogeneous Modality Fusion (PHMF) mode.
- Score: 14.984396484574509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The perception of waterways based on human intent is significant for autonomous navigation and operations of Unmanned Surface Vehicles (USVs) in water environments. Inspired by visual grounding, we introduce WaterVG, the first visual grounding dataset designed for USV-based waterway perception driven by human prompts. WaterVG encompasses prompts describing multiple targets, with annotations at the instance level including bounding boxes and masks. Notably, WaterVG includes 11,568 samples with 34,987 referred targets, whose prompts integrate both visual and radar characteristics. This text-guided two-sensor paradigm matches the finer granularity of text prompts with the visual and radar features of referred targets. Moreover, we propose a low-power visual grounding model, Potamoi, a multi-task model with a well-designed Phased Heterogeneous Modality Fusion (PHMF) mode, including Adaptive Radar Weighting (ARW) and Multi-Head Slim Cross Attention (MHSCA). Specifically, ARW extracts the radar features required for fusion with vision for prompt alignment. MHSCA is an efficient fusion module with a remarkably small parameter count and low FLOPs, elegantly fusing the scenario context captured by the two sensors with linguistic features, and it performs impressively on visual grounding tasks. Comprehensive experiments and evaluations have been conducted on WaterVG, where Potamoi achieves state-of-the-art performance compared with its counterparts.
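The abstract names ARW and MHSCA but does not spell out their internals here. Purely as a hedged illustration, the PyTorch sketch below shows one plausible reading: ARW as a channel gate that reweights radar tokens by a pooled scene descriptor, and MHSCA as a cross attention whose key and value share a single projection to keep parameters and FLOPs low. The gating design, the shared K/V projection, and all shapes are assumptions for this sketch, not the paper's published architecture.

```python
# Hypothetical sketch of the two fusion steps named in the abstract.
# All design details here are assumptions for illustration.
import torch
import torch.nn as nn


class AdaptiveRadarWeighting(nn.Module):
    """Gates radar channels by a pooled scene descriptor (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, radar: torch.Tensor) -> torch.Tensor:
        # radar: (B, N_r, C); reweight channels via a global context vector
        ctx = radar.mean(dim=1, keepdim=True)          # (B, 1, C)
        return radar * self.gate(ctx)                  # (B, N_r, C)


class SlimCrossAttention(nn.Module):
    """Cross attention with one shared K/V projection to cut params/FLOPs."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.q = nn.Linear(dim, dim, bias=False)
        self.kv = nn.Linear(dim, dim, bias=False)      # shared K and V (assumed)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, text: torch.Tensor, sensor: torch.Tensor) -> torch.Tensor:
        B, T, C = text.shape
        H, d = self.heads, C // self.heads
        q = self.q(text).view(B, T, H, d).transpose(1, 2)       # (B, H, T, d)
        kv = self.kv(sensor).view(B, -1, H, d).transpose(1, 2)  # (B, H, S, d)
        attn = (q @ kv.transpose(-2, -1) * self.scale).softmax(dim=-1)
        return self.out((attn @ kv).transpose(1, 2).reshape(B, T, C))


# Toy usage: gate radar tokens, concatenate with vision, attend from text.
vision = torch.randn(2, 196, 256)   # camera tokens
radar = torch.randn(2, 64, 256)     # radar point/pillar tokens
text = torch.randn(2, 20, 256)      # prompt embeddings
sensor = torch.cat([vision, AdaptiveRadarWeighting(256)(radar)], dim=1)
fused = SlimCrossAttention(256)(text, sensor)   # (2, 20, 256)
```

Sharing one projection for K and V roughly halves the projection parameters of a standard cross-attention layer, which is one way a module could earn the "slim" label; whether Potamoi does exactly this is not stated in the abstract.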
Related papers
- RadarLLM: Empowering Large Language Models to Understand Human Motion from Millimeter-wave Point Cloud Sequence [10.115852646162843]
We present Radar-LLM, the first framework that leverages large language models (LLMs) for human motion understanding using millimeter-wave radar as the sensing modality.
To address data scarcity, we introduce a physics-aware synthesis pipeline that generates realistic radar-text pairs from motion-text datasets.
Radar-LLM achieves state-of-the-art performance across both synthetic and real-world benchmarks, enabling accurate translation of millimeter-wave signals to natural language descriptions.
arXiv Detail & Related papers (2025-04-14T04:18:25Z) - Inland Waterway Object Detection in Multi-environment: Dataset and Approach [12.00732943849236]
This paper introduces the Multi-environment Inland Waterway Vessel dataset (MEIWVD).
MEIWVD comprises 32,478 high-quality images from diverse scenarios, including sunny, rainy, foggy, and artificial lighting conditions.
This paper proposes a scene-guided image enhancement module that adaptively improves water-surface images according to environmental conditions.
arXiv Detail & Related papers (2025-04-07T08:45:00Z) - Towards an Autonomous Surface Vehicle Prototype for Artificial Intelligence Applications of Water Quality Monitoring [68.41400824104953]
This paper presents a vehicle prototype that addresses the use of Artificial Intelligence algorithms and enhanced sensing techniques for water quality monitoring.
The vehicle is fully equipped with high-quality sensors to measure water quality parameters and water depth.
By means of a stereo camera, it can also detect and locate macro-plastics in real environments.
arXiv Detail & Related papers (2024-10-08T10:35:32Z) - NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding based on Prompt-Guided Camera and 4D mmWave Radar [7.8129510753821325]
NanoMVG is a low-power multi-task model for waterway embodied perception.
It guides both camera and 4D millimeter-wave radar to locate specific object(s) through natural language.
arXiv Detail & Related papers (2024-08-30T11:22:09Z) - Improving Zero-Shot ObjectNav with Generative Communication [60.84730028539513]
We propose a new method for improving zero-shot ObjectNav.
Our approach takes into account that the ground agent may have a limited and sometimes obstructed view.
arXiv Detail & Related papers (2024-08-03T22:55:26Z) - Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension [21.598751853520834]
4D millimeter-wave radars provide denser point clouds than conventional radars and perceive both semantic and physical characteristics of objects.
To foster the development of natural language-driven context understanding in radar scenes for 3D visual grounding, we construct the first dataset, Talk2Radar.
We propose a novel model, T-RadarNet, for 3D referring expression comprehension on point clouds, achieving state-of-the-art (SOTA) performance on the Talk2Radar dataset.
arXiv Detail & Related papers (2024-05-21T14:26:36Z) - Radar Fields: Frequency-Space Neural Scene Representations for FMCW Radar [62.51065633674272]
We introduce Radar Fields - a neural scene reconstruction method designed for active radar imagers.
Our approach unites an explicit, physics-informed sensor model with an implicit neural geometry and reflectance model to directly synthesize raw radar measurements.
We validate the effectiveness of the method across diverse outdoor scenarios, including urban scenes with dense vehicles and infrastructure.
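As background on what "directly synthesize raw radar measurements" can mean for an FMCW imager: each scatterer at range R contributes a beat tone at f_b = 2SR/c for chirp slope S, so a reflectance profile along a ray maps to a simulated beat signal whose FFT recovers the ranges. The NumPy sketch below demonstrates only this classical forward model; the chirp parameters and scatterers are invented for illustration, and Radar Fields pairs such a physics model with a learned neural field rather than hand-placed scatterers.

```python
# Classical FMCW forward model (illustration only; parameters invented):
# a scatterer at range R produces a beat tone at f_b = 2*S*R/c, so summing
# tones weighted by reflectance synthesizes the raw beat signal for one ray.
import numpy as np

c = 3e8                    # speed of light (m/s)
S = 1e13                   # chirp slope (Hz/s): ~1 GHz swept in ~100 us
fs, n = 10e6, 1024         # ADC sample rate (Hz) and samples per chirp
t = np.arange(n) / fs

# Hypothetical scatterers along one ray: range (m) and reflectance.
ranges = np.array([12.0, 34.5, 60.0])
refl = np.array([1.0, 0.4, 0.7])

# Superpose beat tones (the constant phase term 4*pi*R/lambda is omitted).
beat = sum(a * np.cos(2 * np.pi * (2 * S * r / c) * t)
           for a, r in zip(refl, ranges))

# Range profile via windowed FFT; bin k maps to range k * c * fs / (2*S*n).
profile = np.abs(np.fft.rfft(beat * np.hanning(n)))

# Simple local-maximum peak picking above a sidelobe-rejection threshold.
mid = profile[1:-1]
is_peak = (mid > profile[:-2]) & (mid > profile[2:]) & (mid > 0.1 * profile.max())
bins = np.where(is_peak)[0] + 1
print(bins * c * fs / (2 * S * n))   # ~ [12.0, 34.5, 60.0], within one bin (~0.15 m)
```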
arXiv Detail & Related papers (2024-05-07T20:44:48Z) - ASY-VRNet: Waterway Panoptic Driving Perception Model based on Asymmetric Fair Fusion of Vision and 4D mmWave Radar [7.2865477881451755]
Asymmetric Fair Fusion (AFF) modules are designed to let independent features from the visual and radar modalities interact efficiently.
The ASY-VRNet model processes image and radar features based on irregular super-pixel point sets.
Compared to other lightweight models, ASY-VRNet achieves state-of-the-art performance in object detection, semantic segmentation, and drivable-area segmentation.
arXiv Detail & Related papers (2023-08-20T14:53:27Z) - Vision-Based Autonomous Navigation for Unmanned Surface Vessel in Extreme Marine Conditions [2.8983738640808645]
This paper presents an autonomous vision-based navigation framework for tracking target objects in extreme marine conditions.
The proposed framework has been thoroughly tested in simulation under extremely reduced visibility due to sandstorms and fog.
The results are compared with state-of-the-art de-hazing methods across the benchmarked MBZIRC simulation dataset.
arXiv Detail & Related papers (2023-08-08T14:25:13Z) - Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images [96.66271207089096]
FCOS-LiDAR is a fully convolutional one-stage 3D object detector for LiDAR point clouds of autonomous driving scenes.
We show that an RV-based 3D detector with standard 2D convolutions alone can achieve comparable performance to state-of-the-art BEV-based detectors.
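"RV-based" here refers to the range view: a spherical projection that maps each LiDAR return to a pixel indexed by azimuth and inclination, yielding a dense 2D image on which standard 2D convolutions operate. The sketch below shows that projection in a generic form; the image size and vertical field of view are typical values assumed for illustration, not taken from the paper.

```python
# Generic range-view projection of a LiDAR sweep (illustrative values).
import numpy as np

def to_range_image(points: np.ndarray, h: int = 64, w: int = 2048,
                   fov_up: float = 3.0, fov_down: float = -25.0) -> np.ndarray:
    """points: (N, 3) xyz; returns an (h, w) image of ranges (0 = empty)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                                   # azimuth
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-6), -1, 1))
    fu, fd = np.radians(fov_up), np.radians(fov_down)
    u = ((1.0 - (yaw + np.pi) / (2 * np.pi)) * w).astype(int) % w
    v = ((fu - pitch) / (fu - fd) * h).clip(0, h - 1).astype(int)
    img = np.zeros((h, w), dtype=np.float32)
    # Keep the nearest return per pixel by writing far points first.
    order = np.argsort(-r)
    img[v[order], u[order]] = r[order]
    return img

# Toy cloud: 10k random points around the sensor.
pts = np.random.uniform([-50, -50, -3], [50, 50, 2], size=(10000, 3))
print(to_range_image(pts).shape)   # (64, 2048)
```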
arXiv Detail & Related papers (2022-05-27T05:42:16Z) - VPAIR -- Aerial Visual Place Recognition and Localization in Large-scale Outdoor Environments [49.82314641876602]
We present a new dataset named VPAIR.
The dataset was recorded on board a light aircraft flying at an altitude of more than 300 meters above ground.
The dataset covers a trajectory more than one hundred kilometers long over various types of challenging landscapes.
arXiv Detail & Related papers (2022-05-23T18:50:08Z) - Safe Vessel Navigation Visually Aided by Autonomous Unmanned Aerial Vehicles in Congested Harbors and Waterways [9.270928705464193]
This work is the first attempt to detect and estimate distances to unknown objects from long-range visual data captured with conventional RGB cameras and auxiliary absolute positioning systems (e.g., GPS).
The simulation results illustrate the accuracy and efficacy of the proposed method for visually aided navigation of vessels assisted by UAVs.
arXiv Detail & Related papers (2021-08-09T08:15:17Z) - Perceiving Traffic from Aerial Images [86.994032967469]
We propose an object detection method called Butterfly Detector that is tailored to detect objects in aerial images.
We evaluate our Butterfly Detector on two publicly available UAV datasets (UAVDT and VisDrone 2019) and show that it outperforms previous state-of-the-art methods while remaining real-time.
arXiv Detail & Related papers (2020-09-16T11:37:43Z)