NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding based on Prompt-Guided Camera and 4D mmWave Radar
- URL: http://arxiv.org/abs/2408.17207v1
- Date: Fri, 30 Aug 2024 11:22:09 GMT
- Title: NanoMVG: USV-Centric Low-Power Multi-Task Visual Grounding based on Prompt-Guided Camera and 4D mmWave Radar
- Authors: Runwei Guan, Jianan Liu, Liye Jia, Haocheng Zhao, Shanliang Yao, Xiaohui Zhu, Ka Lok Man, Eng Gee Lim, Jeremy Smith, Yutao Yue
- Abstract summary: NanoMVG is a low-power multi-task model for waterway embodied perception.
It guides both camera and 4D millimeter-wave radar to locate specific object(s) through natural language.
- Score: 7.8129510753821325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, visual grounding and multi-sensor setups have been incorporated into perception systems for terrestrial autonomous driving and Unmanned Surface Vehicles (USVs), yet the high complexity of modern learning-based multi-sensor visual grounding models prevents such models from being deployed on USVs in real life. To this end, we design a low-power multi-task model named NanoMVG for waterway embodied perception, guiding both a camera and a 4D millimeter-wave radar to locate specific object(s) through natural language. NanoMVG can perform both box-level and mask-level visual grounding tasks simultaneously. Compared to other visual grounding models, NanoMVG achieves highly competitive performance on the WaterVG dataset, particularly in harsh environments, and boasts ultra-low power consumption for long endurance.
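For a concrete picture of what prompt-guided camera-radar grounding with parallel box and mask heads can look like, here is a minimal sketch: prompt embeddings condition fused camera and radar tokens, and two heads decode a box and a mask. All module names, dimensions, and the fusion scheme are illustrative assumptions, not the NanoMVG architecture.

```python
# Hypothetical sketch of a prompt-guided camera + radar multi-task grounding model.
# Module names, dimensions, and fusion strategy are illustrative assumptions;
# they are NOT the NanoMVG implementation.
import torch
import torch.nn as nn

class PromptGuidedGrounding(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.img_proj = nn.Conv2d(3, dim, kernel_size=4, stride=4)   # camera tokens
        self.radar_proj = nn.Linear(5, dim)   # 4D radar points: x, y, z, velocity, RCS
        self.text_proj = nn.Linear(300, dim)  # pre-embedded prompt tokens
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.box_head = nn.Linear(dim, 4)      # box-level grounding: (cx, cy, w, h)
        self.mask_head = nn.Conv2d(dim, 1, 1)  # mask-level grounding: per-pixel logit

    def forward(self, image, radar_points, prompt_emb):
        b = image.shape[0]
        img = self.img_proj(image)                       # (B, D, H/4, W/4)
        h, w = img.shape[-2:]
        img_tokens = img.flatten(2).transpose(1, 2)      # (B, HW, D)
        radar_tokens = self.radar_proj(radar_points)     # (B, N, D)
        text_tokens = self.text_proj(prompt_emb)         # (B, T, D)
        # Concatenated camera + radar tokens attend to the prompt tokens.
        sensors = torch.cat([img_tokens, radar_tokens], dim=1)
        fused, _ = self.fuse(sensors, text_tokens, text_tokens)
        box = self.box_head(fused.mean(dim=1)).sigmoid()  # normalized box
        mask = self.mask_head(fused[:, :h * w].transpose(1, 2).reshape(b, -1, h, w))
        return box, mask

model = PromptGuidedGrounding()
box, mask = model(torch.randn(1, 3, 64, 64),    # image
                  torch.randn(1, 32, 5),        # radar point cloud
                  torch.randn(1, 8, 300))       # prompt embeddings
print(box.shape, mask.shape)  # torch.Size([1, 4]) torch.Size([1, 1, 16, 16])
```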
Related papers
- SkyVLN: Vision-and-Language Navigation and NMPC Control for UAVs in Urban Environments [7.251041314934871]
Unmanned Aerial Vehicles (UAVs) have emerged as versatile tools across various sectors, driven by their mobility and adaptability.
This paper introduces SkyVLN, a novel framework integrating vision-and-language navigation (VLN) with Nonlinear Model Predictive Control (NMPC) to enhance UAV autonomy in complex urban environments.
arXiv Detail & Related papers (2025-07-09T05:38:32Z)
- USVTrack: USV-Based 4D Radar-Camera Tracking Dataset for Autonomous Driving in Inland Waterways [6.061547952604821]
We present USVTrack, the first 4D radar-camera tracking dataset tailored for autonomous driving in waterborne transportation systems.
We also present a simple but effective radar-camera matching method, termed RCM, which can be plugged into popular two-stage association trackers.
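The summary does not detail RCM; as a rough illustration of how a radar-camera matching term can be "plugged into" a two-stage association tracker, one can blend the usual IoU cost with a radar range/velocity consistency cost before Hungarian assignment. The blending weight and cost terms below are assumptions, not the paper's method.

```python
# Hedged sketch of a radar-aware association step for a two-stage tracker:
# the IoU cost between track and detection boxes is blended with a radar
# range/velocity consistency cost. Alpha and the cost definitions are
# illustrative assumptions, not the RCM paper's exact formulation.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(tracks, detections, track_radar, det_radar, alpha=0.7):
    """Matching cost = alpha * (1 - IoU) + (1 - alpha) * radar gap."""
    cost = np.zeros((len(tracks), len(detections)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            radar_cost = np.abs(track_radar[i] - det_radar[j]).mean()
            cost[i, j] = alpha * (1.0 - iou(t, d)) + (1.0 - alpha) * radar_cost
    rows, cols = linear_sum_assignment(cost)  # Hungarian assignment
    return list(zip(rows, cols))

tracks = [(10, 10, 50, 50), (100, 100, 140, 150)]
dets = [(105, 98, 142, 152), (12, 11, 52, 49)]
print(associate(tracks, dets,
                track_radar=np.array([[20.0, 1.5], [45.0, -3.0]]),  # (range m, radial vel m/s)
                det_radar=np.array([[44.0, -2.8], [21.0, 1.4]])))
```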
arXiv Detail & Related papers (2025-06-23T15:13:57Z)
- NOVA: Navigation via Object-Centric Visual Autonomy for High-Speed Target Tracking in Unstructured GPS-Denied Environments [56.35569661650558]
We introduce NOVA, a fully onboard, object-centric framework that enables robust target tracking and collision-aware navigation.
Rather than constructing a global map, NOVA formulates perception, estimation, and control entirely in the target's reference frame.
We validate NOVA across challenging real-world scenarios, including urban mazes, forest trails, and repeated transitions through buildings with intermittent GPS loss.
arXiv Detail & Related papers (2025-06-23T14:28:30Z)
- Navigation World Models [68.58459393846461]
We introduce a controllable video generation model that predicts future visual observations based on past observations and navigation actions.
In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal.
Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy.
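The planning loop this summary describes, simulating candidate action sequences with the world model and scoring them against the goal, can be sketched as follows. The dynamics stub stands in for the learned video-prediction model, and all names and details are illustrative assumptions.

```python
# Hedged sketch of planning-by-simulation: candidate action sequences are
# rolled out through a (stub) world model and ranked by how close the
# predicted final state lands to the goal.
import numpy as np

def world_model_step(state, action):
    """Stub dynamics: in NWM this is a learned video-prediction model."""
    return state + action  # placeholder transition

def rank_trajectories(start, goal, candidate_plans):
    """Simulate each plan and score it by negative distance to the goal."""
    scores = []
    for plan in candidate_plans:
        state = start.copy()
        for action in plan:
            state = world_model_step(state, action)
        scores.append(-np.linalg.norm(state - goal))
    best = int(np.argmax(scores))
    return best, scores

start, goal = np.zeros(2), np.array([3.0, 4.0])
# Plans could come from scratch sampling or from an external policy.
plans = [np.random.uniform(-1, 1, size=(5, 2)) for _ in range(16)]
best, scores = rank_trajectories(start, goal, plans)
print(f"best plan index: {best}, score: {scores[best]:.3f}")
```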
arXiv Detail & Related papers (2024-12-04T18:59:45Z)
- UniBEVFusion: Unified Radar-Vision BEVFusion for 3D Object Detection [2.123197540438989]
Many radar-vision fusion models treat radar as a sparse LiDAR, underutilizing radar-specific information.
We propose the Radar Depth Lift-Splat-Shoot (RDL) module, which integrates radar-specific data into the depth prediction process.
We also introduce a Unified Feature Fusion (UFF) approach that extracts BEV features across different modalities.
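As a hedged illustration of what folding radar into Lift-Splat-Shoot style depth prediction can look like, the sketch below biases the per-pixel categorical depth distribution toward depths returned by projected radar points. The logit-bias formulation and all shapes are assumptions, not the published RDL module.

```python
# Hypothetical radar-biased depth prediction: image-only depth logits are
# sharpened around depth bins where a projected radar point landed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RadarBiasedDepth(nn.Module):
    def __init__(self, in_ch=64, depth_bins=48):
        super().__init__()
        self.depth_net = nn.Conv2d(in_ch, depth_bins, 1)  # image-only depth logits
        self.depth_bins = depth_bins

    def forward(self, img_feat, radar_depth_map, radar_mask):
        # img_feat: (B, C, H, W); radar_depth_map: (B, H, W) depth-bin index at
        # pixels where a radar point projects; radar_mask: (B, H, W) validity.
        logits = self.depth_net(img_feat)                      # (B, D, H, W)
        one_hot = F.one_hot(radar_depth_map, self.depth_bins)  # (B, H, W, D)
        prior = one_hot.permute(0, 3, 1, 2).float()            # (B, D, H, W)
        # Add the radar prior as a logit bias only at radar-hit pixels.
        logits = logits + 5.0 * prior * radar_mask.unsqueeze(1)
        return logits.softmax(dim=1)                           # categorical depth

net = RadarBiasedDepth()
depth = net(torch.randn(1, 64, 32, 32),
            torch.randint(0, 48, (1, 32, 32)),
            (torch.rand(1, 32, 32) > 0.95).float())
print(depth.shape)  # torch.Size([1, 48, 32, 32])
```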
arXiv Detail & Related papers (2024-09-23T06:57:27Z)
- BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents [56.33989853438012]
We propose BEVWorld, a framework that transforms multimodal sensor inputs into a unified and compact Bird's Eye View latent space for holistic environment modeling.
The proposed world model consists of two main components: a multi-modal tokenizer and a latent BEV sequence diffusion model.
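A minimal sketch of that two-component pattern, a tokenizer compressing multimodal inputs into a BEV latent plus a sequence model predicting future latents, is shown below. The single conditioned convolution stands in for the latent diffusion model; every module and shape is illustrative, not the BEVWorld design.

```python
# Hedged sketch: multi-modal tokenizer -> shared BEV latent -> future-latent
# predictor. A real latent diffusion model is replaced by a one-step stub.
import torch
import torch.nn as nn

class BEVTokenizer(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.cam_enc = nn.Conv2d(3, dim, 8, stride=8)    # camera BEV features
        self.lidar_enc = nn.Conv2d(1, dim, 8, stride=8)  # LiDAR BEV occupancy

    def forward(self, camera_bev, lidar_bev):
        return self.cam_enc(camera_bev) + self.lidar_enc(lidar_bev)  # (B, D, h, w)

class LatentPredictor(nn.Module):
    """Stub for the latent BEV sequence diffusion model: one conditioned step."""
    def __init__(self, dim=32):
        super().__init__()
        self.step = nn.Conv2d(2 * dim, dim, 3, padding=1)

    def forward(self, past_latent, noisy_future):
        return self.step(torch.cat([past_latent, noisy_future], dim=1))

tok, pred = BEVTokenizer(), LatentPredictor()
latent = tok(torch.randn(1, 3, 64, 64), torch.randn(1, 1, 64, 64))
future = pred(latent, torch.randn_like(latent))
print(latent.shape, future.shape)  # both torch.Size([1, 32, 8, 8])
```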
arXiv Detail & Related papers (2024-07-08T07:26:08Z)
- Sim-to-Real Transfer via 3D Feature Fields for Vision-and-Language Navigation [38.04404612393027]
Vision-and-language navigation (VLN) enables the agent to navigate to a remote location in 3D environments following the natural language instruction.
In this work, we propose a sim-to-real transfer approach to endow the monocular robots with panoramic traversability perception and panoramic semantic understanding.
Our VLN system outperforms previous SOTA monocular VLN methods in R2R-CE and RxR-CE benchmarks within the simulation environments and is also validated in real-world environments.
arXiv Detail & Related papers (2024-06-14T07:50:09Z)
- Vision Transformers for End-to-End Vision-Based Quadrotor Obstacle Avoidance [13.467819526775472]
We demonstrate the capabilities of an attention-based end-to-end approach for high-speed vision-based quadrotor obstacle avoidance.
We train and compare convolutional, U-Net, and recurrent architectures against vision transformer (ViT) models for depth image-to-control in high-fidelity simulation.
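A toy version of the depth-image-to-control setup, a small vision transformer regressing a velocity command from one depth frame, might look like this. Patch size, depth, and the 3-dimensional command are assumptions, not the paper's configuration.

```python
# Hedged sketch of ViT-based depth-image-to-control regression.
import torch
import torch.nn as nn

class DepthToControlViT(nn.Module):
    def __init__(self, img=64, patch=8, dim=64, heads=4, layers=2, cmd_dim=3):
        super().__init__()
        n = (img // patch) ** 2
        self.patchify = nn.Conv2d(1, dim, patch, stride=patch)  # depth patches
        self.pos = nn.Parameter(torch.zeros(1, n, dim))         # learned positions
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.head = nn.Linear(dim, cmd_dim)  # e.g. a (vx, vy, vz) command

    def forward(self, depth):
        x = self.patchify(depth).flatten(2).transpose(1, 2) + self.pos
        x = self.encoder(x)
        return self.head(x.mean(dim=1))  # pool tokens, regress control

policy = DepthToControlViT()
cmd = policy(torch.randn(1, 1, 64, 64))  # one depth image in, one command out
print(cmd.shape)  # torch.Size([1, 3])
```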
arXiv Detail & Related papers (2024-05-16T18:36:43Z)
- Radar Fields: Frequency-Space Neural Scene Representations for FMCW Radar [62.51065633674272]
We introduce Radar Fields - a neural scene reconstruction method designed for active radar imagers.
Our approach unites an explicit, physics-informed sensor model with an implicit neural geometry and reflectance model to directly synthesize raw radar measurements.
We validate the effectiveness of the method across diverse outdoor scenarios, including urban scenes with dense vehicles and infrastructure.
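As a rough sketch of the overall recipe, an implicit field queried along radar rays plus an explicit sensor model that accumulates returns into range bins, consider the toy renderer below. The power-accumulation equation is a placeholder, not the paper's physics-informed frequency-space model.

```python
# Hedged sketch: an implicit MLP gives per-point (occupancy, reflectance), and
# an explicit sensor model integrates returns along a radar ray into range bins.
import torch
import torch.nn as nn

field = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                      nn.Linear(64, 2))  # per-point (occupancy, reflectance) logits

def render_ray(origin, direction, n_samples=64, max_range=50.0):
    """Accumulate reflected power into range samples along one radar ray."""
    t = torch.linspace(0.1, max_range, n_samples)
    pts = origin + t[:, None] * direction                 # (n, 3) sample points
    occ, refl = field(pts).sigmoid().unbind(dim=-1)
    # Transmittance: probability the ray survives to each sample.
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - occ[:-1]]), dim=0)
    power = trans * occ * refl / (t ** 2 + 1e-6)          # toy 1/r^2 falloff
    return t, power

t, power = render_ray(torch.zeros(3), torch.tensor([1.0, 0.0, 0.0]))
print(t.shape, power.shape)  # torch.Size([64]) torch.Size([64])
```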
arXiv Detail & Related papers (2024-05-07T20:44:48Z)
- WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar [14.984396484574509]
We introduce WaterVG, the first visual grounding dataset designed for USV-based waterway perception guided by human prompts.
WaterVG includes 11,568 samples with 34,987 referred targets, whose prompts integrate both visual and radar characteristics.
We propose a low-power visual grounding model, Potamoi, which is a multi-task model with a well-designed Phased Heterogeneous Modality Fusion mode.
arXiv Detail & Related papers (2024-03-19T12:45:18Z)
- ASY-VRNet: Waterway Panoptic Driving Perception Model based on Asymmetric Fair Fusion of Vision and 4D mmWave Radar [7.2865477881451755]
Asymmetric Fair Fusion (AFF) modules are designed to efficiently interact with independent features from both the visual and radar modalities.
The ASY-VRNet model processes image and radar features based on irregular super-pixel point sets.
Compared to other lightweight models, ASY-VRNet achieves state-of-the-art performance in object detection, semantic segmentation, and drivable-area segmentation.
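One plausible reading of "asymmetric" fusion, dense visual tokens querying sparse radar tokens while the radar stream is only re-weighted and each modality keeps its own feature path, is sketched below. This is an illustrative interpretation, not the published AFF module.

```python
# Hedged sketch of an asymmetric vision-radar fusion step.
import torch
import torch.nn as nn

class AsymmetricFusion(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, dim)

    def forward(self, vis_tokens, radar_tokens):
        # Asymmetric roles: dense visual tokens query the sparse radar tokens,
        # while the radar stream is only lightly re-weighted, not re-queried.
        v, _ = self.cross(vis_tokens, radar_tokens, radar_tokens)
        r = radar_tokens * self.gate(radar_tokens).sigmoid()
        return vis_tokens + v, r  # both streams stay independent

fuse = AsymmetricFusion()
vis, radar = fuse(torch.randn(1, 196, 64),  # super-pixel visual tokens
                  torch.randn(1, 32, 64))   # radar point tokens
print(vis.shape, radar.shape)  # (1, 196, 64) and (1, 32, 64)
```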
arXiv Detail & Related papers (2023-08-20T14:53:27Z)
- DADFNet: Dual Attention and Dual Frequency-Guided Dehazing Network for Video-Empowered Intelligent Transportation [79.18450119567315]
Adverse weather conditions pose severe challenges for video-based transportation surveillance.
We propose a dual attention and dual frequency-guided dehazing network (termed DADFNet) for real-time visibility enhancement.
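As an illustration of what "frequency-guided" processing can mean, the sketch below splits an image into low- and high-frequency components with an FFT low-pass mask so each branch can be handled separately. The split radius and branch design are assumptions, not the DADFNet architecture.

```python
# Hedged sketch of a dual-frequency split: low frequencies via an FFT low-pass
# mask, high frequencies as the residual.
import torch

def frequency_split(img, radius=8):
    """img: (B, C, H, W). Returns (low, high) frequency components."""
    f = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    b, c, h, w = img.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Circular low-pass mask centered on the zero-frequency component.
    mask = (((yy - h // 2) ** 2 + (xx - w // 2) ** 2) <= radius ** 2).to(img.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(f * mask, dim=(-2, -1))).real
    return low, img - low

low, high = frequency_split(torch.rand(1, 3, 64, 64))
print(low.shape, high.shape)  # both torch.Size([1, 3, 64, 64])
```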
arXiv Detail & Related papers (2023-04-19T11:55:30Z)
- Ultra-low Power Deep Learning-based Monocular Relative Localization Onboard Nano-quadrotors [64.68349896377629]
This work presents a novel autonomous end-to-end system that addresses the monocular relative localization, through deep neural networks (DNNs), of two peer nano-drones.
To cope with the ultra-constrained nano-drone platform, we propose a vertically-integrated framework, including dataset augmentation, quantization, and system optimizations.
Experimental results show that our DNN can precisely localize a 10 cm target nano-drone using only low-resolution monochrome images, at distances of up to 2 m.
arXiv Detail & Related papers (2023-03-03T14:14:08Z)
- Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images [96.66271207089096]
FCOS-LiDAR is a fully convolutional one-stage 3D object detector for LiDAR point clouds of autonomous driving scenes.
We show that an RV-based 3D detector with standard 2D convolutions alone can achieve comparable performance to state-of-the-art BEV-based detectors.
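The range-view (RV) idea, projecting the point cloud onto a 2D image indexed by azimuth and inclination so standard 2D convolutions apply, can be sketched as follows. The image resolution and vertical field of view are assumptions, not the paper's settings.

```python
# Hedged sketch of LiDAR range-view projection: each 3D point lands in a 2D
# cell addressed by its azimuth (column) and inclination (row), storing range.
import numpy as np

def to_range_image(points, h=32, w=512):
    """points: (N, 3) array of x, y, z. Returns an (h, w) range image in meters."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rng = np.sqrt(x**2 + y**2 + z**2)
    azimuth = np.arctan2(y, x)                 # [-pi, pi)
    inclination = np.arcsin(z / (rng + 1e-9))  # vertical angle
    col = ((azimuth + np.pi) / (2 * np.pi) * w).astype(int).clip(0, w - 1)
    row = ((inclination + 0.4) / 0.8 * h).astype(int).clip(0, h - 1)  # assumed +/-0.4 rad FoV
    img = np.zeros((h, w), dtype=np.float32)
    img[row, col] = rng                        # keep the last hit per cell
    return img

pts = np.random.uniform(-30, 30, size=(4096, 3))
print(to_range_image(pts).shape)  # (32, 512), ready for standard 2D convolutions
```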
arXiv Detail & Related papers (2022-05-27T05:42:16Z)
- VPAIR -- Aerial Visual Place Recognition and Localization in Large-scale Outdoor Environments [49.82314641876602]
We present a new dataset named VPAIR.
The dataset was recorded on board a light aircraft flying at an altitude of more than 300 meters above ground.
The dataset covers a more than one hundred kilometers long trajectory over various types of challenging landscapes.
arXiv Detail & Related papers (2022-05-23T18:50:08Z)
- A Multi-UAV System for Exploration and Target Finding in Cluttered and GPS-Denied Environments [68.31522961125589]
We propose a framework for a team of UAVs to cooperatively explore and find a target in complex GPS-denied environments with obstacles.
The team of UAVs autonomously navigates, explores, detects, and finds the target in a cluttered environment with a known map.
Results indicate that the proposed multi-UAV system has improvements in terms of time-cost, the proportion of search area surveyed, as well as successful rates for search and rescue missions.
arXiv Detail & Related papers (2021-07-19T12:54:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.