NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding
- URL: http://arxiv.org/abs/2510.27481v1
- Date: Fri, 31 Oct 2025 14:00:35 GMT
- Title: NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding
- Authors: Wei Xu, Cheng Wang, Dingkang Liang, Zongchuang Zhao, Xingyu Jiang, Peng Zhang, Xiang Bai,
- Abstract summary: We study underwater scene understanding methods, which aim to achieve automated underwater exploration. NautData is a dataset containing 1.45M image-text pairs supporting eight underwater scene understanding tasks. We propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information.
- Score: 60.76337064425815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Underwater exploration offers critical insights into our planet and attracts increasing attention for its broader applications in resource exploration, national security, etc. We study underwater scene understanding methods, which aim to achieve automated underwater exploration. Underwater scene understanding demands multi-task perception at multiple granularities. However, the absence of large-scale underwater multi-task instruction-tuning datasets hinders the progress of this research. To bridge this gap, we construct NautData, a dataset containing 1.45M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of underwater scene understanding models. Underwater image degradation is a widely recognized challenge that interferes with underwater tasks. To improve the robustness of underwater scene understanding, we introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information. We integrate this module into the renowned baselines LLaVA-1.5 and Qwen2.5-VL to build our underwater LMM, NAUTILUS. Experiments conducted on NautData and public underwater datasets demonstrate the effectiveness of the VFE module, which consistently improves the performance of both baselines on the majority of supported tasks, establishing the superiority of NAUTILUS in underwater scene understanding. Data and models are available at https://github.com/H-EmbodVis/NAUTILUS.
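The "physical priors derived from underwater imaging models" mentioned in the abstract typically refer to the simplified underwater image formation model, I(x) = J(x)·t(x) + B∞·(1 − t(x)) with transmission t(x) = exp(−β·d(x)), where J is the clear scene radiance, B∞ the background light, β the per-channel attenuation coefficient, and d the scene depth. The paper's VFE module is not reproduced here; the following is only a minimal sketch of inverting that standard formation model, with illustrative (assumed) values for β and B∞:

```python
import numpy as np

def restore_underwater(I, depth, beta, B_inf, t_min=0.1):
    """Invert the simplified underwater image formation model
        I(x) = J(x) * t(x) + B_inf * (1 - t(x)),  t(x) = exp(-beta * d(x))
    to recover an estimate of the clear scene radiance J.
    I: (H, W, 3) observed image in [0, 1]; depth: (H, W) scene depth in meters;
    beta, B_inf: per-channel attenuation and background light (illustrative).
    """
    t = np.exp(-depth[..., None] * beta)   # per-pixel, per-channel transmission
    t = np.clip(t, t_min, 1.0)             # floor t to avoid amplifying noise
    J = (I - B_inf * (1.0 - t)) / t        # subtract backscatter, undo attenuation
    return np.clip(J, 0.0, 1.0)

# Toy example: a uniformly bluish-green image at 5 m depth.
I = np.full((4, 4, 3), 0.35)
depth = np.full((4, 4), 5.0)
beta = np.array([0.10, 0.04, 0.02])       # red attenuates fastest underwater
B_inf = np.array([0.10, 0.40, 0.50])      # bluish-green background light
J = restore_underwater(I, depth, beta, B_inf)
```

This inversion is the classic prior behind many underwater enhancement methods; learned modules such as VFE apply related physics in feature space rather than pixel space.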
Related papers
- UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding [54.16709436340606]
Large vision-language models (VLMs) have achieved remarkable success in natural scene understanding. Underwater imagery presents unique challenges, including severe light attenuation, color distortion, and suspended particle scattering. We introduce UWBench, a benchmark specifically designed for underwater vision-language understanding.
arXiv Detail & Related papers (2025-10-21T03:32:15Z)
- DUViN: Diffusion-Based Underwater Visual Navigation via Knowledge-Transferred Depth Features [47.88998580611257]
We propose a Diffusion-based Underwater Visual Navigation policy via knowledge-transferred depth features, named DUViN. DUViN guides the vehicle to avoid obstacles and maintain a safe, perception-aware altitude relative to the terrain without relying on pre-built maps. Experiments in both simulated and real-world underwater environments demonstrate the effectiveness and generalization of our approach.
arXiv Detail & Related papers (2025-09-03T03:43:12Z)
- Advancing Marine Research: UWSAM Framework and UIIS10K Dataset for Precise Underwater Instance Segmentation [110.02397462607449]
We propose a large-scale underwater instance segmentation dataset, UIIS10K, which includes 10,048 images with pixel-level annotations for 10 categories. We then introduce UWSAM, an efficient model designed for automatic and accurate segmentation of underwater instances. We show that our model is effective, achieving significant performance improvements over state-of-the-art methods on multiple underwater instance datasets.
arXiv Detail & Related papers (2025-05-21T14:36:01Z)
- Improving underwater semantic segmentation with underwater image quality attention and muti-scale aggregation attention [13.73105543582749]
UnderWater SegFormer (UWSegFormer) is a transformer-based framework for semantic segmentation of low-quality underwater images. The proposed method has advantages in terms of segmentation completeness, boundary clarity, and subjective perceptual details when compared to SOTA methods.
arXiv Detail & Related papers (2025-03-30T12:47:56Z)
- Enhancing Underwater Imaging with 4-D Light Fields: Dataset and Method [77.80712860663886]
4-D light fields (LFs) can enhance underwater imaging, which is plagued by light absorption, scattering, and other challenges. We propose a progressive framework for underwater 4-D LF image enhancement and depth estimation. We construct the first 4-D LF-based underwater image dataset for quantitative evaluation and supervised training of learning-based methods.
arXiv Detail & Related papers (2024-08-30T15:06:45Z)
- WaterMono: Teacher-Guided Anomaly Masking and Enhancement Boosting for Robust Underwater Self-Supervised Monocular Depth Estimation [4.909989222186828]
We propose WaterMono, a novel framework for depth estimation and image enhancement.
It incorporates the following key measures: (1) We present a Teacher-Guided Anomaly Mask to identify dynamic regions within the images; (2) We employ depth information combined with the Underwater Image Formation Model to generate enhanced images, which in turn contribute to the depth estimation task; and (3) We utilize a rotated distillation strategy to enhance the model's rotational robustness.
arXiv Detail & Related papers (2024-06-19T08:49:45Z)
- Diving into Underwater: Segment Anything Model Guided Underwater Salient Instance Segmentation and A Large-scale Dataset [60.14089302022989]
Underwater vision tasks often suffer from low segmentation accuracy due to the complex underwater circumstances.
We construct the first large-scale underwater salient instance segmentation dataset (USIS10K).
We propose an Underwater Salient Instance architecture based on Segment Anything Model (USIS-SAM) specifically for the underwater domain.
arXiv Detail & Related papers (2024-06-10T06:17:33Z)
- Atlantis: Enabling Underwater Depth Estimation with Stable Diffusion [30.122666238416716]
We propose a novel pipeline for generating underwater images using accurate terrestrial depth data.
This approach facilitates the training of supervised models for underwater depth estimation.
We introduce a unique Depth2Underwater ControlNet, trained on specially prepared Underwater, Depth, Text data triplets.
arXiv Detail & Related papers (2023-12-19T08:56:33Z)
- Virtual Underwater Datasets for Autonomous Inspections [0.0]
This study builds a bespoke dataset from photographs of items captured in a laboratory environment.
Generative Adversarial Networks (GANs) were utilized to translate the laboratory object dataset into the underwater domain.
The resulting images closely resembled the real underwater environment when compared with real-world underwater ship hull images.
arXiv Detail & Related papers (2022-09-13T14:06:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.