Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding
- URL: http://arxiv.org/abs/2506.19288v2
- Date: Tue, 01 Jul 2025 01:07:35 GMT
- Title: Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding
- Authors: Runwei Guan, Ningwei Ouyang, Tianhao Xu, Shaofeng Liang, Wei Dai, Yafeng Sun, Shang Gao, Songning Lai, Shanliang Yao, Xuming Hu, Ryan Wen Liu, Yutao Yue, Hui Xiong,
- Abstract summary: We introduce WaterCaption, the first captioning dataset specifically designed for waterway environments. WaterCaption focuses on fine-grained, multi-region long-text descriptions. We propose Da Yu, an edge-deployable multi-modal large language model for USVs.
- Score: 25.87853252053879
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated waterway environment perception is crucial for enabling unmanned surface vessels (USVs) to understand their surroundings and make informed decisions. Most existing waterway perception models primarily focus on instance-level object perception paradigms (e.g., detection, segmentation). However, due to the complexity of waterway environments, current perception datasets and models fail to achieve global semantic understanding of waterways, limiting large-scale monitoring and structured log generation. With the advancement of vision-language models (VLMs), we leverage image captioning to introduce WaterCaption, the first captioning dataset specifically designed for waterway environments. WaterCaption focuses on fine-grained, multi-region long-text descriptions, providing a new research direction for visual geo-understanding and spatial scene cognition. Specifically, it includes 20.2k image-text pairs with a vocabulary size of 1.8 million. Additionally, we propose Da Yu, an edge-deployable multi-modal large language model for USVs, featuring a novel vision-to-language projector called the Nano Transformer Adaptor (NTA). NTA effectively balances computational efficiency with the capacity for both global and fine-grained local modeling of visual features, thereby significantly enhancing the model's ability to generate long-form textual outputs. Da Yu achieves an optimal balance between performance and efficiency, surpassing state-of-the-art models on WaterCaption and several other captioning benchmarks.
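The abstract describes the Nano Transformer Adaptor only at a high level, so the following is a minimal, hypothetical PyTorch sketch of a vision-to-language projector that combines a pooled global token with a small set of learnable queries carrying fine-grained local detail. It is meant only to illustrate the global-plus-local idea the abstract mentions; the class name, dimensions, and design choices are assumptions, not the authors' NTA.

```python
# Hypothetical sketch of a compact vision-to-language projector in the spirit
# of the Nano Transformer Adaptor described above; NOT the authors' code.
import torch
import torch.nn as nn

class TinyProjector(nn.Module):
    """Maps a grid of visual features to a short sequence of LLM-sized tokens."""

    def __init__(self, vis_dim=768, llm_dim=2048, n_queries=32, n_heads=8):
        super().__init__()
        # Learnable query tokens that attend to all visual patches (local detail).
        self.queries = nn.Parameter(torch.randn(n_queries, vis_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vis_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)
        # Linear heads into the language model's embedding space.
        self.to_llm = nn.Linear(vis_dim, llm_dim)
        self.global_to_llm = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats):            # patch_feats: (B, N_patches, vis_dim)
        B = patch_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Fine-grained local modeling: queries gather information from patches.
        local_tokens, _ = self.cross_attn(q, patch_feats, patch_feats)
        local_tokens = self.norm(local_tokens)
        # Global modeling: a single mean-pooled summary of the whole scene.
        global_token = patch_feats.mean(dim=1, keepdim=True)
        # Concatenate one global token with the compressed local tokens.
        return torch.cat([self.global_to_llm(global_token),
                          self.to_llm(local_tokens)], dim=1)  # (B, 1 + n_queries, llm_dim)

# Example: project 576 ViT patch features into 33 tokens for an LLM decoder.
feats = torch.randn(2, 576, 768)
tokens = TinyProjector()(feats)
print(tokens.shape)  # torch.Size([2, 33, 2048])
```

Compressing hundreds of patch features into a few dozen tokens is what keeps the language-model side cheap enough for edge deployment, which is the trade-off the abstract emphasizes.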
Related papers
- Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos [53.723410664944566]
We present Perceive Anything Model (PAM), a framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features into multi-modal tokens.
arXiv Detail & Related papers (2025-06-05T17:51:39Z) - Inland Waterway Object Detection in Multi-environment: Dataset and Approach [12.00732943849236]
This paper introduces the Multi-environment Inland Waterway Vessel dataset (MEIWVD), which comprises 32,478 high-quality images from diverse scenarios, including sunny, rainy, foggy, and artificial lighting conditions. The paper also proposes a scene-guided image enhancement module that adaptively improves water surface images based on environmental conditions.
arXiv Detail & Related papers (2025-04-07T08:45:00Z) - AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis [40.27548815196493]
We introduce AquaticCLIP, a novel contrastive language-image pre-training model tailored for aquatic scene understanding. AquaticCLIP presents a new unsupervised learning framework that aligns images and texts in aquatic environments; a generic sketch of this kind of contrastive alignment is given after the list below. Our model sets a new benchmark for vision-language applications in underwater environments.
arXiv Detail & Related papers (2025-02-03T19:56:16Z) - WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar [14.984396484574509]
We introduce WaterVG, the first visual grounding dataset designed for USV-based waterway perception guided by human prompts.
WaterVG includes 11,568 samples with 34,987 referred targets, whose prompts integrate both visual and radar characteristics.
We propose a low-power visual grounding model, Potamoi, which is a multi-task model with a well-designed Phased Heterogeneous Modality Fusion mode.
arXiv Detail & Related papers (2024-03-19T12:45:18Z) - xT: Nested Tokenization for Larger Context in Large Images [79.37673340393475]
xT is a framework for vision transformers which aggregates global context with local details.
We are able to increase accuracy by up to 8.6% on challenging classification tasks.
arXiv Detail & Related papers (2024-03-04T10:29:58Z) - Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition [92.6211155264297]
Vision models have gained increasing attention due to their simplicity and efficiency in Scene Text Recognition (STR) task.
Recent vision models suffer from two problems; one is that the pure vision-based query results in attention drift, which usually causes poor recognition and is summarized in this paper as the linguistic insensitive drift (LID) problem.
We propose a Linguistic Perception Vision model (LPV) which explores the linguistic capability of the vision model for accurate text recognition.
arXiv Detail & Related papers (2023-05-09T02:52:47Z) - Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture; a toy illustration of this local-global pattern is sketched after the list below.
arXiv Detail & Related papers (2022-07-19T15:49:35Z) - mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z) - Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
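For the AquaticCLIP entry above, the abstract describes aligning images and texts contrastively. As a point of reference only, below is a minimal sketch of a generic CLIP-style symmetric contrastive objective; the function name, temperature, and embedding sizes are illustrative assumptions, not AquaticCLIP's actual training code.

```python
# Generic CLIP-style symmetric contrastive loss; an assumed illustration of
# image-text alignment, not AquaticCLIP's implementation.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of paired images and captions."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Matching pairs lie on the diagonal; pull them together, push others apart.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```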
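For the Hierarchical Local-Global (HLG) Transformers entry above, the abstract mentions local attention within windows combined with global attention across windows. The toy module below shows one plausible way such a local-global block could be structured; it is an assumed sketch, not the paper's architecture, and all names and dimensions are hypothetical.

```python
# Assumed toy local-window + cross-window attention block, in the spirit of
# the HLG Transformers entry above; not the paper's code.
import torch
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    def __init__(self, dim=256, heads=4, window=4):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, N, dim), N divisible by window
        B, N, D = x.shape
        w = self.window
        # Local attention: tokens attend only within their own window.
        local = x.reshape(B * N // w, w, D)
        local, _ = self.local_attn(local, local, local)
        local = local.reshape(B, N, D)
        # Global attention: one mean-pooled summary per window attends across windows.
        summaries = local.reshape(B, N // w, w, D).mean(dim=2)   # (B, N/w, D)
        mixed, _ = self.global_attn(summaries, summaries, summaries)
        # Broadcast the globally mixed summaries back to their windows.
        return local + mixed.unsqueeze(2).expand(B, N // w, w, D).reshape(B, N, D)

out = LocalGlobalBlock()(torch.randn(2, 16, 256))   # -> (2, 16, 256)
```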