Semantic-Aware Ship Detection with Vision-Language Integration
- URL: http://arxiv.org/abs/2508.15930v1
- Date: Thu, 21 Aug 2025 19:24:52 GMT
- Title: Semantic-Aware Ship Detection with Vision-Language Integration
- Authors: Jiahao Li, Jiancheng Pan, Yuze Sun, Xiaomeng Huang
- Abstract summary: Ship detection in remote sensing imagery is a critical task with wide-ranging applications, such as maritime activity monitoring, shipping logistics, and environmental studies. We propose a novel detection framework that combines Vision-Language Models (VLMs) with a multi-scale adaptive sliding window strategy. We evaluate our framework through three well-defined tasks, providing a comprehensive analysis of its performance and demonstrating its effectiveness in advancing SASD from multiple perspectives.
- Score: 9.49989812166076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ship detection in remote sensing imagery is a critical task with wide-ranging applications, such as maritime activity monitoring, shipping logistics, and environmental studies. However, existing methods often struggle to capture fine-grained semantic information, limiting their effectiveness in complex scenarios. To address these challenges, we propose a novel detection framework that combines Vision-Language Models (VLMs) with a multi-scale adaptive sliding window strategy. To facilitate Semantic-Aware Ship Detection (SASD), we introduce ShipSem-VL, a specialized Vision-Language dataset designed to capture fine-grained ship attributes. We evaluate our framework through three well-defined tasks, providing a comprehensive analysis of its performance and demonstrating its effectiveness in advancing SASD from multiple perspectives.
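The abstract names a multi-scale sliding window strategy but does not describe it. As a rough illustration only, the sketch below tiles a large image with overlapping square windows at several sizes, so that each crop could be passed to a detector or VLM. The function names, the window sizes, and the overlap ratio are assumptions made for this example, and the "adaptive" part of the paper's strategy is not reproduced here.

```python
from typing import Iterator, List, Sequence, Tuple

Window = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def axis_starts(length: int, size: int, stride: int) -> List[int]:
    """Start offsets along one axis so that windows cover [0, length)."""
    if size >= length:
        return [0]  # a single window already spans the whole axis
    starts = list(range(0, length - size + 1, stride))
    if starts[-1] != length - size:
        starts.append(length - size)  # extra window flush with the far edge
    return starts


def multiscale_windows(
    width: int,
    height: int,
    sizes: Sequence[int] = (256, 512, 1024),  # hypothetical window sizes
    overlap: float = 0.25,                    # hypothetical overlap ratio
) -> Iterator[Window]:
    """Yield overlapping square windows at each scale, clipped to the image."""
    for size in sizes:
        stride = max(1, int(size * (1 - overlap)))
        for top in axis_starts(height, size, stride):
            for left in axis_starts(width, size, stride):
                yield (left, top, min(left + size, width), min(top + size, height))


if __name__ == "__main__":
    for window in multiscale_windows(1000, 600, sizes=(512,)):
        print(window)
```

Small windows keep small vessels at a usable resolution, while large windows preserve context for big ships. The overlap reduces the chance that a ship is split across a tile boundary, at the cost of duplicate detections that would need merging afterwards.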
Related papers
- 3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting [12.057873540714098]
3DGSNav is a novel framework that embeds 3D Gaussian Splatting (3DGS) as persistent memory for vision-language models (VLMs) to enhance spatial reasoning. 3DGSNav incrementally constructs a 3DGS representation of the environment, enabling trajectory-guided free-viewpoint rendering of frontier-aware first-person views. During navigation, a real-time object detector filters potential targets, while VLM-driven active viewpoint switching performs target re-verification.
arXiv Detail & Related papers (2026-02-12T16:41:26Z) - Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning [5.517595398768408]
We present a unified aerial VLN framework that operates solely on ego monocular RGB observations and natural language instructions. This task holds promise for real-world applications such as low-altitude inspection, search-and-rescue, and autonomous aerial delivery.
arXiv Detail & Related papers (2025-12-09T14:25:24Z) - AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios [64.51320327698231]
We introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios. We develop an innovative semi-automated collaborative agent-based labeling assistant framework. We also propose HawkEyeTrack, a novel method that collaboratively enhances vision-language representation learning.
arXiv Detail & Related papers (2025-11-26T04:44:27Z) - A Multimodal Depth-Aware Method For Embodied Reference Understanding [56.30142869506262]
Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. We propose a novel ERU framework that jointly leverages data augmentation, depth-map modality, and a depth-aware decision module.
arXiv Detail & Related papers (2025-10-09T14:32:21Z) - AIS-LLM: A Unified Framework for Maritime Trajectory Prediction, Anomaly Detection, and Collision Risk Assessment with Explainable Forecasting [7.615963953174766]
We propose a novel framework, AIS-LLM, which integrates time-series AIS data with a large language model (LLM). This architecture enables the simultaneous execution of three key tasks: trajectory prediction, anomaly detection, and risk assessment of vessel collisions within a single end-to-end system. By integratively analyzing task outputs to generate situation summaries and briefings, AIS-LLM presents the potential for more intelligent and efficient maritime traffic management.
arXiv Detail & Related papers (2025-08-11T06:39:45Z) - Task-Oriented Low-Label Semantic Communication With Self-Supervised Learning [67.06363342414397]
Task-oriented semantic communication enhances transmission efficiency by conveying semantic information rather than exact messages. Deep learning (DL)-based semantic communication can effectively cultivate the essential semantic knowledge for semantic extraction, transmission, and interpretation. We propose a self-supervised learning-based semantic communication framework (SLSCom) to enhance task inference performance.
arXiv Detail & Related papers (2025-05-26T13:06:18Z) - Benchmarking Vision-Based Object Tracking for USVs in Complex Maritime Environments [0.8796261172196743]
Vision-based target tracking is crucial for unmanned surface vehicles. Real-time tracking in maritime environments is challenging due to dynamic camera movement, low visibility, and scale variation. This study proposes a vision-guided object-tracking framework for USVs.
arXiv Detail & Related papers (2024-12-10T10:35:17Z) - Exploring Spatial Representation to Enhance LLM Reasoning in Aerial Vision-Language Navigation [11.267956604072845]
Aerial Vision-and-Language Navigation (VLN) is a novel task enabling Unmanned Aerial Vehicles (UAVs) to navigate in outdoor environments through natural language instructions and visual cues. We propose a training-free, zero-shot framework for aerial VLN tasks, where the large language model (LLM) is leveraged as the agent for action prediction.
arXiv Detail & Related papers (2024-10-11T03:54:48Z) - Trustworthy Image Semantic Communication with GenAI: Explainablity, Controllability, and Efficiency [59.15544887307901]
Image semantic communication (ISC) has garnered significant attention for its potential to achieve high efficiency in visual content transmission.
Existing ISC systems based on joint source-channel coding face challenges in interpretability, operability, and compatibility.
We propose a novel trustworthy ISC framework that employs Generative Artificial Intelligence (GenAI) for multiple downstream inference tasks.
arXiv Detail & Related papers (2024-08-07T14:32:36Z) - OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation [96.46961207887722]
OVER-NAV aims to go over and beyond the current arts of IVLN techniques.
To fully exploit the interpreted navigation data, we introduce a structured representation, coded Omnigraph.
arXiv Detail & Related papers (2024-03-26T02:34:48Z) - Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes language description and bounding box into a sequence of discrete tokens.
In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
arXiv Detail & Related papers (2023-08-27T13:17:34Z) - Vision-Based Autonomous Navigation for Unmanned Surface Vessel in Extreme Marine Conditions [2.8983738640808645]
This paper presents an autonomous vision-based navigation framework for tracking target objects in extreme marine conditions.
The proposed framework has been thoroughly tested in simulation under extremely reduced visibility due to sandstorms and fog.
The results are compared with state-of-the-art de-hazing methods across the benchmarked MBZIRC simulation dataset.
arXiv Detail & Related papers (2023-08-08T14:25:13Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - Structured Scene Memory for Vision-Language Navigation [155.63025602722712]
We propose a crucial architecture for vision-language navigation (VLN).
It is compartmentalized enough to accurately memorize the percepts during navigation.
It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment.
arXiv Detail & Related papers (2021-03-05T03:41:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.