Lightweight Multimodal Artificial Intelligence Framework for Maritime Multi-Scene Recognition
- URL: http://arxiv.org/abs/2503.06978v1
- Date: Mon, 10 Mar 2025 06:47:38 GMT
- Title: Lightweight Multimodal Artificial Intelligence Framework for Maritime Multi-Scene Recognition
- Authors: Xinyu Xi, Hua Yang, Shentai Zhang, Yijie Liu, Sijin Sun, Xiuju Fu
- Abstract summary: Maritime Multi-Scene Recognition is crucial for enhancing the capabilities of intelligent marine robotics. Our framework integrates image data, textual descriptions, and classification vectors generated by a Multimodal Large Language Model (MLLM). Our model achieves 98% accuracy, surpassing previous SOTA models by 3.5%. This work provides a high-performance solution for real-time maritime scene recognition, enabling Autonomous Surface Vehicles (ASVs) to support environmental monitoring and disaster response in resource-limited settings.
- Score: 5.667043618885205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Maritime Multi-Scene Recognition is crucial for enhancing the capabilities of intelligent marine robotics, particularly in applications such as marine conservation, environmental monitoring, and disaster response. However, this task presents significant challenges due to environmental interference, where marine conditions degrade image quality, and the complexity of maritime scenes, which requires deeper reasoning for accurate recognition. Pure vision models alone are insufficient to address these issues. To overcome these limitations, we propose a novel multimodal Artificial Intelligence (AI) framework that integrates image data, textual descriptions, and classification vectors generated by a Multimodal Large Language Model (MLLM) to provide richer semantic understanding and improve recognition accuracy. Our framework employs an efficient multimodal fusion mechanism to further enhance model robustness and adaptability in complex maritime environments. Experimental results show that our model achieves 98% accuracy, surpassing previous SOTA models by 3.5%. To optimize deployment on resource-constrained platforms, we adopt activation-aware weight quantization (AWQ) as a lightweight technique, reducing the model size to 68.75 MB with only a 0.5% accuracy drop while significantly lowering computational overhead. This work provides a high-performance solution for real-time maritime scene recognition, enabling Autonomous Surface Vehicles (ASVs) to support environmental monitoring and disaster response in resource-limited settings.
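As a rough illustration of the fusion mechanism described above, the sketch below concatenates an image feature, a text embedding, and an MLLM-generated classification vector and passes them through a small fusion head. All dimensions, names, and the concatenation-plus-MLP design are assumptions for illustration; the paper's actual fusion mechanism may differ, and AWQ would be applied to the trained model afterwards.

```python
# Hypothetical sketch of tri-modal feature fusion (not the paper's actual code).
import torch
import torch.nn as nn

class TriModalFusionHead(nn.Module):
    """Fuses image features, text embeddings, and an MLLM class vector."""
    def __init__(self, img_dim=512, txt_dim=512, cls_dim=32, num_classes=8):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim + cls_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, img_feat, txt_emb, mllm_cls_vec):
        # Simple early fusion: concatenate the three modality vectors.
        x = torch.cat([img_feat, txt_emb, mllm_cls_vec], dim=-1)
        return self.fuse(x)

head = TriModalFusionHead()
logits = head(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 32))
print(logits.shape)  # torch.Size([4, 8])
```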
Related papers
- Secure Low-altitude Maritime Communications via Intelligent Jamming [53.42658269206017]
Low-altitude wireless networks (LAWNs) have emerged as a viable solution for maritime communications. The open and clear UAV communication channels make maritime LAWNs vulnerable to eavesdropping attacks. We propose a low-altitude maritime communication system that employs intelligent jamming to counter dynamic eavesdroppers.
arXiv Detail & Related papers (2025-11-10T03:16:19Z)
- Multi-Model Synthetic Training for Mission-Critical Small Language Models [0.0]
We present a novel approach that achieves a 261x cost reduction for maritime intelligence. Our method transforms 3.2 billion Automatic Identification System (AIS) vessel tracking records into 21,543 synthetic question-and-answer pairs (a toy version of this transformation is sketched below). The resulting fine-tuned Qwen2.5-7B model achieves 75% accuracy on maritime tasks, while being substantially cheaper than using a larger model for inference.
arXiv Detail & Related papers (2025-09-16T13:04:48Z)
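As a loose illustration of turning AIS records into synthetic training pairs, the snippet below templates one question-and-answer pair from a single record. The field names and the template are hypothetical; the paper's actual generation pipeline uses multiple models and is far more elaborate.

```python
# Hypothetical AIS-record-to-QA templating (illustrative only).
def ais_to_qa(record: dict) -> dict:
    """Turn one AIS vessel-tracking record into a synthetic QA pair."""
    question = (
        f"Which vessel was near ({record['lat']:.3f}, {record['lon']:.3f}) "
        f"at {record['timestamp']}, and what was its speed?"
    )
    answer = (
        f"{record['vessel_name']} (MMSI {record['mmsi']}) was transiting the area "
        f"at {record['speed_knots']} knots on course {record['course_deg']} deg."
    )
    return {"question": question, "answer": answer}

sample = {
    "lat": 1.264, "lon": 103.822, "timestamp": "2024-06-01T04:12:00Z",
    "vessel_name": "MV EXAMPLE", "mmsi": 563000000,
    "speed_knots": 12.4, "course_deg": 78,
}
print(ais_to_qa(sample)["question"])
```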
- BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion [6.8723394189831035]
Large language models pose challenges for deployment in resource-constrained environments. We propose a lightweight MLLM framework for end-to-end visual question answering. Our proposed approach centres on BreezeCLIP, a compact yet powerful vision-language model optimised for efficient multimodal understanding (see the sketch below).
arXiv Detail & Related papers (2025-09-10T16:09:49Z)
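A gated cross-modal fusion step of the kind the title suggests might look like the sketch below, where text queries attend over visual tokens and a learned sigmoid gate controls how much of the attended signal is admitted. This is a generic reconstruction from the name alone, not BcQLM's actual design.

```python
# Generic Q-gated cross-modal fusion sketch (not BcQLM's actual code).
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text_tokens, vision_tokens):
        # Text queries attend over visual tokens.
        attended, _ = self.cross_attn(text_tokens, vision_tokens, vision_tokens)
        # A sigmoid gate decides, per channel, how much visual evidence to admit.
        g = self.gate(torch.cat([text_tokens, attended], dim=-1))
        return text_tokens + g * attended

fusion = GatedCrossModalFusion()
out = fusion(torch.randn(2, 16, 256), torch.randn(2, 196, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```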
- AQ-PCDSys: An Adaptive Quantized Planetary Crater Detection System for Autonomous Space Exploration [0.0]
This paper introduces the Adaptive Quantized Planetary Crater Detection System (AQ-PCDSys). AQ-PCDSys integrates a Quantized Neural Network (QNN) architecture, trained using Quantization-Aware Training (QAT). The AMF module intelligently fuses data from Optical Imagery (OI) and Digital Elevation Models (DEMs) at the feature level.
arXiv Detail & Related papers (2025-08-25T13:44:00Z)
- ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving [62.9051914830949]
We present ROVR, a large-scale, diverse, and cost-efficient depth dataset designed to capture the complexity of real-world driving. A lightweight acquisition pipeline ensures scalable collection, while sparse but statistically sufficient ground truth supports robust training. Benchmarking with state-of-the-art monocular depth models reveals severe cross-dataset generalization failures.
arXiv Detail & Related papers (2025-08-19T16:13:49Z)
- Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation [53.16213723669751]
Large-scale models (LSMs) can be an effective framework for semantic representation and understanding. However, their direct deployment is often hindered by high computational complexity and resource requirements. This paper proposes a novel knowledge distillation based semantic communication framework (the standard distillation recipe is sketched below).
arXiv Detail & Related papers (2025-08-04T07:47:18Z)
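Knowledge distillation of the kind mentioned above is commonly implemented by matching a student's softened output distribution to a teacher's, plus a standard task loss. The sketch below shows that generic recipe; the temperature, weighting, and the robustness modifications the paper adds are not reproduced here.

```python
# Generic knowledge-distillation loss (standard recipe, not the paper's exact one).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: KL between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: usual cross-entropy on ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10), torch.randint(0, 10, (8,)))
print(loss.item())
```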
- VRS-UIE: Value-Driven Reordering Scanning for Underwater Image Enhancement [104.78586859995333]
State Space Models (SSMs) have emerged as a promising backbone for vision tasks due to their linear complexity and global receptive field. The predominance of large-portion, homogeneous but useless oceanic backgrounds can dilute the feature representation responses of sparse yet valuable targets. We propose a novel Value-Driven Reordering Scanning framework for Underwater Image Enhancement (UIE); a toy version of the reordering idea is sketched below. Our framework sets a new state-of-the-art, delivering superior enhancement performance (surpassing WMamba by 0.89 dB on average) by effectively suppressing water bias and preserving structural and color fidelity.
arXiv Detail & Related papers (2025-05-02T12:21:44Z)
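The core trick implied by "value-driven reordering" is to score tokens, scan them in score order rather than raster order, and scatter the results back. The sketch below shows that skeleton with a stand-in scoring function and a stand-in sequence model; it is an interpretation of the name, not the paper's implementation.

```python
# Value-driven reordering skeleton (interpretation, not the paper's code).
import torch

def value_ordered_scan(tokens, value_scores, seq_model):
    """tokens: (B, N, C); value_scores: (B, N). Scan high-value tokens first."""
    order = value_scores.argsort(dim=1, descending=True)          # (B, N)
    gather_idx = order.unsqueeze(-1).expand_as(tokens)
    reordered = tokens.gather(1, gather_idx)                      # value order
    processed = seq_model(reordered)                              # any (B,N,C)->(B,N,C) scan
    inverse = order.argsort(dim=1).unsqueeze(-1).expand_as(tokens)
    return processed.gather(1, inverse)                           # back to raster order

B, N, C = 2, 64, 32
tokens = torch.randn(B, N, C)
scores = tokens.abs().mean(-1)            # stand-in "value" score per token
out = value_ordered_scan(tokens, scores, seq_model=lambda x: x.cumsum(dim=1))
print(out.shape)  # torch.Size([2, 64, 32])
```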
- Learning Underwater Active Perception in Simulation [51.205673783866146]
Turbidity can jeopardise the whole mission as it may prevent correct visual documentation of the inspected structures.
Previous works have introduced methods to adapt to turbidity and backscattering.
We propose a simple yet efficient approach to enable high-quality image acquisition of assets in a broad range of water conditions.
arXiv Detail & Related papers (2025-04-23T06:48:38Z)
- An Efficient and Mixed Heterogeneous Model for Image Restoration [71.85124734060665]
Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas.
We propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion.
arXiv Detail & Related papers (2025-04-15T08:19:12Z)
- Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment [55.74860093731475]
Marmot is a novel framework that employs Multi-Agent Reasoning for Multi-Object Self-Correcting. We construct a multi-agent self-correcting system featuring a decision-execution-verification mechanism (a schematic loop is sketched below). Experiments demonstrate that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships.
arXiv Detail & Related papers (2025-04-10T16:54:28Z)
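A decision-execution-verification loop can be reduced to the control flow below, with hypothetical `decide`, `execute`, and `verify` callables standing in for the agents. It shows only the self-correction cycle; Marmot's actual agents, prompts, and editing tools are not represented.

```python
# Schematic decision-execution-verification loop (hypothetical agents).
def self_correct(image, issues, decide, execute, verify, max_rounds=3):
    """Iteratively fix a list of issues (e.g., wrong object count) in an image."""
    for _ in range(max_rounds):
        if not issues:
            break
        plan = decide(image, issues)        # decision agent picks an edit
        image = execute(image, plan)        # execution agent applies it
        issues = verify(image, issues)      # verification agent re-checks
    return image, issues

# Toy run: each "edit" resolves one pending issue.
img, left = self_correct(
    image="img_v0", issues=["count", "color"],
    decide=lambda im, iss: f"fix:{iss[0]}",
    execute=lambda im, plan: im + "+" + plan,
    verify=lambda im, iss: iss[1:],
)
print(img, left)  # img_v0+fix:count+fix:color []
```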
- M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment [65.3860007085689]
M3-AGIQA is a comprehensive framework for AGI quality assessment. It includes a structured multi-round evaluation mechanism, where intermediate image descriptions are generated. Experiments conducted on multiple benchmark datasets demonstrate that M3-AGIQA achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-02-21T03:05:45Z)
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations (a toy version is sketched below).
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
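Learning a reward model without human labels usually means scoring trajectories with automatically derived success signals and fitting a classifier or ranker to them. The sketch below fits a tiny scorer on automatically labeled trajectory embeddings; the paper's trajectory synthesis and labeling pipeline is assumed away here.

```python
# Minimal reward-model fit on auto-labeled trajectories (illustrative only).
import torch
import torch.nn as nn

reward_model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-ins: trajectory embeddings with environment-derived success labels.
traj_emb = torch.randn(256, 64)
success = (traj_emb[:, 0] > 0).float().unsqueeze(1)  # fake auto-label

for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(reward_model(traj_emb), success)
    loss.backward()
    opt.step()

# The trained scorer can now rank candidate action sequences during planning.
print(reward_model(torch.randn(3, 64)).squeeze(1))
```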
- SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding [66.74446220401296]
We propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. We introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding (token folding is sketched below). Our code and models shall be released.
arXiv Detail & Related papers (2024-12-12T18:59:26Z)
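Token folding, as generally understood, shortens a visual token sequence by merging each spatial neighborhood into one token. The sketch below folds 2x2 patches with a space-to-depth reshape plus a linear projection; SynerGen-VL's exact folding ratio and projection are assumptions here.

```python
# Generic 2x2 token folding via space-to-depth (assumed design, not the paper's).
import torch
import torch.nn as nn

def fold_tokens(x, proj):
    """x: (B, H, W, C) visual tokens -> (B, H/2 * W/2, C) folded tokens."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // 2, 2, W // 2, 2, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, (H // 2) * (W // 2), 4 * C)
    return proj(x)  # project 4C back down to C, quartering sequence length

proj = nn.Linear(4 * 64, 64)
folded = fold_tokens(torch.randn(1, 16, 16, 64), proj)
print(folded.shape)  # torch.Size([1, 64, 64])
```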
- Quantization-Aware Imitation-Learning for Resource-Efficient Robotic Control [11.365124223329582]
We propose a new quantization framework for IL-based policy models that fine-tunes parameters to enhance robustness against low-bit precision errors (the basic weight quantize-dequantize step is sketched below). Our evaluations with robot manipulation for 4-bit weight-quantization on a real edge GPU demonstrate that our framework achieves up to 2.5x speedup and 2.5x energy savings. These results highlight the practical potential of deploying IL-based policy models on resource-constrained devices.
arXiv Detail & Related papers (2024-12-02T01:33:49Z)
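Quantization-aware fine-tuning hinges on a fake-quantization step: weights are rounded to a low-bit grid and dequantized in the forward pass, so training sees the precision error. Below is a standard per-channel symmetric 4-bit quantize-dequantize; the scale handling and straight-through estimator details in the paper may differ.

```python
# Standard symmetric per-channel 4-bit fake quantization (generic recipe).
import torch

def fake_quant_4bit(w: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize weights to the int4 range [-8, 7], per output channel."""
    qmax = 7.0
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-8, 7)
    return q * scale  # dequantized weights carry the rounding error

w = torch.randn(16, 64)          # e.g., one policy-network weight matrix
w_q = fake_quant_4bit(w)
print((w - w_q).abs().max())     # worst-case quantization error
```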
- Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation. We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations (a multi-scale convolution branch is sketched below). We build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z)
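The multi-scale convolutional augmentation can be pictured as parallel depthwise convolutions at several kernel sizes whose outputs are summed before a pointwise mix. That generic pattern is sketched below; the DVSS block's actual composition with the state space path is not reproduced.

```python
# Generic multi-scale depthwise convolution branch (pattern only, not DVSS itself).
import torch
import torch.nn as nn

class MultiScaleDWConv(nn.Module):
    def __init__(self, channels=64, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        self.mix = nn.Conv2d(channels, channels, 1)  # pointwise channel mixing

    def forward(self, x):
        # Sum same-resolution responses at several receptive-field sizes.
        return self.mix(sum(branch(x) for branch in self.branches))

block = MultiScaleDWConv()
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```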
- LAR-IQA: A Lightweight, Accurate, and Robust No-Reference Image Quality Assessment Model [6.074775040047959]
We propose a compact, lightweight NR-IQA model that achieves state-of-the-art (SOTA) performance on the ECCV AIM UHD-IQA challenge validation and test datasets (the dual-branch layout is sketched below).
Our model features a dual-branch architecture, with each branch separately trained on synthetically and authentically distorted images.
Our evaluation considering various open-source datasets highlights the practical, high-accuracy, and robust performance of our proposed lightweight model.
arXiv Detail & Related papers (2024-08-30T07:32:19Z)
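A dual-branch NR-IQA model of the kind described pairs one branch for synthetic distortions with one for authentic distortions and regresses a quality score from both. The sketch below uses two toy convolutional branches; the actual backbones and the separate-training protocol are only hinted at in comments.

```python
# Toy dual-branch NR-IQA head (layout sketch; real backbones are larger).
import torch
import torch.nn as nn

def branch(out_dim=64):
    # Stand-in feature extractor; per the paper, each branch is trained
    # separately (one on synthetic, one on authentic distortions).
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, out_dim),
    )

class DualBranchIQA(nn.Module):
    def __init__(self):
        super().__init__()
        self.synthetic, self.authentic = branch(), branch()
        self.head = nn.Linear(128, 1)  # fuse both views into one quality score

    def forward(self, img):
        feats = torch.cat([self.synthetic(img), self.authentic(img)], dim=-1)
        return self.head(feats)

print(DualBranchIQA()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 1])
```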
- VmambaIR: Visual State Space Model for Image Restoration [36.11385876754612]
We propose VmambaIR, which introduces State Space Models (SSMs) with linear complexity into comprehensive image restoration tasks.
VmambaIR achieves state-of-the-art (SOTA) performance with much fewer computational resources and parameters.
arXiv Detail & Related papers (2024-03-18T02:38:55Z)
- Multi-Hierarchical Surrogate Learning for Structural Dynamical Crash Simulations Using Graph Convolutional Neural Networks [5.582881461692378]
We propose a multi-hierarchical framework for structurally creating a series of surrogate models for a kart frame.
For multiscale phenomena, macroscale features are captured on a coarse surrogate, whereas microscale effects are resolved by finer ones.
We train a graph-convolutional neural network-based surrogate that learns parameter-dependent low-dimensional latent dynamics on the coarsest representation (a single graph-convolution step is sketched below).
arXiv Detail & Related papers (2024-02-14T15:22:59Z)
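One graph-convolution step on a mesh graph can be written as symmetric-normalized neighborhood aggregation followed by a learned linear map, as in the sketch below with a random adjacency standing in for the coarse kart-frame mesh; the latent-dynamics model built on top is omitted.

```python
# One symmetric-normalized graph convolution step (generic GCN, stand-in graph).
import torch

def gcn_layer(H, A, W):
    """H: (N, F) node features; A: (N, N) adjacency; W: (F, F_out) weights."""
    A_hat = A + torch.eye(A.shape[0])                 # add self-loops
    d = A_hat.sum(dim=1)
    D_inv_sqrt = torch.diag(d.rsqrt())                # D^{-1/2}
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

N, F = 10, 8                                          # toy coarse mesh: 10 nodes
A = (torch.rand(N, N) > 0.7).float()
A = ((A + A.T) > 0).float()                           # symmetrize
print(gcn_layer(torch.randn(N, F), A, torch.randn(F, 4)).shape)  # torch.Size([10, 4])
```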
- Parameter-efficient Tuning of Large-scale Multimodal Foundation Model [68.24510810095802]
We propose a graceful prompt framework for cross-modal transfer (Aurora) to overcome these challenges.
Considering the redundancy in existing architectures, we first utilize the mode approximation to generate 0.1M trainable parameters to implement the multimodal prompt tuning (the freeze-backbone pattern is sketched below).
A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach.
arXiv Detail & Related papers (2023-05-15T06:40:56Z)
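Parameter-efficient tuning in the spirit described above freezes the large backbone and trains only a small set of injected parameters. The sketch below freezes a stand-in backbone, trains a tiny low-rank module, and counts trainable parameters; Aurora's actual mode-approximation factorization is not reproduced.

```python
# Freeze-backbone tuning with a tiny low-rank module (Aurora-style idea,
# not the paper's implementation).
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
for p in backbone.parameters():
    p.requires_grad = False              # the large model stays frozen

class LowRankPrompt(nn.Module):
    def __init__(self, dim=512, rank=4):
        super().__init__()
        self.down, self.up = nn.Linear(dim, rank), nn.Linear(rank, dim)

    def forward(self, x):
        return x + self.up(self.down(x))  # cheap learned perturbation

prompt = LowRankPrompt()
trainable = sum(p.numel() for p in prompt.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # a few thousand vs. the frozen backbone
print(backbone(prompt(torch.randn(2, 512))).shape)  # torch.Size([2, 512])
```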
- Semantic-aware Texture-Structure Feature Collaboration for Underwater Image Enhancement [58.075720488942125]
Underwater image enhancement has become an attractive topic as a significant technology in marine engineering and aquatic robotics.
We develop an efficient and compact enhancement network in collaboration with a high-level semantic-aware pretrained model.
We also apply the proposed algorithm to the underwater salient object detection task to reveal the favorable semantic-aware ability for high-level vision tasks.
arXiv Detail & Related papers (2022-11-19T07:50:34Z)
- Interpretable Hyperspectral AI: When Non-Convex Modeling meets Hyperspectral Remote Sensing [57.52865154829273]
Hyperspectral imaging, also known as image spectrometry, is a landmark technique in geoscience remote sensing (RS).
In the past decade, efforts have been made to process and analyze these hyperspectral (HS) products, mainly by means of seasoned experts.
For this reason, it is urgent to develop more intelligent and automatic approaches for various HS RS applications.
arXiv Detail & Related papers (2021-03-02T03:32:10Z)