FishDet-M: A Unified Large-Scale Benchmark for Robust Fish Detection and CLIP-Guided Model Selection in Diverse Aquatic Visual Domains
- URL: http://arxiv.org/abs/2507.17859v1
- Date: Wed, 23 Jul 2025 18:32:01 GMT
- Authors: Muayad Abujabal, Lyes Saad Saoud, Irfan Hussain
- Abstract summary: FishDet-M is the largest unified benchmark for fish detection, comprising 13 publicly available datasets spanning diverse aquatic environments. All data are harmonized using COCO-style annotations with both bounding boxes and segmentation masks. FishDet-M establishes a standardized and reproducible platform for evaluating object detection in complex aquatic scenes.
- Score: 1.3791394805787949
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Accurate fish detection in underwater imagery is essential for ecological monitoring, aquaculture automation, and robotic perception. However, practical deployment remains limited by fragmented datasets, heterogeneous imaging conditions, and inconsistent evaluation protocols. To address these gaps, we present FishDet-M, the largest unified benchmark for fish detection, comprising 13 publicly available datasets spanning diverse aquatic environments including marine, brackish, occluded, and aquarium scenes. All data are harmonized using COCO-style annotations with both bounding boxes and segmentation masks, enabling consistent and scalable cross-domain evaluation. We systematically benchmark 28 contemporary object detection models, covering the YOLOv8 to YOLOv12 series, R-CNN based detectors, and DETR based models. Evaluations are conducted using standard metrics including mAP, mAP@50, and mAP@75, along with scale-specific analyses (AP$_S$, AP$_M$, AP$_L$) and inference profiling in terms of latency and parameter count. The results highlight the varying detection performance of models trained on FishDet-M, as well as the accuracy-efficiency trade-offs across architectures. To support adaptive deployment, we introduce a CLIP-based model selection framework that leverages vision-language alignment to dynamically identify the most semantically appropriate detector for each input image. This zero-shot selection strategy achieves high performance without requiring ensemble computation, offering a scalable solution for real-time applications. FishDet-M establishes a standardized and reproducible platform for evaluating object detection in complex aquatic scenes. All datasets, pretrained models, and evaluation tools are publicly available to facilitate future research in underwater computer vision and intelligent marine systems.
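Because the benchmark standardizes on COCO-style annotations, per-model scores can be reproduced with the stock pycocotools pipeline. A minimal sketch, assuming a hypothetical annotation file `fishdet_m_val.json` and one detector's predictions exported to `detections.json` in COCO results format (the paper's actual file layout is not specified here):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("fishdet_m_val.json")          # ground-truth annotations (hypothetical name)
coco_dt = coco_gt.loadRes("detections.json")  # one detector's predictions
ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()  # reports mAP, mAP@50, mAP@75, and AP_S / AP_M / AP_L
```

The CLIP-guided selection step can likewise be sketched from the abstract's description: embed the input image and a set of domain prompts with CLIP, then route the image to the detector whose prompt matches best. The prompts and detector names below are illustrative placeholders, not the paper's configuration:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical mapping from domain descriptions to trained detectors.
DOMAIN_TO_DETECTOR = {
    "a photo of fish in a clear aquarium": "yolov8x-aquarium",
    "a photo of fish in turbid brackish water": "detr-brackish",
    "a photo of partially occluded fish in the open ocean": "faster-rcnn-marine",
}

@torch.no_grad()
def select_detector(image):
    """Return the detector whose domain prompt best matches a PIL image."""
    prompts = list(DOMAIN_TO_DETECTOR)
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    # logits_per_image: (1, num_prompts) scaled image-text similarities
    sims = model(**inputs).logits_per_image
    return DOMAIN_TO_DETECTOR[prompts[sims.argmax(dim=-1).item()]]

# e.g. select_detector(Image.open("frame.jpg")) with PIL.Image imported
```

Because selection needs only a single CLIP forward pass per image, it avoids running every detector in an ensemble, which is what makes the strategy attractive for real-time deployment.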
Related papers
- Underwater Monocular Metric Depth Estimation: Real-World Benchmarks and Synthetic Fine-Tuning with Vision Foundation Models [0.0]
We present a benchmark of zero-shot and fine-tuned monocular metric depth estimation models on real-world underwater datasets. Our results show that large-scale models trained on terrestrial data (real or synthetic) are effective in in-air settings, but perform poorly underwater. This study presents a detailed evaluation and visualization of monocular metric depth estimation in underwater scenes.
arXiv Detail & Related papers (2025-07-02T21:06:39Z)
- YH-MINER: Multimodal Intelligent System for Natural Ecological Reef Metric Extraction [23.4289262373633]
Coral reefs, crucial for sustaining marine biodiversity and ecological processes, face escalating threats. This study develops the YH-MINER system, establishing an intelligent framework for "object detection-semantic segmentation-prior input". The system achieves genus-level classification accuracy of 88% while simultaneously extracting core ecological metrics.
arXiv Detail & Related papers (2025-05-28T11:36:18Z)
- UWSAM: Segment Anything Model Guided Underwater Instance Segmentation and A Large-scale Benchmark Dataset [62.00529957144851]
We propose a large-scale underwater instance segmentation dataset, UIIS10K, which includes 10,048 images with pixel-level annotations for 10 categories. We then introduce UWSAM, an efficient model designed for automatic and accurate segmentation of underwater instances. We show that our model is effective, achieving significant performance improvements over state-of-the-art methods on multiple underwater instance datasets.
arXiv Detail & Related papers (2025-05-21T14:36:01Z)
- RD-UIE: Relation-Driven State Space Modeling for Underwater Image Enhancement [59.364418120895]
Underwater image enhancement (UIE) is a critical preprocessing step for marine vision applications. We develop a novel relation-driven Mamba framework for effective UIE (RD-UIE). Experiments on underwater enhancement benchmarks demonstrate that RD-UIE outperforms the state-of-the-art approach WMamba.
arXiv Detail & Related papers (2025-05-02T12:21:44Z)
- Image-Based Relocalization and Alignment for Long-Term Monitoring of Dynamic Underwater Environments [57.59857784298534]
We propose an integrated pipeline that combines Visual Place Recognition (VPR), feature matching, and image segmentation on video-derived images. This method enables robust identification of revisited areas, estimation of rigid transformations, and downstream analysis of ecosystem changes.
arXiv Detail & Related papers (2025-03-06T05:13:19Z)
- A Practical Approach to Underwater Depth and Surface Normals Estimation [3.0516727053033392]
This paper presents a novel deep learning model for Monocular Depth and Surface Normals Estimation (MDSNE). It is specifically tailored for underwater environments, using a hybrid architecture that integrates CNNs with Transformers. Our model reduces parameters by 90% and training costs by 80%, allowing real-time 3D perception on resource-constrained devices.
arXiv Detail & Related papers (2024-10-02T22:41:12Z)
- Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC).
LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses.
LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- DARTH: Holistic Test-time Adaptation for Multiple Object Tracking [87.72019733473562]
Multiple object tracking (MOT) is a fundamental component of perception systems for autonomous driving.
Despite the safety-critical nature of driving systems, no solution to adapting MOT to domain shift under test-time conditions has been proposed.
We introduce DARTH, a holistic test-time adaptation framework for MOT.
arXiv Detail & Related papers (2023-10-03T10:10:42Z)
- FishMOT: A Simple and Effective Method for Fish Tracking Based on IoU Matching [11.39414015803651]
FishMOT is a novel fish tracking approach combining object detection and IoU matching.
The method exhibits excellent robustness and generalizability for varying environments and fish numbers.
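The IoU-matching idea behind FishMOT is simple enough to sketch. A minimal, illustrative association step, assuming axis-aligned boxes as (x1, y1, x2, y2) tuples and a hypothetical match threshold (FishMOT's actual matching logic and parameters may differ):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, thresh=0.3):
    """Greedily match existing track boxes to new detections by highest IoU."""
    matches, unmatched = [], list(range(len(detections)))
    for t_idx, t_box in enumerate(tracks):
        scored = [(iou(t_box, detections[d]), d) for d in unmatched]
        if scored:
            best_iou, best_d = max(scored)
            if best_iou >= thresh:
                matches.append((t_idx, best_d))
                unmatched.remove(best_d)
    return matches, unmatched  # unmatched detections seed new tracks
```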
arXiv Detail & Related papers (2023-09-06T13:16:41Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- SVAM: Saliency-guided Visual Attention Modeling by Autonomous Underwater Robots [16.242924916178282]
This paper presents a holistic approach to saliency-guided visual attention modeling (SVAM) for use by autonomous underwater robots.
Our proposed model, named SVAM-Net, integrates deep visual features at various scales and semantics for effective salient object detection (SOD) in natural underwater images.
arXiv Detail & Related papers (2020-11-12T08:17:21Z)
- A Realistic Fish-Habitat Dataset to Evaluate Algorithms for Underwater Visual Analysis [2.6476746128312194]
We present DeepFish as a benchmark suite with a large-scale dataset to train and test methods for several computer vision tasks.
The dataset consists of approximately 40 thousand images collected underwater from 20 habitats in the marine environments of tropical Australia.
Our experiments provide an in-depth analysis of the dataset characteristics, and the performance evaluation of several state-of-the-art approaches.
arXiv Detail & Related papers (2020-08-28T12:20:59Z)
- Overcoming Classifier Imbalance for Long-tail Object Detection with Balanced Group Softmax [88.11979569564427]
We provide the first systematic analysis of the underperformance of state-of-the-art models on long-tailed distributions.
We propose a novel balanced group softmax (BAGS) module for balancing the classifiers within the detection frameworks through group-wise training.
Extensive experiments on the very recent long-tail large vocabulary object recognition benchmark LVIS show that our proposed BAGS significantly improves the performance of detectors.
arXiv Detail & Related papers (2020-06-18T10:24:26Z)
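The group-wise softmax idea from the BAGS entry above can be sketched as a group-wise cross-entropy: categories are binned by training-instance count, and the softmax is computed only within each bin so that head classes cannot suppress tail classes. This is a simplified illustration; the published BAGS module also adds an "others" category per group and background handling, which are omitted here:

```python
import torch
import torch.nn.functional as F

def grouped_softmax_loss(logits, targets, groups):
    """Group-wise cross-entropy over class bins.

    logits:  (N, C) raw classifier outputs
    targets: (N,)   global class labels
    groups:  list of 1-D LongTensors partitioning the C class ids by frequency,
             e.g. [torch.tensor([0, 1, 2]), torch.tensor([3, 4])] for head vs. tail
    """
    loss = logits.new_zeros(())
    for class_ids in groups:
        mask = torch.isin(targets, class_ids)      # samples labeled within this bin
        if not mask.any():
            continue
        group_logits = logits[mask][:, class_ids]  # softmax restricted to the bin
        # remap global labels to within-group indices
        remap = {c: i for i, c in enumerate(class_ids.tolist())}
        group_targets = torch.tensor([remap[t] for t in targets[mask].tolist()])
        loss = loss + F.cross_entropy(group_logits, group_targets)
    return loss
```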