Related papers: Domain-Robust Marine Plastic Detection Using Vision Models

Domain-Robust Marine Plastic Detection Using Vision Models

URL: http://arxiv.org/abs/2510.03294v1
Date: Mon, 29 Sep 2025 17:15:07 GMT
Title: Domain-Robust Marine Plastic Detection Using Vision Models
Authors: Saanvi Kataria,
Abstract summary: This study benchmarks models for cross-domain robustness, training convolutional neural networks and vision transformers.<n>Two zero-shot models were assessed, CLIP ViT-L14 and Google's Gemini 2.0 Flash, that leverage pretraining to classify images without fine-tuning.<n>Results show the lightweight MobileNetV2 delivers the strongest cross-domain performance (F1 0.97), surpassing larger models.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Marine plastic pollution is a pressing environmental threat, making reliable automation for underwater debris detection essential. However, vision systems trained on one dataset often degrade on new imagery due to domain shift. This study benchmarks models for cross-domain robustness, training convolutional neural networks - CNNs (MobileNetV2, ResNet-18, EfficientNet-B0) and vision transformers (DeiT-Tiny, ViT-B16) on a labeled underwater dataset and then evaluates them on a balanced cross-domain test set built from plastic-positive images drawn from a different source and negatives from the training domain. Two zero-shot models were assessed, CLIP ViT-L14 and Google's Gemini 2.0 Flash, that leverage pretraining to classify images without fine-tuning. Results show the lightweight MobileNetV2 delivers the strongest cross-domain performance (F1 0.97), surpassing larger models. All fine-tuned models achieved high Precision (around 99%), but differ in Recall, indicating varying sensitivity to plastic instances. Zero-shot CLIP is comparatively sensitive (Recall around 80%) yet prone to false positives (Precision around 56%), whereas Gemini exhibits the inverse profile (Precision around 99%, Recall around 81%). Error analysis highlights recurring confusions with coral textures, suspended particulates, and specular glare. Overall, compact CNNs with supervised training can generalize effectively for cross-domain underwater detection, while large pretrained vision-language models provide complementary strengths.

Related papers

The Weight of a Bit: EMFI Sensitivity Analysis of Embedded Deep Learning Models [3.2913641623634913]
We investigate how four different number representations influence the success of an EMFI attack on embedded neural network models.<n>We deployed four common image classifiers, ResNet-18, ResNet-34, ResNet-50, and VGG-11, on an embedded memory chip, and utilized a low-cost EMFI platform to trigger faults.<n>Our results show that while floating-point representations exhibit almost a complete degradation in accuracy (Top-1 and Top-5) after a single fault injection, integer representations offer better resistance overall.
arXiv Detail & Related papers (2026-02-18T09:40:29Z)
Benchmarking IoT Time-Series AD with Event-Level Augmentations [34.864214444544565]
We introduce an evaluation protocol with unified event-level augmentations that simulate real-world issues.<n>We evaluate 14 representative models on five public anomaly datasets.
arXiv Detail & Related papers (2026-02-17T09:45:44Z)
From Field to Drone: Domain Drift Tolerant Automated Multi-Species and Damage Plant Semantic Segmentation for Herbicide Trials [1.0483690290582848]
We present a general-purpose self-supervised visual model with hierarchical inference based on botanical taxonomy.<n>The model significantly improved species identification (F1-score: 0.52 to 0.85, R-squared: 0.75 to 0.98) and damage classification (F1-score: 0.28 to 0.44, R-squared: 0.71 to 0.87) over prior models.<n>It is now deployed in BASF's phenotyping pipeline, enabling large-scale, automated crop and weed monitoring across diverse geographies.
arXiv Detail & Related papers (2025-08-11T00:08:42Z)
DiRecNetV2: A Transformer-Enhanced Network for Aerial Disaster Recognition [4.678150356894011]
integration of Unmanned Aerial Vehicles with artificial intelligence (AI) models for aerial imagery processing in disaster assessment requires exceptional accuracy, computational efficiency, and real-time processing capabilities. Traditionally Convolutional Neural Networks (CNNs) demonstrate efficiency in local feature extraction but are limited by their potential for global context interpretation. Vision Transformers (ViTs) show promise for improved global context interpretation through the use of attention mechanisms, although they still remain underinvestigated in UAV-based disaster response applications.
arXiv Detail & Related papers (2024-10-17T15:25:13Z)
Classification robustness to common optical aberrations [64.08840063305313]
This paper proposes OpticsBench, a benchmark for investigating robustness to realistic, practically relevant optical blur effects. Experiments on ImageNet show that for a variety of different pre-trained DNNs, the performance varies strongly compared to disk-shaped kernels. We show on ImageNet-100 with OpticsAugment that can be increased by using optical kernels as data augmentation.
arXiv Detail & Related papers (2023-08-29T08:36:00Z)
Large-scale Robustness Analysis of Video Action Recognition Models [10.017292176162302]
We study robustness of six state-of-the-art action recognition models against 90 different perturbations. The study reveals some interesting findings, 1) transformer based models are consistently more robust compared to CNN based models, 2) Pretraining improves robustness for Transformer based models more than CNN based models, and 3) All of the studied models are robust to temporal perturbations for all datasets but SSv2.
arXiv Detail & Related papers (2022-07-04T13:29:34Z)
From Environmental Sound Representation to Robustness of 2D CNN Models Against Adversarial Attacks [82.21746840893658]
This paper investigates the impact of different standard environmental sound representations (spectrograms) on the recognition performance and adversarial attack robustness of a victim residual convolutional neural network. We show that while the ResNet-18 model trained on DWT spectrograms achieves a high recognition accuracy, attacking this model is relatively more costly for the adversary.
arXiv Detail & Related papers (2022-04-14T15:14:08Z)
Core Risk Minimization using Salient ImageNet [53.616101711801484]
We introduce the Salient Imagenet dataset with more than 1 million soft masks localizing core and spurious features for all 1000 Imagenet classes. Using this dataset, we first evaluate the reliance of several Imagenet pretrained models (42 total) on spurious features. Next, we introduce a new learning paradigm called Core Risk Minimization (CoRM) whose objective ensures that the model predicts a class using its core features.
arXiv Detail & Related papers (2022-03-28T01:53:34Z)
Corrupted Image Modeling for Self-Supervised Visual Pre-Training [103.99311611776697]
We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training. CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial mask tokens. After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks.
arXiv Detail & Related papers (2022-02-07T17:59:04Z)
A Comprehensive Study of Image Classification Model Sensitivity to Foregrounds, Backgrounds, and Visual Attributes [58.633364000258645]
We call this dataset RIVAL10 consisting of roughly $26k$ instances over $10$ classes. We evaluate the sensitivity of a broad set of models to noise corruptions in foregrounds, backgrounds and attributes. In our analysis, we consider diverse state-of-the-art architectures (ResNets, Transformers) and training procedures (CLIP, SimCLR, DeiT, Adversarial Training)
arXiv Detail & Related papers (2022-01-26T06:31:28Z)
Forward-Looking Sonar Patch Matching: Modern CNNs, Ensembling, and Uncertainty [0.0]
Convolutional Neural Network (CNN) learns a similarity function and predicts if two input sonar images are similar or not. Best performing models are DenseNet Two-Channel network with 0.955 AUC, VGG-Siamese with contrastive loss at 0.949 AUC and DenseNet Siamese with 0.921 AUC.
arXiv Detail & Related papers (2021-08-02T17:49:56Z)
Contemplating real-world object classification [53.10151901863263]
We reanalyze the ObjectNet dataset recently proposed by Barbu et al. containing objects in daily life situations. We find that applying deep models to the isolated objects, rather than the entire scene as is done in the original paper, results in around 20-30% performance improvement.
arXiv Detail & Related papers (2021-03-08T23:29:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.