SemSegBench & DetecBench: Benchmarking Reliability and Generalization Beyond Classification
- URL: http://arxiv.org/abs/2505.18015v1
- Date: Fri, 23 May 2025 15:17:45 GMT
- Title: SemSegBench & DetecBench: Benchmarking Reliability and Generalization Beyond Classification
- Authors: Shashank Agnihotri, David Schader, Jonas Jakubassa, Nico Sharei, Simon Kral, Mehmet Ege Kaçar, Ruben Weber, Margret Keuper,
- Abstract summary: We provide benchmarking tools regarding robustness to distribution shifts and adversarial manipulations.<n>We benchmark 76 segmentation models across four datasets and 61 object detectors across two datasets.<n>Our findings reveal systematic weaknesses in state-of-the-art models and uncover key trends based on architecture, backbone, and model capacity.
- Score: 9.779472630665982
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliability and generalization in deep learning are predominantly studied in the context of image classification. Yet, real-world applications in safety-critical domains involve a broader set of semantic tasks, such as semantic segmentation and object detection, which come with a diverse set of dedicated model architectures. To facilitate research towards robust model design in segmentation and detection, our primary objective is to provide benchmarking tools regarding robustness to distribution shifts and adversarial manipulations. We propose the benchmarking tools SEMSEGBENCH and DETECBENCH, along with the most extensive evaluation to date on the reliability and generalization of semantic segmentation and object detection models. In particular, we benchmark 76 segmentation models across four datasets and 61 object detectors across two datasets, evaluating their performance under diverse adversarial attacks and common corruptions. Our findings reveal systematic weaknesses in state-of-the-art models and uncover key trends based on architecture, backbone, and model capacity. SEMSEGBENCH and DETECBENCH are open-sourced in our GitHub repository (https://github.com/shashankskagnihotri/benchmarking_reliability_generalization) along with our complete set of total 6139 evaluations. We anticipate the collected data to foster and encourage future research towards improved model reliability beyond classification.
Related papers
- Ensemble-Based Deepfake Detection using State-of-the-Art Models with Robust Cross-Dataset Generalisation [0.0]
Machine learning-based Deepfake detection models have achieved impressive results on benchmark datasets.<n>But their performance often deteriorates significantly when evaluated on out-of-distribution data.<n>In this work, we investigate an ensemble-based approach for improving the generalization of deepfake detection systems.
arXiv Detail & Related papers (2025-07-08T13:54:48Z) - On the Robustness of Human-Object Interaction Detection against Distribution Shift [27.40641711088878]
Human-Object Interaction (HOI) detection has seen substantial advances in recent years.<n>Existing works focus on the standard setting with ideal images and natural distribution, far from practical scenarios with inevitable distribution shifts.<n>In this work, we investigate this issue by benchmarking, analyzing, and enhancing the robustness of HOI detection models under various distribution shifts.
arXiv Detail & Related papers (2025-06-22T13:01:34Z) - Hybrid-Segmentor: A Hybrid Approach to Automated Fine-Grained Crack Segmentation in Civil Infrastructure [52.2025114590481]
We introduce Hybrid-Segmentor, an encoder-decoder based approach that is capable of extracting both fine-grained local and global crack features.
This allows the model to improve its generalization capabilities in distinguish various type of shapes, surfaces and sizes of cracks.
The proposed model outperforms existing benchmark models across 5 quantitative metrics (accuracy 0.971, precision 0.804, recall 0.744, F1-score 0.770, and IoU score 0.630), achieving state-of-the-art status.
arXiv Detail & Related papers (2024-09-04T16:47:16Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - Robustness Analysis on Foundational Segmentation Models [28.01242494123917]
In this work, we perform a robustness analysis of Visual Foundation Models (VFMs) for segmentation tasks.
We benchmark seven state-of-the-art segmentation architectures using 2 different datasets.
Our findings reveal several key insights: VFMs exhibit vulnerabilities to compression-induced corruptions, despite not outpacing all of unimodal models in robustness, multimodal models show competitive resilience in zero-shot scenarios, and VFMs demonstrate enhanced robustness for certain object categories.
arXiv Detail & Related papers (2023-06-15T16:59:42Z) - GEO-Bench: Toward Foundation Models for Earth Monitoring [139.77907168809085]
We propose a benchmark comprised of six classification and six segmentation tasks.
This benchmark will be a driver of progress across a variety of Earth monitoring tasks.
arXiv Detail & Related papers (2023-06-06T16:16:05Z) - GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z) - Benchmarking the Robustness of Instance Segmentation Models [7.1699725781322465]
This paper presents a comprehensive evaluation of instance segmentation models with respect to real-world image corruptions and out-of-domain image collections.<n>We find that group normalization enhances the robustness of networks across corruptions where the image contents stay the same but corruptions are added on top.<n>We also find that single-stage detectors do not generalize well to larger image resolutions than their training size.
arXiv Detail & Related papers (2021-09-02T17:50:07Z) - Estimating the Robustness of Classification Models by the Structure of
the Learned Feature-Space [10.418647759223964]
We argue that fixed testsets are only able to capture a small portion of possible data variations and are thus limited and prone to generate new overfitted solutions.
To overcome these drawbacks, we suggest to estimate the robustness of a model directly from the structure of its learned feature-space.
arXiv Detail & Related papers (2021-06-23T10:52:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.