Related papers: A Review of Benchmarks for Visual Defect Detection in the Manufacturing Industry

URL: http://arxiv.org/abs/2305.13261v1
Date: Fri, 5 May 2023 07:44:23 GMT
Title: A Review of Benchmarks for Visual Defect Detection in the Manufacturing Industry
Authors: Philippe Carvalho (Roberval), Alexandre Durupt (Roberval), Yves Grandvalet (Heudiasyc)
Abstract summary: We propose a study of existing benchmarks to compare and expose their characteristics and their use-cases. A study of industrial metrics requirements, as well as testing procedures, will be presented and applied to the studied benchmarks.
Score: 63.52264764099532
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The field of industrial defect detection using machine learning and deep learning is a subject of active research. Datasets, also called benchmarks, are used to compare and assess research results. There are a number of datasets in industrial visual inspection, of varying quality. Thus, it is a difficult task to determine which dataset to use. Generally speaking, datasets which include a testing set, with precise labeling and made in real-world conditions should be preferred. We propose a study of existing benchmarks to compare and expose their characteristics and their use-cases. A study of industrial metrics requirements, as well as testing procedures, will be presented and applied to the studied benchmarks. We discuss our findings by examining the current state of benchmarks for industrial visual inspection, and by exposing guidelines on the usage of benchmarks.

Related papers

Beyond Academic Benchmarks: Critical Analysis and Best Practices for Visual Industrial Anomaly Detection [40.174488947319645]
Anomaly detection (AD) is essential for automating visual inspection in manufacturing. This paper makes three key contributions: (1) we demonstrate the importance of real-world datasets and establish benchmarks using actual production data; (2) we provide a fair comparison of existing SOTA methods across diverse tasks by utilizing metrics that are valuable for practical applications; and (3) we present a comprehensive analysis of recent advancements in this field by discussing important challenges and new perspectives for bridging the academia-industry gap.
arXiv Detail & Related papers (2025-03-30T14:11:46Z)
How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs [60.25940747590386]
We propose How2Bench, a 55-criteria checklist that serves as a set of guidelines to govern the development of code-related benchmarks comprehensively. We profiled 274 benchmarks released within the past decade and found concerning issues. Nearly 70% of the benchmarks did not take measures for data quality assurance; over 10% were not open-sourced or were only partially open-sourced.
arXiv Detail & Related papers (2025-01-18T09:51:57Z)
More than Marketing? On the Information Value of AI Benchmarks for Practitioners [42.73526862595375]
In academia, public benchmarks were generally viewed as suitable measures for capturing research progress. In product and policy, benchmarks were often found to be inadequate for informing substantive decisions. We conclude that effective benchmarks should provide meaningful, real-world evaluations, incorporate domain expertise, and maintain transparency in scope and goals.
arXiv Detail & Related papers (2024-12-07T03:35:39Z)
Benchmark Data Repositories for Better Benchmarking [26.15831504718431]
In machine learning research, it is common to evaluate algorithms via their performance on benchmark datasets. We analyze the landscape of these benchmark data repositories and the role they can play in improving benchmarking.
arXiv Detail & Related papers (2024-10-31T16:30:08Z)
Do Text-to-Vis Benchmarks Test Real Use of Visualisations? [11.442971909006657]
This paper investigates whether benchmarks reflect real-world use through an empirical study comparing benchmark datasets with code from public repositories. Our findings reveal a substantial gap, with evaluations not testing the same distribution of chart types, attributes, and actions as real-world examples. One dataset is representative, but requires extensive modification to become a practical end-to-end benchmark. This shows that new benchmarks are needed to support the development of systems that truly address users' visualisation needs.
arXiv Detail & Related papers (2024-07-29T06:13:28Z)
ECBD: Evidence-Centered Benchmark Design for NLP [95.50252564938417]
We propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules. Each module requires benchmark designers to describe, justify, and support benchmark design choices. Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
arXiv Detail & Related papers (2024-06-13T00:59:55Z)
Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context. We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions. We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z)
Benchmarking Benchmark Leakage in Large Language Models [24.015208839742343]
We introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on a benchmark. We reveal substantial instances of training-set and even test-set misuse, resulting in potentially unfair comparisons. We propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization.
arXiv Detail & Related papers (2024-04-29T16:05:36Z)
TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs [12.839640915518443]
Benchmarking is the de-facto standard for evaluating LLMs, due to its speed, replicability and low cost. Recent work has pointed out that the majority of the open source benchmarks available today have been contaminated or leaked into LLMs. We propose Private Benchmarking, a solution where test datasets are kept private and models are evaluated without revealing the test data to the model.
arXiv Detail & Related papers (2024-03-01T09:28:38Z)
Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models. This synthetic data is employed to evaluate the robustness of pretrained segmenters. We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
arXiv Detail & Related papers (2023-12-14T18:56:07Z)
TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction [131.7684896032888]
We present TextEE, a standardized, fair, and reproducible benchmark for event extraction. TextEE comprises standardized data preprocessing scripts and splits for 16 datasets spanning eight diverse domains. We evaluate five varied large language models on our TextEE benchmark and demonstrate how they struggle to achieve satisfactory performance.
arXiv Detail & Related papers (2023-11-16T04:43:03Z)
Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity. To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs. We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
AI applications in forest monitoring need remote sensing benchmark datasets [0.0]
We present requirements and considerations for the creation of rigorous, useful benchmarking datasets for forest monitoring applications. We list a set of example large-scale datasets that could contribute to benchmarking, and present a vision for how community-driven, representative benchmarking initiatives could benefit the field.
arXiv Detail & Related papers (2022-12-20T01:11:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.