A Review of Benchmarks for Visual Defect Detection in the Manufacturing
Industry
- URL: http://arxiv.org/abs/2305.13261v1
- Date: Fri, 5 May 2023 07:44:23 GMT
- Title: A Review of Benchmarks for Visual Defect Detection in the Manufacturing
Industry
- Authors: Philippe Carvalho (Roberval), Alexandre Durupt (Roberval), Yves
Grandvalet (Heudiasyc)
- Abstract summary: We propose a study of existing benchmarks to compare and expose their characteristics and their use-cases.
A study of industrial metrics requirements, as well as testing procedures, will be presented and applied to the studied benchmarks.
- Score: 63.52264764099532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The field of industrial defect detection using machine learning and deep
learning is a subject of active research. Datasets, also called benchmarks, are
used to compare and assess research results. There are a number of datasets in
industrial visual inspection, of varying quality, so it is difficult
to determine which dataset to use. Generally speaking, datasets that include a
testing set, have precise labeling, and were made in real-world conditions should be
preferred. We propose a study of existing benchmarks to compare and expose
their characteristics and their use-cases. A study of industrial metrics
requirements, as well as testing procedures, will be presented and applied to
the studied benchmarks. We discuss our findings by examining the current state
of benchmarks for industrial visual inspection, and by exposing guidelines on
the usage of benchmarks.
Related papers
- Benchmark Data Repositories for Better Benchmarking [26.15831504718431]
In machine learning research, it is common to evaluate algorithms via their performance on benchmark datasets.
We analyze the landscape of these benchmark data repositories and the role they can play in improving benchmarking.
(2024-10-31T16:30:08Z)
- Do Text-to-Vis Benchmarks Test Real Use of Visualisations? [11.442971909006657]
This paper investigates whether benchmarks reflect real-world use through an empirical study comparing benchmark datasets with code from public repositories.
Our findings reveal a substantial gap, with evaluations not testing the same distribution of chart types, attributes, and actions as real-world examples.
One dataset is representative, but requires extensive modification to become a practical end-to-end benchmark.
This shows that new benchmarks are needed to support the development of systems that truly address users' visualisation needs.
(2024-07-29T06:13:28Z)
- ECBD: Evidence-Centered Benchmark Design for NLP [95.50252564938417]
We propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules.
Each module requires benchmark designers to describe, justify, and support benchmark design choices.
Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
(2024-06-13T00:59:55Z)
- Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
(2024-05-18T02:21:32Z)
- Benchmarking Benchmark Leakage in Large Language Models [24.015208839742343]
We introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on a benchmark.
We reveal substantial instances of training-set and even test-set misuse, resulting in potentially unfair comparisons.
We propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization.
(2024-04-29T16:05:36Z)
- TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs [12.839640915518443]
Benchmarking is the de-facto standard for evaluating LLMs, due to its speed, replicability and low cost.
Recent work has pointed out that the majority of the open source benchmarks available today have been contaminated or leaked into LLMs.
We propose Private Benchmarking, a solution where test datasets are kept private and models are evaluated without revealing the test data to the model.
(2024-03-01T09:28:38Z)
- Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models.
This synthetic data is employed to evaluate the robustness of pretrained segmenters.
We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
(2023-12-14T18:56:07Z)
- TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction [131.7684896032888]
We present TextEE, a standardized, fair, and reproducible benchmark for event extraction.
TextEE comprises standardized data preprocessing scripts and splits for 16 datasets spanning eight diverse domains.
We evaluate five varied large language models on our TextEE benchmark and demonstrate how they struggle to achieve satisfactory performance.
(2023-11-16T04:43:03Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
(2023-11-03T14:59:54Z)
- AI applications in forest monitoring need remote sensing benchmark datasets [0.0]
We present requirements and considerations for the creation of rigorous, useful benchmarking datasets for forest monitoring applications.
We list a set of example large-scale datasets that could contribute to benchmarking, and present a vision for how community-driven, representative benchmarking initiatives could benefit the field.
(2022-12-20T01:11:40Z)