A Review of Benchmarks for Visual Defect Detection in the Manufacturing Industry
- URL: http://arxiv.org/abs/2305.13261v1
- Date: Fri, 5 May 2023 07:44:23 GMT
- Title: A Review of Benchmarks for Visual Defect Detection in the Manufacturing Industry
- Authors: Philippe Carvalho (Roberval), Alexandre Durupt (Roberval), Yves Grandvalet (Heudiasyc)
- Abstract summary: We propose a study of existing benchmarks to compare and expose their characteristics and their use-cases.
A study of industrial metrics requirements, as well as testing procedures, will be presented and applied to the studied benchmarks.
- Score: 63.52264764099532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The field of industrial defect detection using machine learning and deep
learning is a subject of active research. Datasets, also called benchmarks, are
used to compare and assess research results. There are a number of datasets of
varying quality in industrial visual inspection, so it is a difficult task to
determine which dataset to use. Generally speaking, datasets that include a
testing set, provide precise labeling, and are made in real-world conditions
should be preferred. We propose a study of existing benchmarks to compare and expose
their characteristics and their use-cases. A study of industrial metrics
requirements, as well as testing procedures, will be presented and applied to
the studied benchmarks. We discuss our findings by examining the current state
of benchmarks for industrial visual inspection, and by exposing guidelines on
the usage of benchmarks.
Related papers
- ECBD: Evidence-Centered Benchmark Design for NLP [95.50252564938417]
We propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules.
Each module requires benchmark designers to describe, justify, and support benchmark design choices.
Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
arXiv Detail & Related papers (2024-06-13T00:59:55Z)
- Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z)
- Benchmarking Benchmark Leakage in Large Language Models [24.015208839742343]
We introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on a benchmark.
We reveal substantial instances of training-set and even test-set misuse, resulting in potentially unfair comparisons.
We propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization.
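The N-gram accuracy metric mentioned above can be illustrated with a toy sketch. The helper names below (`ngram_accuracy`, `make_memorizer`) and the prefix-memorizing stand-in predictor are illustrative assumptions, not the paper's actual pipeline, which scores the predictions of real LLMs.

```python
def make_memorizer(train_tokens):
    """Toy stand-in for a model that memorized its training text:
    maps each exact token prefix to the token that followed it."""
    table = {}
    for i in range(len(train_tokens) - 1):
        table[tuple(train_tokens[:i + 1])] = train_tokens[i + 1]
    return lambda context: table.get(context)

def ngram_accuracy(predict_next, tokens, n=3):
    """Fraction of n-grams in `tokens` that the model reproduces
    exactly from the preceding context; a score near 1.0 on a
    benchmark's test split hints that the split leaked into training."""
    hits, total = 0, 0
    for i in range(len(tokens) - n):
        context = list(tokens[:i + 1])
        ok = True
        for j in range(n):  # predict the next n tokens greedily
            if predict_next(tuple(context)) != tokens[i + 1 + j]:
                ok = False
                break
            context.append(tokens[i + 1 + j])
        hits += ok
        total += 1
    return hits / max(total, 1)

benchmark = "the quick brown fox jumps over the lazy dog".split()
leaked = make_memorizer(benchmark)  # "trained" on the test split itself
clean = make_memorizer("an unrelated training corpus".split())
print(ngram_accuracy(leaked, benchmark))  # 1.0: perfect recall flags leakage
print(ngram_accuracy(clean, benchmark))   # 0.0: no memorized overlap
```

A contaminated model reconstructs benchmark n-grams far above chance, which is the signal the detection pipeline exploits.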
arXiv Detail & Related papers (2024-04-29T16:05:36Z)
- TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs [12.839640915518443]
Benchmarking is the de facto standard for evaluating LLMs, due to its speed, replicability, and low cost.
Recent work has pointed out that the majority of the open source benchmarks available today have been contaminated or leaked into LLMs.
We propose Private Benchmarking, a solution where test datasets are kept private and models are evaluated without revealing the test data to the model.
arXiv Detail & Related papers (2024-03-01T09:28:38Z)
- TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction [131.7684896032888]
We present TextEE, a standardized, fair, and reproducible benchmark for event extraction.
TextEE comprises standardized data preprocessing scripts and splits for 16 datasets spanning eight diverse domains.
We evaluate five varied large language models on our TextEE benchmark and demonstrate how they struggle to achieve satisfactory performance.
arXiv Detail & Related papers (2023-11-16T04:43:03Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection [55.70982767084996]
A critical yet frequently overlooked challenge in the field of deepfake detection is the lack of a standardized, unified, comprehensive benchmark.
We present the first comprehensive benchmark for deepfake detection, called DeepfakeBench, which offers three key contributions.
DeepfakeBench contains 15 state-of-the-art detection methods, 9 deepfake datasets, a series of deepfake detection evaluation protocols and analysis tools, as well as comprehensive evaluations.
arXiv Detail & Related papers (2023-07-04T01:34:41Z)
- Re-Benchmarking Pool-Based Active Learning for Binary Classification [27.034593234956713]
Active learning is a paradigm that significantly enhances the performance of machine learning models when acquiring labeled data.
While several benchmarks exist for evaluating active learning strategies, their findings exhibit some misalignment.
This discrepancy motivates us to develop a transparent and reproducible benchmark for the community.
arXiv Detail & Related papers (2023-06-15T08:47:50Z)
- AI applications in forest monitoring need remote sensing benchmark datasets [0.0]
We present requirements and considerations for the creation of rigorous, useful benchmarking datasets for forest monitoring applications.
We list a set of example large-scale datasets that could contribute to benchmarking, and present a vision for how community-driven, representative benchmarking initiatives could benefit the field.
arXiv Detail & Related papers (2022-12-20T01:11:40Z)
- Personalized Benchmarking with the Ludwig Benchmarking Toolkit [12.347185532330919]
Ludwig Benchmarking Toolkit (LBT) is a personalized benchmarking toolkit for running end-to-end benchmark studies.
LBT provides an interface for controlling training and customizing evaluation, a standardized training framework for eliminating confounding variables, and support for multi-objective evaluation.
We show how LBT can be used to create personalized benchmark studies with a large-scale comparative analysis for text classification across 7 models and 9 datasets.
arXiv Detail & Related papers (2021-11-08T03:53:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.