An Approach to Multiple Comparison Benchmark Evaluations that is Stable
Under Manipulation of the Comparate Set
- URL: http://arxiv.org/abs/2305.11921v1
- Date: Fri, 19 May 2023 08:58:55 GMT
- Title: An Approach to Multiple Comparison Benchmark Evaluations that is Stable
Under Manipulation of the Comparate Set
- Authors: Ali Ismail-Fawaz, Angus Dempster, Chang Wei Tan, Matthieu Herrmann,
Lynn Miller, Daniel F. Schmidt, Stefano Berretti, Jonathan Weber, Maxime
Devanne, Germain Forestier, Geoffrey I. Webb
- Abstract summary: We propose a new approach to presenting the results of benchmark comparisons, the Multiple Comparison Matrix (MCM).
MCM prioritizes pairwise comparisons and precludes the means of manipulating experimental results that existing approaches leave open.
MCM is implemented in Python and is publicly available.
- Score: 10.353747919337817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The measurement of progress using benchmark evaluations is ubiquitous in
computer science and machine learning. However, common approaches to analyzing
and presenting the results of benchmark comparisons of multiple algorithms over
multiple datasets, such as the critical difference diagram introduced by
Demšar (2006), have important shortcomings and, we show, are open to both
inadvertent and intentional manipulation. To address these issues, we propose a
new approach to presenting the results of benchmark comparisons, the Multiple
Comparison Matrix (MCM), that prioritizes pairwise comparisons and precludes
the means of manipulating experimental results in existing approaches. MCM can
be used to show the results of an all-pairs comparison, or to show the results
of a comparison between one or more selected algorithms and the state of the
art. MCM is implemented in Python and is publicly available.
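The pairwise idea at the heart of MCM is easy to illustrate. Below is a minimal sketch, assuming per-dataset accuracies aligned across algorithms; the function, data, and layout here are hypothetical and do not reflect the API of the published Python package.

```python
# A minimal sketch of the pairwise-comparison idea behind MCM. It assumes a
# dict mapping each algorithm to per-dataset accuracies aligned by dataset.
# Illustration only; not the API of the published MCM package.
from itertools import combinations

def pairwise_cells(scores):
    """Wins/ties/losses and mean difference for every pair of algorithms."""
    cells = {}
    for a, b in combinations(sorted(scores), 2):
        diffs = [x - y for x, y in zip(scores[a], scores[b])]
        cells[(a, b)] = {
            "wins": sum(d > 0 for d in diffs),
            "ties": sum(d == 0 for d in diffs),
            "losses": sum(d < 0 for d in diffs),
            "mean_diff": sum(diffs) / len(diffs),
        }
    return cells

scores = {  # hypothetical per-dataset accuracies
    "A": [0.90, 0.85, 0.70, 0.88],
    "B": [0.88, 0.86, 0.72, 0.80],
    "C": [0.70, 0.75, 0.71, 0.79],
}
for pair, cell in pairwise_cells(scores).items():
    print(pair, cell)
```

Because each cell is computed from its two algorithms alone, adding or removing a third comparate cannot change any existing cell, whereas rank-based summaries such as the critical difference diagram re-rank all comparates together and are therefore sensitive to the choice of comparate set.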
Related papers
- POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation [76.67608003501479]
We introduce and specify an evaluation protocol defining a range of domain-related metrics computed on the basis of the primary evaluation indicators.
The results of such a comparison, which involves a variety of state-of-the-art MARL, search-based, and hybrid methods, are presented.
arXiv Detail & Related papers (2024-07-20T16:37:21Z)
- Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare [99.57567498494448]
We introduce Compare2Score, an all-around LMM-based no-reference IQA model.
During training, we generate scaled-up comparative instructions by comparing images from the same IQA dataset.
Experiments on nine IQA datasets validate that Compare2Score effectively bridges text-defined comparative levels during training.
arXiv Detail & Related papers (2024-05-29T17:26:09Z)
- Efficient LLM Comparative Assessment: a Product of Experts Framework for Pairwise Comparisons [10.94304714004328]
This paper introduces a Product of Experts (PoE) framework for efficient Comparative Assessment.
Individual comparisons are considered experts that provide information on a pair's score difference.
The PoE framework combines the information from these experts to yield an expression that can be maximized with respect to the underlying set of candidates.
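As a rough illustration of how such experts can be combined, here is a minimal sketch assuming Gaussian experts over observed score differences, in which case maximizing the joint log-likelihood reduces to least squares; the model form and all names below are assumptions, not the paper's exact formulation.

```python
# A minimal sketch of a product-of-experts reading of pairwise comparisons.
# Assumption: each comparison of items (i, j) yields a noisy difference
# d ~ N(s_i - s_j, sigma^2). Multiplying these Gaussian experts and
# maximizing the joint log-likelihood over the latent scores s is then a
# least-squares problem. Illustrative only; not the paper's exact model.
import numpy as np

def poe_scores(n_items, comparisons):
    """comparisons: iterable of (i, j, d), with d an observed estimate of
    s_i - s_j. Returns scores anchored so that they sum to zero."""
    rows, diffs = [], []
    for i, j, d in comparisons:
        row = np.zeros(n_items)
        row[i], row[j] = 1.0, -1.0
        rows.append(row)
        diffs.append(d)
    # Extra row pins the mean score to zero (removes translation ambiguity).
    A = np.vstack(rows + [np.ones(n_items)])
    b = np.array(diffs + [0.0])
    s, *_ = np.linalg.lstsq(A, b, rcond=None)
    return s

comparisons = [(0, 1, 0.8), (1, 2, 0.5), (0, 2, 1.2)]  # hypothetical data
print(poe_scores(3, comparisons))
```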
arXiv Detail & Related papers (2024-05-09T16:45:27Z)
- Performance Evaluation and Comparison of a New Regression Algorithm [4.125187280299247]
We compare the performance of a newly proposed regression algorithm against four conventional machine learning algorithms.
The reader can replicate our results, as the source code is provided in a GitHub repository.
arXiv Detail & Related papers (2023-06-15T13:01:16Z)
- Learning by Sorting: Self-supervised Learning with Group Ordering Constraints [75.89238437237445]
This paper proposes a new variation of the contrastive learning objective, Group Ordering Constraints (GroCo).
It sorts the distances of positive and negative pairs and computes the loss from how many positive pairs have a larger distance than a negative pair and are thus not ordered correctly.
We evaluate the proposed formulation on various self-supervised learning benchmarks and show that it not only improves on vanilla contrastive learning but is also competitive with comparable methods in linear probing and outperforms current methods in k-NN performance.
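To make the ordering idea concrete, here is a minimal sketch that only counts mis-ordered (positive, negative) distance pairs for a single anchor; the paper turns this quantity into a differentiable training loss, and the numbers below are purely illustrative.

```python
# A minimal sketch of the ordering quantity behind GroCo for one anchor:
# count (positive, negative) pairs where the positive sits farther from the
# anchor than the negative, i.e. pairs that are not ordered correctly.
import numpy as np

def ordering_violations(pos_dists, neg_dists):
    """Number of mis-ordered (positive, negative) distance pairs."""
    pos = np.asarray(pos_dists)[:, None]  # shape (P, 1)
    neg = np.asarray(neg_dists)[None, :]  # shape (1, N)
    return int((pos > neg).sum())         # broadcast to (P, N) and count

# One positive is well placed; the other is farther than two negatives.
print(ordering_violations([0.2, 0.9], [0.5, 0.6, 1.0]))  # -> 2
```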
arXiv Detail & Related papers (2023-01-05T11:17:55Z)
- Prasatul Matrix: A Direct Comparison Approach for Analyzing Evolutionary Optimization Algorithms [2.1320960069210475]
A direct comparison approach is proposed to analyze the performance of evolutionary optimization algorithms.
Five different performance measures are designed based on the Prasatul matrix to evaluate the performance of algorithms.
arXiv Detail & Related papers (2022-12-01T17:21:44Z)
- Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of ways to improve the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
- Beyond Supervised vs. Unsupervised: Representative Benchmarking and Analysis of Image Representation Learning [37.81297650369799]
Unsupervised methods for learning image representations have reached impressive results on standard benchmarks.
Many methods with substantially different implementations yield results that seem nearly identical on popular benchmarks.
We compare methods using performance-based benchmarks such as linear evaluation, nearest neighbor classification, and clustering for several different datasets.
arXiv Detail & Related papers (2022-06-16T17:51:19Z)
- Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise Comparisons [85.5955376526419]
In rank aggregation problems, users exhibit various accuracy levels when comparing pairs of items.
We propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons.
We prove that our algorithm can return the true ranking of items with high probability.
arXiv Detail & Related papers (2021-10-08T13:51:55Z)
- The Benchmark Lottery [114.43978017484893]
"A benchmark lottery" describes the overall fragility of the machine learning benchmarking process.
We show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks.
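A minimal sketch of that effect, using hypothetical accuracies and mean-rank aggregation (one common way multi-task comparisons are summarized): dropping two tasks flips the winner.

```python
# A minimal sketch of the "benchmark lottery": with the same per-task
# results, the winner under mean-rank aggregation depends on which tasks
# are included. All numbers below are hypothetical.
def mean_rank(scores, tasks):
    """Average rank (1 = best) of each algorithm over the chosen tasks."""
    names = list(scores)
    totals = {n: 0 for n in names}
    for t in tasks:
        ranked = sorted(names, key=lambda n: -scores[n][t])
        for rank, name in enumerate(ranked, start=1):
            totals[name] += rank
    return {n: totals[n] / len(tasks) for n in names}

scores = {  # accuracy on tasks t0..t3 (hypothetical)
    "A": [0.90, 0.88, 0.60, 0.61],
    "B": [0.85, 0.84, 0.75, 0.74],
    "C": [0.80, 0.79, 0.70, 0.68],
}
print(mean_rank(scores, [0, 1, 2, 3]))  # B has the best (lowest) mean rank
print(mean_rank(scores, [0, 1]))        # drop two tasks and A wins instead
```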
arXiv Detail & Related papers (2021-07-14T21:08:30Z)
- Active Sampling for Pairwise Comparisons via Approximate Message Passing and Information Gain Maximization [5.771869590520189]
We propose ASAP, an active sampling algorithm based on approximate message passing and expected information gain.
We show that ASAP offers the highest accuracy of inferred scores compared to existing methods.
arXiv Detail & Related papers (2020-04-12T20:48:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.