An Approach to Multiple Comparison Benchmark Evaluations that is Stable
Under Manipulation of the Comparate Set
- URL: http://arxiv.org/abs/2305.11921v1
- Date: Fri, 19 May 2023 08:58:55 GMT
- Title: An Approach to Multiple Comparison Benchmark Evaluations that is Stable
Under Manipulation of the Comparate Set
- Authors: Ali Ismail-Fawaz, Angus Dempster, Chang Wei Tan, Matthieu Herrmann,
Lynn Miller, Daniel F. Schmidt, Stefano Berretti, Jonathan Weber, Maxime
Devanne, Germain Forestier, Geoffrey I. Webb
- Abstract summary: We propose a new approach to presenting the results of benchmark comparisons, the Multiple Comparison Matrix (MCM).
MCM prioritizes pairwise comparisons and precludes the means of manipulating experimental results that existing approaches leave open.
MCM is implemented in Python and is publicly available.
- Score: 10.353747919337817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The measurement of progress using benchmark evaluations is ubiquitous in
computer science and machine learning. However, common approaches to analyzing
and presenting the results of benchmark comparisons of multiple algorithms over
multiple datasets, such as the critical difference diagram introduced by
Demšar (2006), have important shortcomings and, we show, are open to both
inadvertent and intentional manipulation. To address these issues, we propose a
new approach to presenting the results of benchmark comparisons, the Multiple
Comparison Matrix (MCM), that prioritizes pairwise comparisons and precludes
the means of manipulating experimental results in existing approaches. MCM can
be used to show the results of an all-pairs comparison, or to show the results
of a comparison between one or more selected algorithms and the state of the
art. MCM is implemented in Python and is publicly available.
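The pairwise idea at the heart of MCM is easy to illustrate. Below is a minimal sketch, assuming per-dataset accuracies aligned across algorithms; the function, data, and layout here are hypothetical and do not reflect the API of the published Python package.

```python
# A minimal sketch of the pairwise-comparison idea behind MCM. It assumes a
# dict mapping each algorithm to per-dataset accuracies aligned by dataset.
# Illustration only; not the API of the published MCM package.
from itertools import combinations

def pairwise_cells(scores):
    """Wins/ties/losses and mean difference for every pair of algorithms."""
    cells = {}
    for a, b in combinations(sorted(scores), 2):
        diffs = [x - y for x, y in zip(scores[a], scores[b])]
        cells[(a, b)] = {
            "wins": sum(d > 0 for d in diffs),
            "ties": sum(d == 0 for d in diffs),
            "losses": sum(d < 0 for d in diffs),
            "mean_diff": sum(diffs) / len(diffs),
        }
    return cells

scores = {  # hypothetical per-dataset accuracies
    "A": [0.90, 0.85, 0.70, 0.88],
    "B": [0.88, 0.86, 0.72, 0.80],
    "C": [0.70, 0.75, 0.71, 0.79],
}
for pair, cell in pairwise_cells(scores).items():
    print(pair, cell)
```

Because each cell is computed from its two algorithms alone, adding or removing a third comparate cannot change any existing cell, whereas rank-based summaries such as the critical difference diagram re-rank all comparates together and are therefore sensitive to the choice of comparate set.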
Related papers
- POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation [76.67608003501479]
We introduce and specify an evaluation protocol defining a range of domain-related metrics computed on the basis of the primary evaluation indicators.
The results of such a comparison, which involves a variety of state-of-the-art MARL, search-based, and hybrid methods, are presented.
arXiv Detail & Related papers (2024-07-20T16:37:21Z)
- Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare [99.57567498494448]
We introduce Compare2Score, an all-around LMM-based no-reference IQA model.
During training, we generate scaled-up comparative instructions by comparing images from the same IQA dataset.
Experiments on nine IQA datasets validate that Compare2Score effectively bridges text-defined comparative levels during training.
arXiv Detail & Related papers (2024-05-29T17:26:09Z)
- Efficient LLM Comparative Assessment: a Product of Experts Framework for Pairwise Comparisons [10.94304714004328]
This paper introduces a Product of Experts (PoE) framework for efficient Comparative Assessment.
Individual comparisons are considered experts that provide information on a pair's score difference.
The PoE framework combines the information from these experts to yield an expression that can be maximized with respect to the underlying set of candidates.
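As a rough illustration of how such experts can be combined, here is a minimal sketch assuming Gaussian experts over observed score differences, in which case maximizing the joint log-likelihood reduces to least squares; the model form and all names below are assumptions, not the paper's exact formulation.

```python
# A minimal sketch of a product-of-experts reading of pairwise comparisons.
# Assumption: each comparison of items (i, j) yields a noisy difference
# d ~ N(s_i - s_j, sigma^2). Multiplying these Gaussian experts and
# maximizing the joint log-likelihood over the latent scores s is then a
# least-squares problem. Illustrative only; not the paper's exact model.
import numpy as np

def poe_scores(n_items, comparisons):
    """comparisons: iterable of (i, j, d), with d an observed estimate of
    s_i - s_j. Returns scores anchored so that they sum to zero."""
    rows, diffs = [], []
    for i, j, d in comparisons:
        row = np.zeros(n_items)
        row[i], row[j] = 1.0, -1.0
        rows.append(row)
        diffs.append(d)
    # Extra row pins the mean score to zero (removes translation ambiguity).
    A = np.vstack(rows + [np.ones(n_items)])
    b = np.array(diffs + [0.0])
    s, *_ = np.linalg.lstsq(A, b, rcond=None)
    return s

comparisons = [(0, 1, 0.8), (1, 2, 0.5), (0, 2, 1.2)]  # hypothetical data
print(poe_scores(3, comparisons))
```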
arXiv Detail & Related papers (2024-05-09T16:45:27Z)
- Performance Evaluation and Comparison of a New Regression Algorithm [4.125187280299247]
We compare the performance of a newly proposed regression algorithm against four conventional machine learning algorithms.
The reader can replicate our results, as the source code is provided in a GitHub repository.
arXiv Detail & Related papers (2023-06-15T13:01:16Z)
- Learning by Sorting: Self-supervised Learning with Group Ordering Constraints [75.89238437237445]
This paper proposes a new variation of the contrastive learning objective, Group Ordering Constraints (GroCo).
It sorts the distances of positive and negative pairs and computes the loss from how many positive pairs have a larger distance than a negative pair and are thus not ordered correctly.
We evaluate the proposed formulation on various self-supervised learning benchmarks and show that it not only improves on vanilla contrastive learning but is also competitive with comparable methods in linear probing and outperforms current methods in k-NN performance.
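To make the ordering idea concrete, here is a minimal sketch that only counts mis-ordered (positive, negative) distance pairs for a single anchor; the paper turns this quantity into a differentiable training loss, and the numbers below are purely illustrative.

```python
# A minimal sketch of the ordering quantity behind GroCo for one anchor:
# count (positive, negative) pairs where the positive sits farther from the
# anchor than the negative, i.e. pairs that are not ordered correctly.
import numpy as np

def ordering_violations(pos_dists, neg_dists):
    """Number of mis-ordered (positive, negative) distance pairs."""
    pos = np.asarray(pos_dists)[:, None]  # shape (P, 1)
    neg = np.asarray(neg_dists)[None, :]  # shape (1, N)
    return int((pos > neg).sum())         # broadcast to (P, N) and count

# One positive is well placed; the other is farther than two negatives.
print(ordering_violations([0.2, 0.9], [0.5, 0.6, 1.0]))  # -> 2
```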
arXiv Detail & Related papers (2023-01-05T11:17:55Z)
- Prasatul Matrix: A Direct Comparison Approach for Analyzing Evolutionary Optimization Algorithms [2.1320960069210475]
A direct comparison approach is proposed to analyze the performance of evolutionary optimization algorithms.
Five different performance measures are designed based on the Prasatul matrix to evaluate the performance of algorithms.
arXiv Detail & Related papers (2022-12-01T17:21:44Z)
- Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of ways to improve the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
- Beyond Supervised vs. Unsupervised: Representative Benchmarking and Analysis of Image Representation Learning [37.81297650369799]
Unsupervised methods for learning image representations have reached impressive results on standard benchmarks.
Many methods with substantially different implementations yield results that seem nearly identical on popular benchmarks.
We compare methods using performance-based benchmarks such as linear evaluation, nearest neighbor classification, and clustering for several different datasets.
arXiv Detail & Related papers (2022-06-16T17:51:19Z)
- Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise Comparisons [85.5955376526419]
In rank aggregation problems, users exhibit various accuracy levels when comparing pairs of items.
We propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons.
We prove that our algorithm can return the true ranking of items with high probability.
arXiv Detail & Related papers (2021-10-08T13:51:55Z)
- The Benchmark Lottery [114.43978017484893]
"A benchmark lottery" describes the overall fragility of the machine learning benchmarking process.
We show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks.
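A minimal sketch of that effect, using hypothetical accuracies and mean-rank aggregation (one common way multi-task comparisons are summarized): dropping two tasks flips the winner.

```python
# A minimal sketch of the "benchmark lottery": with the same per-task
# results, the winner under mean-rank aggregation depends on which tasks
# are included. All numbers below are hypothetical.
def mean_rank(scores, tasks):
    """Average rank (1 = best) of each algorithm over the chosen tasks."""
    names = list(scores)
    totals = {n: 0 for n in names}
    for t in tasks:
        ranked = sorted(names, key=lambda n: -scores[n][t])
        for rank, name in enumerate(ranked, start=1):
            totals[name] += rank
    return {n: totals[n] / len(tasks) for n in names}

scores = {  # accuracy on tasks t0..t3 (hypothetical)
    "A": [0.90, 0.88, 0.60, 0.61],
    "B": [0.85, 0.84, 0.75, 0.74],
    "C": [0.80, 0.79, 0.70, 0.68],
}
print(mean_rank(scores, [0, 1, 2, 3]))  # B has the best (lowest) mean rank
print(mean_rank(scores, [0, 1]))        # drop two tasks and A wins instead
```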
arXiv Detail & Related papers (2021-07-14T21:08:30Z)
- Active Sampling for Pairwise Comparisons via Approximate Message Passing and Information Gain Maximization [5.771869590520189]
We propose ASAP, an active sampling algorithm based on approximate message passing and expected information gain.
We show that ASAP offers the highest accuracy of inferred scores compared to existing methods.
arXiv Detail & Related papers (2020-04-12T20:48:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.