Competitions in AI -- Robustly Ranking Solvers Using Statistical Resampling
- URL: http://arxiv.org/abs/2308.05062v1
- Date: Wed, 9 Aug 2023 16:47:04 GMT
- Title: Competitions in AI -- Robustly Ranking Solvers Using Statistical Resampling
- Authors: Chris Fawcett, Mauro Vallati, Holger H. Hoos, Alfonso E. Gerevini
- Abstract summary: We show that rankings resulting from the standard interpretation of competition results can be very sensitive to even minor changes in the benchmark instance set used as the basis for assessment.
We introduce a novel approach to statistically meaningful analysis of competition results based on resampling performance data.
Our approach produces confidence intervals of competition scores as well as statistically robust solver rankings with bounded error.
- Score: 9.02080113915613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Solver competitions play a prominent role in assessing and advancing the
state of the art for solving many problems in AI and beyond. Notably, in many
areas of AI, competitions have had substantial impact in guiding research and
applications for many years, and for a solver to be ranked highly in a
competition carries considerable weight. But to what extent can we expect
competition results to generalise to sets of problem instances different from
those used in a particular competition? This is the question we investigate
here, using statistical resampling techniques. We show that the rankings
resulting from the standard interpretation of competition results can be very
sensitive to even minor changes in the benchmark instance set used as the basis
for assessment and can therefore not be expected to carry over to other samples
from the same underlying instance distribution. To address this problem, we
introduce a novel approach to statistically meaningful analysis of competition
results based on resampling performance data. Our approach produces confidence
intervals of competition scores as well as statistically robust solver rankings
with bounded error. Applied to recent SAT, AI planning and computer vision
competitions, our analysis reveals frequent statistical ties in solver
performance as well as some inversions of ranks compared to the official
results based on simple scoring.
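A minimal sketch of the instance-level bootstrap described above, assuming a solvers x instances score matrix in which higher scores are better; the function name, the total-score aggregation, and the first-place frequency statistic are illustrative assumptions rather than the paper's exact procedure:
```python
import numpy as np

def bootstrap_competition_analysis(scores, solver_names, n_resamples=10_000,
                                    alpha=0.05, seed=0):
    """Illustrative sketch: resample benchmark instances with replacement,
    recompute each solver's competition score on every resample, and report
    percentile confidence intervals plus how often each solver ranks first."""
    rng = np.random.default_rng(seed)
    n_solvers, n_instances = scores.shape
    totals = np.empty((n_resamples, n_solvers))
    for b in range(n_resamples):
        idx = rng.integers(0, n_instances, size=n_instances)  # bootstrap sample of instances
        totals[b] = scores[:, idx].sum(axis=1)                # recomputed competition scores
    lo = np.percentile(totals, 100 * alpha / 2, axis=0)
    hi = np.percentile(totals, 100 * (1 - alpha / 2), axis=0)
    win_freq = np.bincount(totals.argmax(axis=1), minlength=n_solvers) / n_resamples
    return {name: {"score": scores[i].sum(), "ci": (lo[i], hi[i]), "p_first": win_freq[i]}
            for i, name in enumerate(solver_names)}

# Toy usage: 3 hypothetical solvers on 100 synthetic instances.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    toy = rng.normal(loc=[[1.0], [0.98], [0.7]], scale=0.3, size=(3, 100))
    for solver, stats in bootstrap_competition_analysis(toy, ["A", "B", "C"]).items():
        print(solver, stats)
```
Overlapping confidence intervals, or several solvers each winning a sizeable fraction of resamples, are the kind of statistical ties the abstract reports for recent SAT, AI planning and computer vision competitions.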
Related papers
- Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition [70.60872754129832]
The first NeurIPS competition on unlearning sought to stimulate the development of novel algorithms.
Nearly 1,200 teams from across the world participated.
We analyze top solutions and delve into discussions on benchmarking unlearning.
arXiv Detail & Related papers (2024-06-13T12:58:00Z)
- Analysis of Systems' Performance in Natural Language Processing Competitions [6.197993866688085]
This manuscript describes an evaluation methodology for statistically analyzing competition results and competitions.
The proposed methodology offers several advantages, including off-the-shelf comparisons with correction mechanisms and the inclusion of confidence intervals.
Our analysis shows the potential usefulness of our methodology for effectively evaluating competition results.
arXiv Detail & Related papers (2024-03-07T17:42:40Z)
- CompeteSMoE -- Effective Training of Sparse Mixture of Experts via Competition [52.2034494666179]
Sparse mixture of experts (SMoE) offers an appealing solution for scaling up model complexity beyond the means of increasing the network's depth or width.
We propose a competition mechanism to address the fundamental challenge of representation collapse.
By routing inputs only to experts with the highest neural response, we show that, under mild assumptions, competition enjoys the same convergence rate as the optimal estimator.
arXiv Detail & Related papers (2024-02-04T15:17:09Z)
- Benchmarking Robustness and Generalization in Multi-Agent Systems: A Case Study on Neural MMO [50.58083807719749]
We present the results of the second Neural MMO challenge, hosted at IJCAI 2022, which received 1600+ submissions.
This competition targets robustness and generalization in multi-agent systems.
We will open-source our benchmark including the environment wrapper, baselines, a visualization tool, and selected policies for further research.
arXiv Detail & Related papers (2023-08-30T07:16:11Z)
- Comparison of classifiers in challenge scheme [12.030094148004176]
This paper analyzes the results of the MeOffendEs@IberLEF 2021 competition.
It proposes making inferences via resampling techniques (bootstrap) to support challenge organizers' decision-making (a generic paired-bootstrap sketch is given after this list).
arXiv Detail & Related papers (2023-05-16T23:38:34Z)
- Uncertainty-Driven Action Quality Assessment [67.20617610820857]
We propose a novel probabilistic model, named Uncertainty-Driven AQA (UD-AQA), to capture the diversity among multiple judge scores.
We generate the estimation of uncertainty for each prediction, which is employed to re-weight AQA regression loss.
Our proposed method achieves competitive results on three benchmarks including the Olympic events MTL-AQA and FineDiving, and the surgical skill JIGSAWS datasets.
arXiv Detail & Related papers (2022-07-29T07:21:15Z)
- A portfolio-based analysis method for competition results [0.8680676599607126]
I will describe a portfolio-based analysis method which can give complementary insights into the performance of participating solvers in a competition.
The method is demonstrated on the results of the MiniZinc Challenges and new insights gained from the portfolio viewpoint are presented.
arXiv Detail & Related papers (2022-05-30T20:20:45Z)
- Towards robust and domain agnostic reinforcement learning competitions [12.731614722371376]
Reinforcement learning competitions have formed the basis for standard research benchmarks.
Despite this, a majority of challenges suffer from the same fundamental problems.
We present a new framework of competition design that promotes the development of algorithms that overcome these barriers.
arXiv Detail & Related papers (2021-06-07T16:15:46Z)
- Multi-Stage Decentralized Matching Markets: Uncertain Preferences and Strategic Behaviors [91.3755431537592]
This article develops a framework for learning optimal strategies in real-world matching markets.
We show that there exists a welfare-versus-fairness trade-off that is characterized by the uncertainty level of acceptance.
We prove that participants can be better off with multi-stage matching compared to single-stage matching.
arXiv Detail & Related papers (2021-02-13T19:25:52Z)
- Analysing Affective Behavior in the First ABAW 2020 Competition [49.90617840789334]
The Affective Behavior Analysis in-the-wild (ABAW) 2020 Competition is the first Competition aiming at automatic analysis of the three main behavior tasks.
We describe this Competition, to be held in conjunction with the IEEE Conference on Face and Gesture Recognition, May 2020, in Buenos Aires, Argentina.
We outline the evaluation metrics, present both the baseline system and the top-3 performing teams' methodologies per Challenge and finally present their obtained results.
arXiv Detail & Related papers (2020-01-30T15:41:14Z)
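Several entries above (the analysis of NLP competition results and the MeOffendEs classifier comparison) lean on the same bootstrap-inference idea as the main paper. Below is a generic, paired-bootstrap sketch for comparing two systems on per-example metrics; the function name and return format are assumptions, and any multiple-comparison correction (e.g. Bonferroni) would be applied on top of it:
```python
import numpy as np

def paired_bootstrap_difference(metric_a, metric_b, n_resamples=10_000,
                                alpha=0.05, seed=0):
    """Illustrative sketch: paired bootstrap over per-example metrics of two
    systems. Returns the observed mean difference (A - B), a percentile
    confidence interval, and the fraction of resamples in which A beats B."""
    a = np.asarray(metric_a, dtype=float)
    b = np.asarray(metric_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(a)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)          # resample examples with replacement, keeping pairs
        diffs[i] = a[idx].mean() - b[idx].mean()  # recompute the metric gap on the resample
    ci = (np.percentile(diffs, 100 * alpha / 2),
          np.percentile(diffs, 100 * (1 - alpha / 2)))
    return {"diff": a.mean() - b.mean(), "ci": ci, "p_a_better": float((diffs > 0).mean())}
```
If the confidence interval on the difference contains zero, the two systems are statistically tied at the chosen level.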