An Item Response Theory-based R Module for Algorithm Portfolio Analysis
- URL: http://arxiv.org/abs/2408.14025v2
- Date: Tue, 27 Aug 2024 04:36:52 GMT
- Title: An Item Response Theory-based R Module for Algorithm Portfolio Analysis
- Authors: Brodie Oldfield, Sevvandi Kandanaarachchi, Ziqi Xu, Mario Andrés Muñoz
- Abstract summary: This paper introduces an Item Response Theory-based analysis tool for algorithm portfolio evaluation called AIRT-Module.
Adapting IRT to algorithm evaluation, the AIRT-Module contains a Shiny web application and the R package airt.
The strengths and weaknesses of algorithms are visualised using the difficulty spectrum of the test instances.
- Score: 2.8642825441965645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Experimental evaluation is crucial in AI research, especially for assessing algorithms across diverse tasks. Many studies often evaluate a limited set of algorithms, failing to fully understand their strengths and weaknesses within a comprehensive portfolio. This paper introduces an Item Response Theory (IRT) based analysis tool for algorithm portfolio evaluation called AIRT-Module. Traditionally used in educational psychometrics, IRT models test question difficulty and student ability using responses to test questions. Adapting IRT to algorithm evaluation, the AIRT-Module contains a Shiny web application and the R package airt. AIRT-Module uses algorithm performance measures to compute anomalousness, consistency, and difficulty limits for an algorithm and the difficulty of test instances. The strengths and weaknesses of algorithms are visualised using the difficulty spectrum of the test instances. AIRT-Module offers a detailed understanding of algorithm capabilities across varied test instances, thus enhancing comprehensive AI method assessment. It is available at https://sevvandi.shinyapps.io/AIRT/ .
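To ground the abstract's description, the sketch below shows one way the same workflow might be driven directly from R with the airt package: arrange performance measures as an instances-by-algorithms matrix, fit the continuous IRT model, read off each algorithm's anomalousness, consistency and difficulty limit, and run the latent trait analysis behind the difficulty-spectrum view. The data frame perf is invented for illustration, and the call signatures shown (cirtmodel, latent_trait_analysis, autoplot) are assumptions based on the package's documented workflow rather than a verified transcript; the airt help pages and the AIRT-Module Shiny app are the authoritative interface.

```r
# Minimal sketch of the AIRT-Module workflow with the airt R package.
# NOTE: the data are invented, and the exact signatures of cirtmodel(),
# latent_trait_analysis() and autoplot() are assumptions based on the
# package's documented workflow -- check ?cirtmodel before relying on them.

# install.packages("airt")   # CRAN package accompanying the paper
library(airt)
library(ggplot2)

# Hypothetical performance data: 100 test instances (rows) x 4 algorithms
# (columns), with performance measures scaled to [0, 1].
set.seed(1)
perf <- data.frame(
  algo_A = runif(100, 0.40, 1.00),
  algo_B = runif(100, 0.20, 0.90),
  algo_C = runif(100, 0.00, 0.70),
  algo_D = runif(100, 0.50, 0.80)
)

# Fit a continuous IRT model, treating algorithms as "items" and test
# instances as "respondents".
mod <- cirtmodel(perf)

# Algorithm-level characteristics (anomalousness, consistency, difficulty
# limit) are derived from the fitted item parameters.
print(mod$model$param)

# Latent trait analysis orders test instances along a difficulty spectrum,
# which is what the strengths-and-weaknesses plots are drawn from.
lta <- latent_trait_analysis(perf, mod$model$param)
autoplot(lta)
```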
Related papers
- Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation [9.390902237835457]
We propose a new method to measure the task-specific accuracy of Retrieval-Augmented Large Language Models (RAG).
Evaluation is performed by scoring the RAG on an automatically-generated synthetic exam composed of multiple choice questions.
arXiv Detail & Related papers (2024-05-22T13:14:11Z)
- Comprehensive Algorithm Portfolio Evaluation using Item Response Theory [0.19116784879310023]
IRT has been applied to evaluate machine learning algorithm performance on a single classification dataset.
We present a modified IRT-based framework for evaluating a portfolio of algorithms across a repository of datasets.
arXiv Detail & Related papers (2023-07-29T00:48:29Z)
- Multi-Dimensional Ability Diagnosis for Machine Learning Algorithms [88.93372675846123]
We propose Camilla, a task-agnostic evaluation framework for machine learning algorithms.
We use cognitive diagnosis assumptions and neural networks to learn the complex interactions among algorithms, samples and the skills of each sample.
In our experiments, Camilla outperforms state-of-the-art baselines in metric reliability, rank consistency, and rank stability.
arXiv Detail & Related papers (2023-07-14T03:15:56Z)
- Representation Learning with Multi-Step Inverse Kinematics: An Efficient and Optimal Approach to Rich-Observation RL [106.82295532402335]
Existing reinforcement learning algorithms suffer from computational intractability, strong statistical assumptions, and suboptimal sample complexity.
We provide the first computationally efficient algorithm that attains rate-optimal sample complexity with respect to the desired accuracy level.
Our algorithm, MusIK, combines systematic exploration with representation learning based on multi-step inverse kinematics.
arXiv Detail & Related papers (2023-04-12T14:51:47Z)
- A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
arXiv Detail & Related papers (2023-03-23T16:15:03Z)
- Do We Need Another Explainable AI Method? Toward Unifying Post-hoc XAI Evaluation Methods into an Interactive and Multi-dimensional Benchmark [6.511859672210113]
We propose Compare-xAI, a benchmark that unifies all exclusive functional testing methods applied to xAI algorithms.
The benchmark encapsulates the complexity of evaluating xAI methods into a hierarchical scoring of three levels.
The interactive user interface helps mitigate errors in interpreting xAI results.
arXiv Detail & Related papers (2022-06-08T06:13:39Z)
- Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select the presumably best algorithm from a fixed set of candidate algorithms.
For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime.
In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem.
We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon.
arXiv Detail & Related papers (2021-09-13T18:10:52Z)
- Towards Optimally Efficient Tree Search with Deep Learning [76.64632985696237]
This paper investigates the classical integer least-squares problem, which estimates integer signals from linear models.
The problem is NP-hard and often arises in diverse applications such as signal processing, bioinformatics, communications and machine learning.
We propose a general hyper-accelerated tree search (HATS) algorithm by employing a deep neural network to estimate the optimal heuristic for the underlying simplified memory-bounded A* algorithm.
arXiv Detail & Related papers (2021-01-07T08:00:02Z)
- DERAIL: Diagnostic Environments for Reward And Imitation Learning [9.099589602551573]
We develop a suite of diagnostic tasks that test individual facets of algorithm performance in isolation.
Results confirm that algorithm performance is highly sensitive to implementation details.
A case study shows how the suite can pinpoint design flaws and rapidly evaluate candidate solutions.
arXiv Detail & Related papers (2020-12-02T18:07:09Z)
- Measuring the Complexity of Domains Used to Evaluate AI Systems [0.48951183832371004]
We propose a theory for measuring the complexity between varied domains.
An application of this measure is then demonstrated to show its effectiveness as a tool in varied situations.
We propose the future use of such a complexity metric in computing an AI system's intelligence.
arXiv Detail & Related papers (2020-09-18T21:53:07Z)
- Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning [96.78504087416654]
Motivated by the prevailing paradigm of using unsupervised learning for efficient exploration in reinforcement learning (RL) problems, we investigate when this paradigm is provably efficient.
We present a general algorithmic framework that is built upon two components: an unsupervised learning algorithm and a no-regret tabular RL algorithm.
arXiv Detail & Related papers (2020-03-15T19:23:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.