Algorithm Performance Spaces for Strategic Dataset Selection
- URL: http://arxiv.org/abs/2505.01442v1
- Date: Tue, 29 Apr 2025 12:29:52 GMT
- Title: Algorithm Performance Spaces for Strategic Dataset Selection
- Authors: Steffen Schulz,
- Abstract summary: The evaluation of new algorithms in recommender systems frequently depends on publicly available datasets, such as those from MovieLens or Amazon.<n>This thesis introduces the Algorithm Performance Space, a framework designed to differentiate datasets based on the measured performance of algorithms applied to them.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The evaluation of new algorithms in recommender systems frequently depends on publicly available datasets, such as those from MovieLens or Amazon. Some of these datasets are being disproportionately utilized primarily due to their historical popularity as baselines rather than their suitability for specific research contexts. This thesis addresses this issue by introducing the Algorithm Performance Space, a novel framework designed to differentiate datasets based on the measured performance of algorithms applied to them. An experimental study proposes three metrics to quantify and justify dataset selection to evaluate new algorithms. These metrics also validate assumptions about datasets, such as the similarity between MovieLens datasets of varying sizes. By creating an Algorithm Performance Space and using the proposed metrics, differentiating datasets was made possible, and diverse dataset selections could be found. While the results demonstrate the framework's potential, further research proposals and implications are discussed to develop Algorithm Performance Spaces tailored to diverse use cases.
Related papers
- ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs [22.68421108740517]
We propose an efficient gradient-based data selection framework with clustering and a modified Upper Confidence Bound (UCB) algorithm.<n> Experimental results on various benchmarks show that our proposed framework, ClusterUCB, can achieve comparable results to the original gradient-based data selection methods.
arXiv Detail & Related papers (2025-06-12T01:53:01Z) - Adaptive Bounded Exploration and Intermediate Actions for Data Debiasing [18.87576995391638]
We propose algorithms for sequentially debiasing the training dataset through adaptive and bounded exploration.<n>Our proposed algorithms balance between the ultimate goal of mitigating the impacts of data biases -- which will in turn lead to more accurate and fairer decisions.
arXiv Detail & Related papers (2025-04-10T22:22:23Z) - From Variability to Stability: Advancing RecSys Benchmarking Practices [3.3331198926331784]
This paper introduces a novel benchmarking methodology to facilitate a fair and robust comparison of RecSys algorithms.
By utilizing a diverse set of $30$ open datasets, including two introduced in this work, we critically examine the influence of dataset characteristics on algorithm performance.
arXiv Detail & Related papers (2024-02-15T07:35:52Z) - Data Augmentations in Deep Weight Spaces [89.45272760013928]
We introduce a novel augmentation scheme based on the Mixup method.
We evaluate the performance of these techniques on existing benchmarks as well as new benchmarks we generate.
arXiv Detail & Related papers (2023-11-15T10:43:13Z) - Synthetic Data for Feature Selection [5.8010446129208155]
We propose a collection of synthetic datasets that can be used as a common reference point for feature selection algorithms.
The proposed datasets are based on applications from electronics in order to mimic real life scenarios.
The datasets are made publicly available on GitHub and can be used by researchers to evaluate feature selection algorithms.
arXiv Detail & Related papers (2022-11-06T05:57:01Z) - Early Time-Series Classification Algorithms: An Empirical Comparison [59.82930053437851]
Early Time-Series Classification (ETSC) is the task of predicting the class of incoming time-series by observing as few measurements as possible.
We evaluate six existing ETSC algorithms on publicly available data, as well as on two newly introduced datasets.
arXiv Detail & Related papers (2022-03-03T10:43:56Z) - Compactness Score: A Fast Filter Method for Unsupervised Feature
Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named as, Compactness Score (CSUFS) to select desired features.
Our proposed algorithm seems to be more accurate and efficient compared with existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z) - Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z) - A Framework and Benchmarking Study for Counterfactual Generating Methods
on Tabular Data [0.0]
Counterfactual explanations are viewed as an effective way to explain machine learning predictions.
There are already dozens of algorithms aiming to generate such explanations.
benchmarking study and framework can help practitioners in determining which technique and building blocks most suit their context.
arXiv Detail & Related papers (2021-07-09T21:06:03Z) - Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z) - dMelodies: A Music Dataset for Disentanglement Learning [70.90415511736089]
We present a new symbolic music dataset that will help researchers demonstrate the efficacy of their algorithms on diverse domains.
This will also provide a means for evaluating algorithms specifically designed for music.
The dataset is large enough (approx. 1.3 million data points) to train and test deep networks for disentanglement learning.
arXiv Detail & Related papers (2020-07-29T19:20:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.