Optimal Dataset Size for Recommender Systems: Evaluating Algorithms' Performance via Downsampling
- URL: http://arxiv.org/abs/2502.08845v2
- Date: Fri, 14 Feb 2025 09:35:58 GMT
- Title: Optimal Dataset Size for Recommender Systems: Evaluating Algorithms' Performance via Downsampling
- Authors: Ardalan Arabzadeh
- Abstract summary: This thesis investigates dataset downsampling as a strategy to optimize energy efficiency in recommender systems.
By applying two downsampling approaches to seven datasets, 12 algorithms, and two levels of core pruning, the research demonstrates significant reductions in runtime and carbon emissions.
- Score: 0.0
- Abstract: This thesis investigates dataset downsampling as a strategy to optimize energy efficiency in recommender systems while maintaining competitive performance. With increasing dataset sizes posing computational and environmental challenges, this study explores the trade-offs between energy efficiency and recommendation quality in Green Recommender Systems, which aim to reduce environmental impact. By applying two downsampling approaches to seven datasets, 12 algorithms, and two levels of core pruning, the research demonstrates significant reductions in runtime and carbon emissions. For example, a 30% downsampling portion can reduce runtime by 52% compared to the full dataset, corresponding to a carbon emission reduction of up to 51.02 kg CO2e during the training of a single algorithm on a single dataset. The analysis reveals that algorithm performance under different downsampling portions depends on factors such as dataset characteristics, algorithm complexity, and the specific downsampling configuration, i.e., it is scenario dependent. Some algorithms with lower nDCG@10 scores than their higher-performing counterparts were less sensitive to the amount of training data, offering greater potential for efficiency at smaller downsampling portions. On average, these algorithms retained 81% of full-size performance using only 50% of the training set. In certain downsampling configurations, where more users were progressively included while the test set size was kept fixed, they even achieved higher nDCG@10 scores than with the full dataset. These findings highlight the feasibility of balancing sustainability and effectiveness, providing insights for designing energy-efficient recommender systems and promoting sustainable AI practices.
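The following is a minimal, self-contained Python sketch of the kind of portion sweep described above. It is illustrative only, not the thesis's actual pipeline or datasets: a synthetic MovieLens-style interaction table is downsampled by keeping a random portion of users, the test split is held fixed, and a simple popularity baseline is scored with nDCG@10 at several downsampling portions. All names (downsample_users, ndcg_at_10, the synthetic data, and the baseline) are assumptions made for the sketch.

```python
# Illustrative sketch (not the thesis pipeline): user-based downsampling of an
# implicit-feedback dataset and nDCG@10 evaluation of a popularity baseline.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic MovieLens-style interactions: (user, item) pairs.
interactions = pd.DataFrame({
    "user": rng.integers(0, 1000, size=20_000),
    "item": rng.integers(0, 500, size=20_000),
}).drop_duplicates()

def downsample_users(df: pd.DataFrame, portion: float, seed: int = 0) -> pd.DataFrame:
    """Keep all interactions of a random `portion` of users (one possible strategy)."""
    users = df["user"].unique()
    keep = pd.Series(users).sample(frac=portion, random_state=seed)
    return df[df["user"].isin(set(keep))]

def ndcg_at_10(recommended: list, relevant: set) -> float:
    """Binary-relevance nDCG@10 for a single user."""
    gains = sum(1.0 / np.log2(rank + 2)
                for rank, item in enumerate(recommended[:10]) if item in relevant)
    ideal = sum(1.0 / np.log2(r + 2) for r in range(min(len(relevant), 10)))
    return gains / ideal if ideal > 0 else 0.0

def split(df: pd.DataFrame):
    """Hold out 20% of each user's interactions as a fixed test set."""
    test = df.groupby("user").sample(frac=0.2, random_state=1)
    train = df.drop(test.index)
    return train, test

full_train, test = split(interactions)

for portion in (0.3, 0.5, 1.0):
    train = downsample_users(full_train, portion)
    # Popularity baseline: recommend the globally most frequent training items.
    top_items = train["item"].value_counts().index[:10].tolist()
    scores = [ndcg_at_10(top_items, set(group["item"]))
              for _, group in test.groupby("user")]
    print(f"portion={portion:.1f}  train users={train['user'].nunique():4d}  "
          f"nDCG@10={np.mean(scores):.4f}")
```

A full study along these lines would repeat such a sweep for each algorithm and dataset while also logging runtime and estimated emissions per portion.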
Related papers
- Testing the Efficacy of Hyperparameter Optimization Algorithms in Short-Term Load Forecasting [0.0]
We use the Panama Electricity dataset to evaluate the performance of HPO algorithms on a surrogate forecasting algorithm, XGBoost, in terms of accuracy (i.e., MAPE, $R^2$) and runtime.
Results reveal significant runtime advantages for HPO algorithms over Random Search.
arXiv Detail & Related papers (2024-10-19T09:08:52Z)
- Green Recommender Systems: Optimizing Dataset Size for Energy-Efficient Algorithm Performance [0.10241134756773229]
This paper investigates the potential for energy-efficient algorithm performance by optimizing dataset sizes.
We conducted experiments on the MovieLens 100K, 1M, 10M, and Amazon Toys and Games datasets.
arXiv Detail & Related papers (2024-10-12T04:00:55Z)
- Efficient Architecture Search via Bi-level Data Pruning [70.29970746807882]
This work pioneers an exploration into the critical role of dataset characteristics for DARTS bi-level optimization.
We introduce a new progressive data pruning strategy that utilizes supernet prediction dynamics as the metric.
Comprehensive evaluations on the NAS-Bench-201 search space, DARTS search space, and MobileNet-like search space validate that BDP reduces search costs by over 50%.
arXiv Detail & Related papers (2023-12-21T02:48:44Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Fast Bayesian Optimization of Needle-in-a-Haystack Problems using Zooming Memory-Based Initialization [73.96101108943986]
A Needle-in-a-Haystack problem arises when there is an extreme imbalance of optimum conditions relative to the size of the dataset.
We present a Zooming Memory-Based Initialization algorithm that builds on conventional Bayesian optimization principles.
arXiv Detail & Related papers (2022-08-26T23:57:41Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
- CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE)
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
- Balancing Performance and Energy Consumption of Bagging Ensembles for the Classification of Data Streams in Edge Computing [9.801387036837871]
Edge Computing (EC) has emerged as an enabling factor for developing technologies like the Internet of Things (IoT) and 5G networks.
This work investigates strategies for optimizing the performance and energy consumption of bagging ensembles to classify data streams.
arXiv Detail & Related papers (2022-01-17T04:12:18Z)
- Optimal Actor-Critic Policy with Optimized Training Datasets [8.372742131747522]
Actor-critic (AC) algorithms are known for their efficacy and high performance in solving reinforcement learning problems.
They also suffer from low sampling efficiency.
We propose a strategy to optimize the training dataset so that it contains significantly fewer samples collected from the AC process.
arXiv Detail & Related papers (2021-08-16T06:09:55Z)
- Sampling-Decomposable Generative Adversarial Recommender [84.05894139540048]
We propose a Sampling-Decomposable Generative Adversarial Recommender (SD-GAR)
In the framework, the divergence between the generator and the optimum is compensated by self-normalized importance sampling.
We extensively evaluate the proposed algorithm with five real-world recommendation datasets.
arXiv Detail & Related papers (2020-11-02T13:19:10Z)
- SASL: Saliency-Adaptive Sparsity Learning for Neural Network Acceleration [20.92912642901645]
We propose a Saliency-Adaptive Sparsity Learning (SASL) approach for further optimization.
Our method can reduce the FLOPs of ResNet-50 by 49.7% with a negligible 0.39% top-1 and 0.05% top-5 accuracy degradation.
arXiv Detail & Related papers (2020-03-12T16:49:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.