APS Explorer: Navigating Algorithm Performance Spaces for Informed Dataset Selection
- URL: http://arxiv.org/abs/2508.19399v1
- Date: Tue, 26 Aug 2025 19:46:29 GMT
- Title: APS Explorer: Navigating Algorithm Performance Spaces for Informed Dataset Selection
- Authors: Tobias Vente, Michael Heep, Abdullah Abbas, Theodor Sperle, Joeran Beel, Bart Goethals
- Abstract summary: 86% of ACM RecSys 2024 papers provide no justification for their dataset choices. Most rely on just four datasets: Amazon (38%), MovieLens (34%), Yelp (15%), and Gowalla (12%).
- Score: 0.046180371154032895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dataset selection is crucial for offline recommender system experiments, as mismatched data (e.g., sparse interaction scenarios require datasets with low user-item density) can lead to unreliable results. Yet, 86% of ACM RecSys 2024 papers provide no justification for their dataset choices, with most relying on just four datasets: Amazon (38%), MovieLens (34%), Yelp (15%), and Gowalla (12%). While Algorithm Performance Spaces (APS) were proposed to guide dataset selection, their adoption has been limited due to the absence of an intuitive, interactive tool for APS exploration. Therefore, we introduce the APS Explorer, a web-based visualization tool for interactive APS exploration, enabling data-driven dataset selection. The APS Explorer provides three interactive features: (1) an interactive PCA plot showing dataset similarity via performance patterns, (2) a dynamic meta-feature table for dataset comparisons, and (3) a specialized visualization for pairwise algorithm performance.
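The abstract mentions a PCA plot that shows dataset similarity via performance patterns. The following is a minimal sketch of that general idea, not the APS Explorer's actual implementation: it assumes a matrix with one row per dataset and one column per algorithm, filled with a single metric value. The function names `aps_pca` and `nearest`, the dataset labels, and all scores are hypothetical and made up for illustration.

```python
import numpy as np

def aps_pca(performance, n_components=2):
    """Project datasets (rows) onto their top principal components.

    performance: array of shape (n_datasets, n_algorithms) holding one
    metric value per dataset/algorithm pair (assumed layout).
    """
    X = np.asarray(performance, dtype=float)
    centered = X - X.mean(axis=0)  # center each algorithm column
    # SVD of the centered matrix gives the principal axes as rows of vt
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:n_components].T
    explained = (s ** 2) / np.sum(s ** 2)  # variance share per component
    return coords, explained[:n_components]

def nearest(index, coords):
    """Return the index of the dataset closest to `index` in PCA space."""
    dists = np.linalg.norm(coords - coords[index], axis=1)
    dists[index] = np.inf  # exclude the dataset itself
    return int(np.argmin(dists))

# Hypothetical, made-up scores: 4 datasets x 3 algorithms
datasets = ["A", "B", "C", "D"]
scores = np.array([
    [0.30, 0.28, 0.10],
    [0.31, 0.27, 0.11],
    [0.05, 0.20, 0.25],
    [0.06, 0.22, 0.24],
])

coords, explained = aps_pca(scores)
for name, (x, y) in zip(datasets, coords):
    print(f"{name}: ({x:+.3f}, {y:+.3f})")
print("closest to A:", datasets[nearest(0, coords)])
```

Datasets whose algorithms perform similarly land close together in the projected space, which is the sense in which such a plot could suggest datasets that are redundant or complementary for an experiment.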
Related papers
- Informed Dataset Selection [0.0]
We developed the APS Explorer, a web application that implements the Algorithm Performance Space framework for informed dataset selection. The system analyzes 96 datasets using 28 algorithms across three metrics (nDCG, Hit Ratio, Recall) at five K-values. We extend the APS framework with a statistics-based classification system that categorizes datasets into five difficulty levels based on quintiles.
arXiv Detail & Related papers (2025-09-30T16:04:51Z) - LABELING COPILOT: A Deep Research Agent for Automated Data Curation in Computer Vision [13.437102865245285]
We introduce Labeling Copilot, the first data curation deep research agent for computer vision. A central orchestrator agent, powered by a large multimodal language model, uses multi-step reasoning to execute specialized tools across three core capabilities.
arXiv Detail & Related papers (2025-09-26T17:55:26Z) - COLLAGE: Adaptive Fusion-based Retrieval for Augmented Policy Learning [19.173177969412656]
We present COLLAGE, a method for COLLective data AGgrEgation in few-shot imitation learning. Collage uses an adaptive late fusion mechanism to guide the selection of relevant demonstrations based on a task-specific combination of multiple cues. Collage outperforms state-of-the-art retrieval and multi-task learning approaches by 5.1% in simulation across 10 tasks, and by 16.6% in the real world across 6 tasks.
arXiv Detail & Related papers (2025-08-02T01:23:09Z) - Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning [53.527506374566485]
We propose a novel Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning cluster framework, namely AR-DBSCAN. We show that AR-DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% in the NMI and ARI metrics, respectively, but is also capable of robustly finding dominant parameters.
arXiv Detail & Related papers (2025-05-07T11:37:23Z) - Group-Level Data Selection for Efficient Pretraining [49.18903821780051]
Group-MATES is an efficient group-level data selection approach to optimize the speed-quality frontier of language model pretraining. Group-MATES parameterizes costly group-level selection with a relational data influence model.
arXiv Detail & Related papers (2025-02-20T16:34:46Z) - TSceneJAL: Joint Active Learning of Traffic Scenes for 3D Object Detection [26.059907173437114]
The TSceneJAL framework can efficiently sample balanced, diverse, and complex traffic scenes from both labeled and unlabeled data. Our approach outperforms existing state-of-the-art methods on 3D object detection tasks with up to 12% improvements.
arXiv Detail & Related papers (2024-12-25T11:07:04Z) - UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models [24.50445616970387]
We introduce UP-DP, a simple yet effective unsupervised prompt learning approach that adapts vision-language models for data pre-selection.
Specifically, with the BLIP-2 parameters frozen, we train text prompts to extract the joint features with improved representation.
We extensively compare our method with the state of the art on seven benchmark datasets in different settings, achieving a performance gain of up to 20%.
arXiv Detail & Related papers (2023-07-20T20:45:13Z) - Going beyond research datasets: Novel intent discovery in the industry setting [60.90117614762879]
This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform.
We show the benefit of pre-training language models on in-domain data: both self-supervised and with weak supervision.
We also devise the best method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks, which we call Conv.
arXiv Detail & Related papers (2023-05-09T14:21:29Z) - Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embeddings.
arXiv Detail & Related papers (2022-06-07T17:59:44Z) - DataLab: A Platform for Data Analysis and Intervention [96.75253335629534]
DataLab is a unified data-oriented platform that allows users to interactively analyze the characteristics of data.
DataLab has features for dataset recommendation and global vision analysis.
So far, DataLab covers 1,715 datasets and 3,583 of their transformed versions.
arXiv Detail & Related papers (2022-02-25T18:32:19Z) - Auto-weighted Multi-view Feature Selection with Graph Optimization [90.26124046530319]
We propose a novel unsupervised multi-view feature selection model based on graph learning.
The contributions are threefold: (1) during the feature selection procedure, the consensus similarity graph shared by different views is learned.
Experiments on various datasets demonstrate the superiority of the proposed method compared with the state-of-the-art methods.
arXiv Detail & Related papers (2021-04-11T03:25:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.