Related papers: OutRank: Speeding up AutoML-based Model Search for Large Sparse Data sets with Cardinality-aware Feature Ranking

URL: http://arxiv.org/abs/2309.01552v1
Date: Mon, 4 Sep 2023 12:07:20 GMT
Title: OutRank: Speeding up AutoML-based Model Search for Large Sparse Data sets with Cardinality-aware Feature Ranking
Authors: Blaž Škrlj and Blaž Mramor
Abstract summary: We introduce OutRank, a system for versatile feature ranking and data quality-related anomaly detection. The proposed approach enables exploration of up to 300% larger feature spaces compared to AutoML-only approaches.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The design of modern recommender systems relies on understanding which parts of the feature space are relevant for solving a given recommendation task. However, real-world data sets in this domain are often characterized by their large size, sparsity, and noise, making it challenging to identify meaningful signals. Feature ranking represents an efficient branch of algorithms that can help address these challenges by identifying the most informative features and facilitating the automated search for more compact and better-performing models (AutoML). We introduce OutRank, a system for versatile feature ranking and data quality-related anomaly detection. OutRank was built with categorical data in mind, utilizing a variant of mutual information that is normalized with regard to the noise produced by features of the same cardinality. We further extend the similarity measure by incorporating information on feature similarity and combined relevance. The proposed approach's feasibility is demonstrated by speeding up the state-of-the-art AutoML system on a synthetic data set with no performance loss. Furthermore, we considered a real-life click-through-rate prediction data set where it outperformed strong baselines such as random forest-based approaches. The proposed approach enables exploration of up to 300% larger feature spaces compared to AutoML-only approaches, enabling faster search for better models on off-the-shelf hardware.
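The cardinality-aware normalization described in the abstract can be illustrated with a minimal sketch. This is not OutRank's implementation: the function names (`mutual_information`, `cardinality_aware_score`) are hypothetical, and the noise baseline is assumed here to be the mean mutual information of shuffled copies of the feature, which have the same cardinality but no real relationship to the target.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def cardinality_aware_score(feature, target, n_noise=20, seed=0):
    """Raw MI minus the mean MI of noise features with the same cardinality.

    Assumption of this sketch: noise is generated by shuffling the feature,
    which preserves its cardinality and value counts.
    """
    rng = random.Random(seed)
    feature = list(feature)
    noise_mi = [
        mutual_information(rng.sample(feature, len(feature)), target)
        for _ in range(n_noise)
    ]
    return mutual_information(feature, target) - sum(noise_mi) / n_noise


if __name__ == "__main__":
    rng = random.Random(42)
    target = [rng.randrange(2) for _ in range(2000)]
    informative = list(target)       # two values, fully predictive
    row_id = list(range(2000))       # unique per row, pure noise
    for name, feat in [("informative", informative), ("row_id", row_id)]:
        raw = mutual_information(feat, target)
        adj = cardinality_aware_score(feat, target)
        print(f"{name}: raw MI = {raw:.3f}, adjusted = {adj:.3f}")
```

In this toy setup a unique row identifier has the same raw mutual information as the truly informative feature, because every value pins down exactly one label. After subtracting the same-cardinality noise baseline its score collapses to roughly zero, while the informative feature keeps nearly all of its score.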

Related papers

Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning [53.527506374566485]
We propose a novel Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning cluster framework, namely AR-DBSCAN. We show that AR-DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% in the NMI and ARI metrics, respectively, but also is capable of robustly finding dominant parameters.
arXiv Detail & Related papers (2025-05-07T11:37:23Z)
The Impacts of Data, Ordering, and Intrinsic Dimensionality on Recall in Hierarchical Navigable Small Worlds [0.09208007322096533]
The investigation focuses on HNSW's efficacy across a spectrum of datasets. We discover that the recall of approximate HNSW search, in comparison to exact K Nearest Neighbours (KNN) search, is linked to the vector space's intrinsic dimensionality. We observe that running popular benchmark datasets with HNSW instead of KNN can shift rankings by up to three positions for some models.
arXiv Detail & Related papers (2024-05-28T04:16:43Z)
AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios. We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z)
Causal Feature Selection via Transfer Entropy [59.999594949050596]
Causal discovery aims to identify causal relationships between features with observational data. We introduce a new causal feature selection approach that relies on the forward and backward feature selection procedures. We provide theoretical guarantees on the regression and classification errors for both the exact and the finite-sample cases.
arXiv Detail & Related papers (2023-10-17T08:04:45Z)
Auto-FP: An Experimental Study of Automated Feature Preprocessing for Tabular Data [10.740391800262685]
Feature preprocessing is a crucial step to ensure good model quality. Due to the large search space, a brute-force solution is prohibitively expensive. We extend a variety of HPO and NAS algorithms to solve the Auto-FP problem.
arXiv Detail & Related papers (2023-10-04T02:46:44Z)
Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data. The first strategy performs a local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships. A second strategy leverages a novel Re-Ranking technique, which has a lower time upper bound complexity and reduces the memory complexity from O(n^2) to O(kn) with k ≪ n.
arXiv Detail & Related papers (2023-07-26T16:19:19Z)
Automated classification of pre-defined movement patterns: A comparison between GNSS and UWB technology [55.41644538483948]
Real-time location systems (RTLS) allow for collecting data from human movement patterns. The current study aims to design and evaluate an automated framework to classify human movement patterns in small areas.
arXiv Detail & Related papers (2023-03-10T14:46:42Z)
Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features. Our proposed algorithm seems to be more accurate and efficient compared with existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
Self-service Data Classification Using Interactive Visualization and Interpretable Machine Learning [9.13755431537592]
Iterative Visual Logical Classifier (IVLC) is an interpretable machine learning algorithm. IVLC is especially helpful when dealing with sensitive and crucial data like cancer data in the medical domain. This chapter proposes an automated classification approach combined with a new Coordinate Order Optimizer (COO) algorithm and a genetic algorithm.
arXiv Detail & Related papers (2021-07-11T05:39:14Z)
Large Scale Autonomous Driving Scenarios Clustering with Self-supervised Feature Extraction [6.804209932400134]
This article proposes a comprehensive data clustering framework for a large set of vehicle driving data. Our approach thoroughly considers the traffic elements, including both in-traffic agent objects and map information. With the newly designed driving data clustering evaluation metrics based on data-augmentation, the accuracy assessment does not require a human-labeled data-set.
arXiv Detail & Related papers (2021-03-30T06:22:40Z)
Dual Adversarial Auto-Encoders for Clustering [152.84443014554745]
We propose Dual Adversarial Auto-encoder (Dual-AAE) for unsupervised clustering. By performing variational inference on the objective function of Dual-AAE, we derive a new reconstruction loss which can be optimized by training a pair of Auto-encoders. Experiments on four benchmarks show that Dual-AAE achieves superior performance over state-of-the-art clustering methods.
arXiv Detail & Related papers (2020-08-23T13:16:34Z)
ARDA: Automatic Relational Data Augmentation for Machine Learning [23.570173866941612]
We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented data set. Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join.
arXiv Detail & Related papers (2020-03-21T21:55:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.