OutRank: Speeding up AutoML-based Model Search for Large Sparse Data
sets with Cardinality-aware Feature Ranking
- URL: http://arxiv.org/abs/2309.01552v1
- Date: Mon, 4 Sep 2023 12:07:20 GMT
- Title: OutRank: Speeding up AutoML-based Model Search for Large Sparse Data
sets with Cardinality-aware Feature Ranking
- Authors: Bla\v{z} \v{S}krlj and Bla\v{z} Mramor
- Abstract summary: We introduce OutRank, a system for versatile feature ranking and data quality-related anomaly detection.
The proposed approach enables exploration of up to 300% larger feature spaces compared to AutoML-only approaches.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The design of modern recommender systems relies on understanding which parts
of the feature space are relevant for solving a given recommendation task.
However, real-world data sets in this domain are often characterized by their
large size, sparsity, and noise, making it challenging to identify meaningful
signals. Feature ranking represents an efficient branch of algorithms that can
help address these challenges by identifying the most informative features and
facilitating the automated search for more compact and better-performing models
(AutoML). We introduce OutRank, a system for versatile feature ranking and data
quality-related anomaly detection. OutRank was built with categorical data in
mind, utilizing a variant of mutual information that is normalized with regard
to the noise produced by features of the same cardinality. We further extend
the similarity measure by incorporating information on feature similarity and
combined relevance. The proposed approach's feasibility is demonstrated by
speeding up the state-of-the-art AutoML system on a synthetic data set with no
performance loss. Furthermore, we considered a real-life click-through-rate
prediction data set where it outperformed strong baselines such as random
forest-based approaches. The proposed approach enables exploration of up to
300% larger feature spaces compared to AutoML-only approaches, enabling faster
search for better models on off-the-shelf hardware.
Related papers
- The Impacts of Data, Ordering, and Intrinsic Dimensionality on Recall in Hierarchical Navigable Small Worlds [0.09208007322096533]
Investigation focuses on HNSW's efficacy across a spectrum of datasets.
We discover that the recall of approximate HNSW search, in comparison to exact K Nearest Neighbours (KNN) search, is linked to the vector space's intrinsic dimensionality.
We observe that running popular benchmark datasets with HNSW instead of KNN can shift rankings by up to three positions for some models.
arXiv Detail & Related papers (2024-05-28T04:16:43Z) - AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z) - Causal Feature Selection via Transfer Entropy [59.999594949050596]
Causal discovery aims to identify causal relationships between features with observational data.
We introduce a new causal feature selection approach that relies on the forward and backward feature selection procedures.
We provide theoretical guarantees on the regression and classification errors for both the exact and the finite-sample cases.
arXiv Detail & Related papers (2023-10-17T08:04:45Z) - Auto-FP: An Experimental Study of Automated Feature Preprocessing for
Tabular Data [10.740391800262685]
Feature preprocessing is a crucial step to ensure good model quality.
Due to the large search space, a brute-force solution is prohibitively expensive.
We extend a variety of HPO and NAS algorithms to solve the Auto-FP problem.
arXiv Detail & Related papers (2023-10-04T02:46:44Z) - Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs a local neighborhood sampling to reduce the dataset size in each without violating neighborhood relationships.
A second strategy leverages a novel Re-Ranking technique, which has a lower time upper bound complexity and reduces the memory complexity from O(n2) to O(kn) with k n.
arXiv Detail & Related papers (2023-07-26T16:19:19Z) - Automated classification of pre-defined movement patterns: A comparison
between GNSS and UWB technology [55.41644538483948]
Real-time location systems (RTLS) allow for collecting data from human movement patterns.
The current study aims to design and evaluate an automated framework to classify human movement patterns in small areas.
arXiv Detail & Related papers (2023-03-10T14:46:42Z) - Compactness Score: A Fast Filter Method for Unsupervised Feature
Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named as, Compactness Score (CSUFS) to select desired features.
Our proposed algorithm seems to be more accurate and efficient compared with existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z) - Self-service Data Classification Using Interactive Visualization and
Interpretable Machine Learning [9.13755431537592]
Iterative Visual Logical (IVLC) is an interpretable machine learning algorithm.
IVLC is especially helpful when dealing with sensitive and crucial data like cancer data in the medical domain.
This chapter proposes an automated classification approach combined with new Coordinate Order (COO) algorithm and genetic algorithm.
arXiv Detail & Related papers (2021-07-11T05:39:14Z) - Large Scale Autonomous Driving Scenarios Clustering with Self-supervised
Feature Extraction [6.804209932400134]
This article proposes a comprehensive data clustering framework for a large set of vehicle driving data.
Our approach thoroughly considers the traffic elements, including both in-traffic agent objects and map information.
With the newly designed driving data clustering evaluation metrics based on data-augmentation, the accuracy assessment does not require a human-labeled data-set.
arXiv Detail & Related papers (2021-03-30T06:22:40Z) - Dual Adversarial Auto-Encoders for Clustering [152.84443014554745]
We propose Dual Adversarial Auto-encoder (Dual-AAE) for unsupervised clustering.
By performing variational inference on the objective function of Dual-AAE, we derive a new reconstruction loss which can be optimized by training a pair of Auto-encoders.
Experiments on four benchmarks show that Dual-AAE achieves superior performance over state-of-the-art clustering methods.
arXiv Detail & Related papers (2020-08-23T13:16:34Z) - ARDA: Automatic Relational Data Augmentation for Machine Learning [23.570173866941612]
We present system, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented data set.
Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join.
arXiv Detail & Related papers (2020-03-21T21:55:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.