ARDA: Automatic Relational Data Augmentation for Machine Learning
- URL: http://arxiv.org/abs/2003.09758v1
- Date: Sat, 21 Mar 2020 21:55:22 GMT
- Title: ARDA: Automatic Relational Data Augmentation for Machine Learning
- Authors: Nadiia Chepurko, Ryan Marcus, Emanuel Zgraggen, Raul Castro Fernandez,
Tim Kraska, David Karger
- Abstract summary: We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented dataset.
Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join.
- Score: 23.570173866941612
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic machine learning (AML) is a family of techniques to automate the
process of training predictive models, aiming to both improve performance and
make machine learning more accessible. While many recent works have focused on
aspects of the machine learning pipeline like model selection, hyperparameter
tuning, and feature selection, relatively few works have focused on automatic
data augmentation. Automatic data augmentation involves finding new features
relevant to the user's predictive task with minimal "human-in-the-loop"
involvement.
We present ARDA, an end-to-end system that takes as input a dataset and a
data repository, and outputs an augmented data set such that training a
predictive model on this augmented dataset results in improved performance. Our
system has two distinct components: (1) a framework to search and join data
with the input data, based on various attributes of the input, and (2) an
efficient feature selection algorithm that prunes out noisy or irrelevant
features from the resulting join. We perform an extensive empirical evaluation
of different system components and benchmark our feature selection algorithm on
real-world datasets.
Related papers
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing a small "forget set" of training data from a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that machine unlearning techniques do not hold up in such a challenging setting.
arXiv Detail & Related papers (2024-10-30T17:20:10Z) - Continual Learning for Multimodal Data Fusion of a Soft Gripper [1.0589208420411014]
A model trained on one data modality often fails when tested with a different modality.
We introduce a continual learning algorithm capable of incrementally learning different data modalities.
We evaluate the algorithm's effectiveness on a challenging custom multimodal dataset.
arXiv Detail & Related papers (2024-09-20T09:53:27Z) - AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - Towards Open-World Feature Extrapolation: An Inductive Graph Learning
Approach [80.8446673089281]
We propose a new learning paradigm with graph representation and learning.
Our framework contains two modules: 1) a backbone network (e.g., a feedforward neural net) as a lower model that takes features as input and outputs predicted labels; 2) a graph neural network as an upper model that learns to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data.
arXiv Detail & Related papers (2021-10-09T09:02:45Z) - Federated Feature Selection for Cyber-Physical Systems of Systems [0.3609538870261841]
Our results show that a fleet of autonomous vehicles finds a consensus on the optimal set of features that they exploit to reduce data transmission up to 99% with negligible information loss.
arXiv Detail & Related papers (2021-09-23T12:16:50Z) - Self-service Data Classification Using Interactive Visualization and
Interpretable Machine Learning [9.13755431537592]
Iterative Visual Logical (IVLC) is an interpretable machine learning algorithm.
IVLC is especially helpful when dealing with sensitive and crucial data like cancer data in the medical domain.
This chapter proposes an automated classification approach that combines a new Coordinate Order (COO) algorithm with a genetic algorithm.
arXiv Detail & Related papers (2021-07-11T05:39:14Z) - Text Classification Using Hybrid Machine Learning Algorithms on Big Data [0.0]
In this work, two supervised machine learning algorithms are combined with text mining techniques to produce a hybrid model.
The results show that the hybrid model achieved 96.76% accuracy, against 61.45% and 69.21% for the Naïve Bayes and SVM models respectively.
arXiv Detail & Related papers (2021-03-30T19:02:48Z) - Adaptive Weighting Scheme for Automatic Time-Series Data Augmentation [79.47771259100674]
We present two sample-adaptive automatic weighting schemes for data augmentation.
We validate our proposed methods on a large, noisy financial dataset and on time-series datasets from the UCR archive.
On the financial dataset, we show that the methods in combination with a trading strategy lead to improvements in annualized returns of over 50%, and on the time-series data we outperform state-of-the-art models on over half of the datasets and achieve similar accuracy on the others.
arXiv Detail & Related papers (2021-02-16T17:50:51Z) - Feedback-Based Dynamic Feature Selection for Constrained Continuous Data
Acquisition [6.947442090579469]
We propose a feedback-based dynamic feature selection algorithm that efficiently decides on the feature set for data collection from a dynamic system in a step-wise manner.
Our evaluation shows that the proposed feedback-based feature selection algorithm has superior performance over constrained baseline methods.
arXiv Detail & Related papers (2020-11-10T14:19:01Z) - AutoFIS: Automatic Feature Interaction Selection in Factorization Models
for Click-Through Rate Prediction [75.16836697734995]
We propose a two-stage algorithm called Automatic Feature Interaction Selection (AutoFIS).
AutoFIS can automatically identify important feature interactions for factorization models with computational cost just equivalent to training the target model to convergence.
AutoFIS has been deployed onto the training platform of Huawei App Store recommendation service.
arXiv Detail & Related papers (2020-03-25T06:53:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.