Synthetic Data for Feature Selection
- URL: http://arxiv.org/abs/2211.03035v1
- Date: Sun, 6 Nov 2022 05:57:01 GMT
- Title: Synthetic Data for Feature Selection
- Authors: Firuz Kamalov, Hana Sulieman, Aswani Kumar Cherukuri
- Abstract summary: We propose a collection of synthetic datasets that can be used as a common reference point for feature selection algorithms.
The proposed datasets are based on applications from electronics in order to mimic real-life scenarios.
The datasets are made publicly available on GitHub and can be used by researchers to evaluate feature selection algorithms.
- Score: 5.8010446129208155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Feature selection is an important and active field of research in machine
learning and data science. Our goal in this paper is to propose a collection of
synthetic datasets that can be used as a common reference point for feature
selection algorithms. Synthetic datasets allow for precise evaluation of
selected features and control of the data parameters for comprehensive
assessment. The proposed datasets are based on applications from electronics in
order to mimic real-life scenarios. To illustrate the utility of the proposed
data, we employ one of the datasets to test several popular feature selection
algorithms. The datasets are made publicly available on GitHub and can be used
by researchers to evaluate feature selection algorithms.
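The paper's datasets themselves live on GitHub and are not reproduced here, but the core idea can be sketched: build a synthetic dataset whose relevant features are known by construction, then check whether a feature selector recovers them. The generator, the correlation-based scorer, and all names below are illustrative assumptions, not the authors' code.

```python
# Minimal sketch: synthetic data with a known ground-truth feature subset,
# scored by a simple filter-style selector (absolute Pearson correlation).
import random

def make_synthetic(n_samples=500, n_features=10, relevant=(0, 3, 7), seed=42):
    """Target is a linear combination of the `relevant` features plus noise;
    the remaining features are pure noise, so the ground truth is known."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [sum(row[j] for j in relevant) + rng.gauss(0, 0.1) for row in X]
    return X, y

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs) ** 0.5
    vy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (vx * vy)

def select_features(X, y, k):
    """Rank features by |correlation with y| and keep the top k."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(sorted(range(n_features), key=lambda j: -scores[j])[:k])

X, y = make_synthetic()
print(select_features(X, y, k=3))  # ground truth is features 0, 3, 7
```

Because the relevant subset is fixed in advance, precision and recall of any selector can be computed exactly — the evaluation property the abstract highlights.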
Related papers
- DataMIL: Selecting Data for Robot Imitation Learning with Datamodels [77.48472034791213]
We introduce DataMIL, a policy-driven data selection framework built on the datamodels paradigm. Unlike standard practices that filter data using human notions of quality, DataMIL directly optimizes data selection for task success. We validate our approach on a suite of more than 60 simulation and real-world manipulation tasks.
arXiv Detail & Related papers (2025-05-14T17:55:10Z)
- Algorithm Performance Spaces for Strategic Dataset Selection [0.0]
The evaluation of new algorithms in recommender systems frequently depends on publicly available datasets, such as those from MovieLens or Amazon. This thesis introduces the Algorithm Performance Space, a framework designed to differentiate datasets based on the measured performance of algorithms applied to them.
arXiv Detail & Related papers (2025-04-29T12:29:52Z)
- Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems [0.0]
Synthetic datasets are important for evaluating and testing machine learning models.
We develop a novel framework for generating synthetic datasets that are diverse and statistically coherent.
The framework is available as a free, open-source Python package to facilitate research with minimal friction.
arXiv Detail & Related papers (2024-11-27T09:53:14Z)
- Metadata-based Data Exploration with Retrieval-Augmented Generation for Large Language Models [3.7685718201378746]
This research introduces a new architecture for data exploration which employs a form of Retrieval-Augmented Generation (RAG) to enhance metadata-based data discovery.
The proposed framework offers a new method for evaluating semantic similarity among heterogeneous data sources.
arXiv Detail & Related papers (2024-10-05T17:11:37Z)
- Data Generation via Latent Factor Simulation for Fairness-aware Re-ranking [11.133319460036082]
Synthetic data is a useful resource for algorithmic research.
We propose a novel type of data for fairness-aware recommendation: synthetic recommender system outputs.
arXiv Detail & Related papers (2024-09-21T09:13:50Z)
- Personalized Federated Learning via Active Sampling [50.456464838807115]
This paper proposes a novel method for sequentially identifying similar (or relevant) data generators.
Our method evaluates the relevance of a data generator by evaluating the effect of a gradient step using its local dataset.
We extend this method to non-parametric models by a suitable generalization of the gradient step to update a hypothesis using the local dataset provided by a data generator.
arXiv Detail & Related papers (2024-09-03T17:12:21Z)
- Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models [36.22392593103493]
Data selection for fine-tuning large language models (LLMs) aims to choose a high-quality subset from existing datasets.
Existing surveys overlook an in-depth exploration of the fine-tuning phase.
We introduce a novel three-stage scheme - comprising feature extraction, criteria design, and selector evaluation - to systematically categorize and evaluate these methods.
arXiv Detail & Related papers (2024-06-20T08:58:58Z)
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exhibits the reasoning skills necessary for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning [131.2910403490434]
Data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones.
Existing benchmarks for tabular feature selection consider classical downstream models, toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance.
We construct a challenging feature selection benchmark evaluated on downstream neural networks including transformers.
We also propose an input-gradient-based analogue of Lasso for neural networks that outperforms classical feature selection methods on challenging problems.
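A hedged sketch of the input-gradient idea mentioned above (finite differences stand in for autograd; the tiny network, the data, and every name below are illustrative assumptions, not the benchmark's code): a feature the network ignores receives a near-zero average absolute input gradient, so an L1-style penalty on these scores plays the role the Lasso penalty plays for linear coefficients.

```python
# Illustrative input-gradient feature-importance score for a fixed network.
import math, random

def mlp(x, W1, b1, w2):
    """One-hidden-layer tanh network with a scalar output."""
    h = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return sum(v * hi for v, hi in zip(w2, h))

def input_gradient_importance(f, X, eps=1e-5):
    """Average |df/dx_j| over samples, via central finite differences."""
    n_features = len(X[0])
    scores = [0.0] * n_features
    for x in X:
        for j in range(n_features):
            xp = list(x); xp[j] += eps
            xm = list(x); xm[j] -= eps
            scores[j] += abs(f(xp) - f(xm)) / (2 * eps)
    return [s / len(X) for s in scores]

rng = random.Random(1)
# First-layer weights: the network ignores feature 2 entirely (zero column).
W1 = [[rng.gauss(0, 1), rng.gauss(0, 1), 0.0] for _ in range(4)]
b1 = [rng.gauss(0, 1) for _ in range(4)]
w2 = [rng.gauss(0, 1) for _ in range(4)]
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(50)]

scores = input_gradient_importance(lambda x: mlp(x, W1, b1, w2), X)
print(scores[2] < 1e-6 < min(scores[0], scores[1]))  # unused feature scores ~0
```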
arXiv Detail & Related papers (2023-11-10T05:26:10Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions [100.52917027038369]
We operationalize the task of recommending datasets given a short natural language description.
To facilitate this task, we build the DataFinder dataset which consists of a larger automatically-constructed training set and a smaller expert-annotated evaluation set.
This system, trained on the DataFinder dataset, finds more relevant search results than existing third-party dataset search engines.
arXiv Detail & Related papers (2023-05-26T05:22:36Z)
- Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset Evaluation for Text Classification [39.01740345482624]
In this paper, we ask the research question of whether all the datasets in the benchmark are necessary.
Experiments on 9 datasets and 36 systems show that several existing benchmark datasets contribute little to discriminating top-scoring systems, while those less used datasets exhibit impressive discriminative power.
arXiv Detail & Related papers (2022-05-04T15:33:00Z)
- On Feature Selection Using Anisotropic General Regression Neural Network [3.880707330499936]
The presence of irrelevant features in the input dataset tends to reduce the interpretability and predictive quality of machine learning models.
Here we show how the General Regression Neural Network used with an anisotropic Gaussian Kernel can be used to perform feature selection.
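A minimal sketch of why per-feature bandwidths in an anisotropic Gaussian kernel act as relevance knobs (illustrative only, not the paper's implementation): driving a feature's bandwidth to infinity flattens the kernel along that axis, which is equivalent to dropping the feature from the model.

```python
# Anisotropic GRNN (Nadaraya-Watson) regression with per-feature bandwidths.
import math, random

def grnn_predict(X_train, y_train, x, sigmas):
    """Kernel-weighted average of training targets; one sigma per feature."""
    weights = []
    for xi in X_train:
        d = sum(((a - b) / s) ** 2 for a, b, s in zip(xi, x, sigmas))
        weights.append(math.exp(-0.5 * d))
    total = sum(weights)
    return sum(w * yi for w, yi in zip(weights, y_train)) / total

rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [math.sin(3 * a) for a, _ in X]          # only feature 0 matters

query = [0.2, -0.5]
# A huge bandwidth on feature 1 is equivalent to dropping it entirely.
wide = grnn_predict(X, y, query, sigmas=[0.3, 1e9])
dropped = grnn_predict([[a] for a, _ in X], y, [query[0]], sigmas=[0.3])
print(abs(wide - dropped) < 1e-6)  # True: feature 1 has been switched off
```

Fitting the bandwidths (e.g. by minimizing leave-one-out error) then performs feature selection implicitly: irrelevant features end up with large bandwidths.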
arXiv Detail & Related papers (2020-10-12T14:35:40Z)
- On the Use of Interpretable Machine Learning for the Management of Data Quality [13.075880857448059]
We propose the use of interpretable machine learning to identify the features on which any data processing activity should be based.
Our aim is to secure data quality, at least, for those features that are detected as significant in the collected datasets.
arXiv Detail & Related papers (2020-07-29T08:49:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.