Optimal Data Generation in Multi-Dimensional Parameter Spaces, using Bayesian Optimization
- URL: http://arxiv.org/abs/2312.02012v1
- Date: Mon, 4 Dec 2023 16:36:29 GMT
- Title: Optimal Data Generation in Multi-Dimensional Parameter Spaces, using Bayesian Optimization
- Authors: M. R. Mahani, Igor A. Nechepurenko, Yasmin Rahimof, Andreas Wicht
- Abstract summary: We propose a novel approach for constructing a minimal yet highly informative database for training machine learning models.
We mimic the underlying relation between the output and input parameters using Gaussian process regression (GPR).
Given the predicted standard deviation by GPR, we select data points using Bayesian optimization to obtain an efficient database for training ML models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Acquiring a substantial number of data points for training accurate
machine learning (ML) models is a major challenge in scientific fields where
data collection is resource-intensive. Here, we propose a novel approach for
constructing a minimal yet highly informative database for training ML models
in complex multi-dimensional parameter spaces. To achieve this, we mimic the
underlying relation between the output and input parameters using Gaussian
process regression (GPR). Using a set of known data, GPR provides predictive
means and standard deviations for the unknown data. Given the standard
deviation predicted by GPR, we select data points using Bayesian optimization
to obtain an efficient database for training ML models. We compare the
performance of ML models trained on databases obtained through this method
with that of models trained on databases obtained using traditional
approaches. Our results demonstrate that the ML models trained on the database
obtained using the Bayesian optimization approach consistently outperform
those trained on the other two databases, achieving high accuracy with a
significantly smaller number of data points. Our work contributes to the
resource-efficient collection of data in high-dimensional complex parameter
spaces, enabling high-precision machine learning predictions.
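To make the selection loop concrete, here is a minimal sketch of GPR-driven acquisition in Python. It is an illustrative reconstruction, not the authors' code: the simulator `expensive_simulation`, the candidate pool, and the pure-exploration acquisition (pick the candidate with the largest predicted standard deviation) are all assumptions for the example.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def expensive_simulation(x):
    # Hypothetical stand-in for the resource-intensive data source.
    return np.sin(3.0 * x[0]) * np.cos(2.0 * x[1])

rng = np.random.default_rng(0)
candidates = rng.uniform(0.0, 1.0, size=(2000, 2))  # pool over the 2-D parameter space

# Seed the database with a few random points.
X = candidates[:5].copy()
y = np.array([expensive_simulation(x) for x in X])

gpr = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)

for _ in range(50):  # acquisition budget
    gpr.fit(X, y)
    _, std = gpr.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(std)]             # most uncertain candidate
    X = np.vstack([X, x_next])
    y = np.append(y, expensive_simulation(x_next))
```

Each iteration refits the GPR on the growing database and queries the expensive data source only at its most uncertain candidate, which is how the database stays small while remaining informative.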
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
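As a rough sketch of the failure-inducing idea in the entry above: propose inputs, keep the ones the target model gets wrong, and feed them back as training data. Every helper here (proposer, target_model, evaluator) is a hypothetical placeholder, not part of ReverseGen's actual interface.

```python
def failure_inducing_synthesis(proposer, target_model, evaluator,
                               rounds=3, batch_size=64):
    """Skeleton loop: collect inputs the target model fails on (hypothetical helpers)."""
    training_samples = []
    for _ in range(rounds):
        queries = proposer.generate(batch_size)            # propose candidate inputs
        failures = [q for q in queries
                    if not evaluator(q, target_model(q))]  # keep only failing cases
        proposer.update(failures)                          # steer toward weak spots
        training_samples.extend(failures)
    return training_samples
```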
- Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data; a minimal number of available labeled data points are then assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
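A minimal sketch of the SOM-based scheme described above, using the MiniSom library; the grid size, label-propagation rule (nearest labeled BMU on the map grid), and averaging are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from minisom import MiniSom  # pip install minisom

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(1000, 4))                      # plenty of unlabeled points
X_labeled, y_labeled = X_unlabeled[:20], rng.normal(size=20)  # few known targets

# Step 1: train the SOM on unlabeled data only.
som = MiniSom(10, 10, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X_unlabeled, 5000)

# Step 2: attach each labeled point to its best matching unit (BMU).
bmu_labels = {}
for x, y in zip(X_labeled, y_labeled):
    bmu_labels.setdefault(som.winner(x), []).append(y)

def predict(x):
    # Predict with the nearest labeled BMU on the map grid.
    i, j = som.winner(x)
    key = min(bmu_labels, key=lambda u: (u[0] - i) ** 2 + (u[1] - j) ** 2)
    return float(np.mean(bmu_labels[key]))
```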
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data for each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
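The closed loop described above can be pictured as the skeleton below; all helper functions are hypothetical stand-ins, since the actual MLLM-DataEngine components are not reproduced here.

```python
def data_engine_loop(model, evaluate, analyze_weakness, sample_bad_cases,
                     generate_with_gpt4, train, iterations=3):
    """Closed-loop skeleton (all helpers are hypothetical stand-ins)."""
    for _ in range(iterations):
        report = evaluate(model)                  # benchmark the current model
        weak_types = analyze_weakness(report)     # which data types does it fail on?
        plan = sample_bad_cases(weak_types)       # adaptive ratio per weakness type
        new_data = generate_with_gpt4(plan)       # targeted high-quality samples
        model = train(model, new_data)            # close the loop
    return model
```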
- Variational Factorization Machines for Preference Elicitation in Large-Scale Recommender Systems [17.050774091903552]
We propose a variational formulation of factorization machines (FMs) that can be easily optimized using standard mini-batch gradient descent.
Our algorithm learns an approximate posterior distribution over the user and item parameters, which leads to confidence intervals over the predictions.
We show, using several datasets, that it has comparable or better performance than existing methods in terms of prediction accuracy.
arXiv Detail & Related papers (2022-12-20T00:06:28Z)
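To illustrate how a posterior over factorization machine parameters yields confidence intervals, here is a toy Monte-Carlo sketch; the Gaussian posterior over the factor matrix and its scale are assumptions for the example, not the paper's learned quantities.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    # Second-order factorization machine: bias + linear + pairwise factor terms.
    pairwise = 0.5 * (((x @ V) ** 2).sum() - ((x ** 2) @ (V ** 2)).sum())
    return w0 + x @ w + pairwise

rng = np.random.default_rng(0)
d, k = 8, 4                                  # features, factor dimension
x = rng.random(d)
w0, w = 0.1, rng.normal(size=d)
V_mu = rng.normal(size=(d, k))               # posterior mean of the factors
V_sigma = 0.05 * np.ones((d, k))             # posterior scale (assumed, for the demo)

# Monte-Carlo predictive distribution: sample factor matrices from the posterior.
draws = [fm_predict(x, w0, w, V_mu + V_sigma * rng.normal(size=(d, k)))
         for _ in range(500)]
mean = np.mean(draws)
lo, hi = np.percentile(draws, [2.5, 97.5])   # 95% predictive interval
```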
- Parameter-Efficient Sparsity for Large Language Models Fine-Tuning [63.321205487234074]
We propose a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparse-aware training.
Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) demonstrate PST performs on par or better than previous sparsity methods.
arXiv Detail & Related papers (2022-05-23T02:43:45Z)
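A toy sketch of the sparse-training idea above: combine a data-free importance score (weight magnitude) with a small learned low-rank score, so that only the low-rank factors need training. The exact scoring and how gradients are handled through the hard mask are simplified away here.

```python
import torch

d_out, d_in, r = 64, 64, 4
W = torch.randn(d_out, d_in)                    # frozen dense weight (not trained)
A = torch.randn(d_out, r) * 0.01                # small trainable low-rank factors
B = torch.randn(r, d_in) * 0.01

def sparse_weight(sparsity=0.5):
    # Importance = data-free magnitude term + data-driven low-rank learned term.
    score = W.abs() + A @ B
    keep = int(W.numel() * (1.0 - sparsity))
    # Threshold at the keep-th largest score; mask everything below it.
    thresh = score.flatten().kthvalue(W.numel() - keep + 1).values
    return W * (score >= thresh).float()
```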
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
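For context, the kNN-LM interpolation that such datastore-based models compute looks roughly like this; the distance metric, temperature, and interpolation weight are illustrative defaults, not this paper's tuned settings.

```python
import numpy as np

def knn_lm_probs(p_lm, query, keys, values, vocab_size, k=4, lam=0.25, temp=1.0):
    # Distances from the query context vector to all datastore keys.
    d = np.linalg.norm(keys - query, axis=1)
    idx = np.argsort(d)[:k]                      # the k nearest neighbors
    w = np.exp(-d[idx] / temp)
    w /= w.sum()
    p_knn = np.zeros(vocab_size)
    for weight, token_id in zip(w, values[idx]): # values store next-token ids
        p_knn[token_id] += weight
    return lam * p_knn + (1.0 - lam) * p_lm      # interpolate with the base LM
```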
- Efficient and Accurate In-Database Machine Learning with SQL Code Generation in Python [0.0]
We describe a novel method for In-Database Machine Learning (IDBML) in Python using template macros in Jinja2.
Our method was 2-3% less accurate than the best current state-of-the-art methods we found (decision trees and random forests) and 2-3 times slower for one in-memory dataset.
arXiv Detail & Related papers (2021-04-07T16:23:19Z)
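The template-macro idea can be illustrated with a small Jinja2 example; the template and column names below are invented for the illustration, not the authors' actual macros.

```python
from jinja2 import Template

# Invented template macro: build per-group aggregation SQL from a column list.
sql_template = Template(
    "SELECT {{ group_col }},\n"
    "{% for c in cols %}"
    "       AVG({{ c }}) AS avg_{{ c }}{{ ',' if not loop.last }}\n"
    "{% endfor %}"
    "FROM {{ table }}\n"
    "GROUP BY {{ group_col }};"
)

print(sql_template.render(table="measurements",
                          group_col="device_id",
                          cols=["temp", "pressure", "humidity"]))
```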
- Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning Approach [22.958342743597044]
We investigate the possibilities of utilizing deep learning for cardinality estimation of similarity selection.
We propose a novel and generic method that can be applied to any data type and distance function.
arXiv Detail & Related papers (2020-02-15T20:22:51Z)
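One generic way to make a learned cardinality estimate monotone in the distance threshold is to predict non-negative per-bin increments and cumulatively sum them; this is a standard construction shown for illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=16)               # learned embedding of the query object
W = rng.normal(size=(16, 8))          # one output per distance-threshold bin

increments = np.log1p(np.exp(q @ W))  # softplus keeps per-bin increments >= 0
estimate = np.cumsum(increments)      # estimate can only grow with the threshold
assert np.all(np.diff(estimate) >= 0)
```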
- Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.