Optimal Data Generation in Multi-Dimensional Parameter Spaces, using Bayesian Optimization
- URL: http://arxiv.org/abs/2312.02012v1
- Date: Mon, 4 Dec 2023 16:36:29 GMT
- Title: Optimal Data Generation in Multi-Dimensional Parameter Spaces, using Bayesian Optimization
- Authors: M. R. Mahani, Igor A. Nechepurenko, Yasmin Rahimof, Andreas Wicht
- Abstract summary: We propose a novel approach for constructing a minimal yet highly informative database for training machine learning models.
We mimic the underlying relation between the output and input parameters using Gaussian process regression (GPR).
Given the predicted standard deviation by GPR, we select data points using Bayesian optimization to obtain an efficient database for training ML models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Acquiring a substantial number of data points for training accurate
machine learning (ML) models is a major challenge in scientific fields where
data collection is resource-intensive. Here, we propose a novel approach for
constructing a minimal yet highly informative database for training ML models
in complex multi-dimensional parameter spaces. To achieve this, we mimic the
underlying relation between the output and input parameters using Gaussian
process regression (GPR). Using a set of known data, GPR provides predictive
means and standard deviations for the unknown data. Given the standard
deviation predicted by GPR, we select data points using Bayesian optimization
to obtain an efficient database for training ML models. We compare the
performance of ML models trained on databases obtained through this method
with that of models trained on databases obtained using traditional
approaches. Our results demonstrate that the ML models trained on the database
obtained using the Bayesian optimization approach consistently outperform
those trained on the other two databases, achieving high accuracy with a
significantly smaller number of data points. Our work contributes to the
resource-efficient collection of data in high-dimensional complex parameter
spaces, enabling high-precision machine learning predictions.
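To make the selection loop concrete, here is a minimal sketch of GPR-driven acquisition in Python. It is an illustrative reconstruction, not the authors' code: the simulator `expensive_simulation`, the candidate pool, and the pure-exploration acquisition (pick the candidate with the largest predicted standard deviation) are all assumptions for the example.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def expensive_simulation(x):
    # Hypothetical stand-in for the resource-intensive data source.
    return np.sin(3.0 * x[0]) * np.cos(2.0 * x[1])

rng = np.random.default_rng(0)
candidates = rng.uniform(0.0, 1.0, size=(2000, 2))  # pool over the 2-D parameter space

# Seed the database with a few random points.
X = candidates[:5].copy()
y = np.array([expensive_simulation(x) for x in X])

gpr = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)

for _ in range(50):  # acquisition budget
    gpr.fit(X, y)
    _, std = gpr.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(std)]             # most uncertain candidate
    X = np.vstack([X, x_next])
    y = np.append(y, expensive_simulation(x_next))
```

Each iteration refits the GPR on the growing database and queries the expensive data source only at its most uncertain candidate, which is how the database stays small while remaining informative.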
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
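As a rough sketch of the failure-inducing idea in the entry above: propose inputs, keep the ones the target model gets wrong, and feed them back as training data. Every helper here (proposer, target_model, evaluator) is a hypothetical placeholder, not part of ReverseGen's actual interface.

```python
def failure_inducing_synthesis(proposer, target_model, evaluator,
                               rounds=3, batch_size=64):
    """Skeleton loop: collect inputs the target model fails on (hypothetical helpers)."""
    training_samples = []
    for _ in range(rounds):
        queries = proposer.generate(batch_size)            # propose candidate inputs
        failures = [q for q in queries
                    if not evaluator(q, target_model(q))]  # keep only failing cases
        proposer.update(failures)                          # steer toward weak spots
        training_samples.extend(failures)
    return training_samples
```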
- Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data; a minimal number of available labeled data points are then assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
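A minimal sketch of the SOM-based scheme described above, using the MiniSom library; the grid size, label-propagation rule (nearest labeled BMU on the map grid), and averaging are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from minisom import MiniSom  # pip install minisom

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(1000, 4))                      # plenty of unlabeled points
X_labeled, y_labeled = X_unlabeled[:20], rng.normal(size=20)  # few known targets

# Step 1: train the SOM on unlabeled data only.
som = MiniSom(10, 10, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X_unlabeled, 5000)

# Step 2: attach each labeled point to its best matching unit (BMU).
bmu_labels = {}
for x, y in zip(X_labeled, y_labeled):
    bmu_labels.setdefault(som.winner(x), []).append(y)

def predict(x):
    # Predict with the nearest labeled BMU on the map grid.
    i, j = som.winner(x)
    key = min(bmu_labels, key=lambda u: (u[0] - i) ** 2 + (u[1] - j) ** 2)
    return float(np.mean(bmu_labels[key]))
```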
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data for each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
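The closed loop described above can be pictured as the skeleton below; all helper functions are hypothetical stand-ins, since the actual MLLM-DataEngine components are not reproduced here.

```python
def data_engine_loop(model, evaluate, analyze_weakness, sample_bad_cases,
                     generate_with_gpt4, train, iterations=3):
    """Closed-loop skeleton (all helpers are hypothetical stand-ins)."""
    for _ in range(iterations):
        report = evaluate(model)                  # benchmark the current model
        weak_types = analyze_weakness(report)     # which data types does it fail on?
        plan = sample_bad_cases(weak_types)       # adaptive ratio per weakness type
        new_data = generate_with_gpt4(plan)       # targeted high-quality samples
        model = train(model, new_data)            # close the loop
    return model
```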
- Variational Factorization Machines for Preference Elicitation in Large-Scale Recommender Systems [17.050774091903552]
We propose a variational formulation of factorization machines (FMs) that can be easily optimized using standard mini-batch gradient descent.
Our algorithm learns an approximate posterior distribution over the user and item parameters, which leads to confidence intervals over the predictions.
We show, using several datasets, that it has comparable or better performance than existing methods in terms of prediction accuracy.
arXiv Detail & Related papers (2022-12-20T00:06:28Z)
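To illustrate how a posterior over factorization machine parameters yields confidence intervals, here is a toy Monte-Carlo sketch; the Gaussian posterior over the factor matrix and its scale are assumptions for the example, not the paper's learned quantities.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    # Second-order factorization machine: bias + linear + pairwise factor terms.
    pairwise = 0.5 * (((x @ V) ** 2).sum() - ((x ** 2) @ (V ** 2)).sum())
    return w0 + x @ w + pairwise

rng = np.random.default_rng(0)
d, k = 8, 4                                  # features, factor dimension
x = rng.random(d)
w0, w = 0.1, rng.normal(size=d)
V_mu = rng.normal(size=(d, k))               # posterior mean of the factors
V_sigma = 0.05 * np.ones((d, k))             # posterior scale (assumed, for the demo)

# Monte-Carlo predictive distribution: sample factor matrices from the posterior.
draws = [fm_predict(x, w0, w, V_mu + V_sigma * rng.normal(size=(d, k)))
         for _ in range(500)]
mean = np.mean(draws)
lo, hi = np.percentile(draws, [2.5, 97.5])   # 95% predictive interval
```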
- Parameter-Efficient Sparsity for Large Language Models Fine-Tuning [63.321205487234074]
We propose a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparse-aware training.
Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) demonstrate PST performs on par or better than previous sparsity methods.
arXiv Detail & Related papers (2022-05-23T02:43:45Z)
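A toy sketch of the sparse-training idea above: combine a data-free importance score (weight magnitude) with a small learned low-rank score, so that only the low-rank factors need training. The exact scoring and how gradients are handled through the hard mask are simplified away here.

```python
import torch

d_out, d_in, r = 64, 64, 4
W = torch.randn(d_out, d_in)                    # frozen dense weight (not trained)
A = torch.randn(d_out, r) * 0.01                # small trainable low-rank factors
B = torch.randn(r, d_in) * 0.01

def sparse_weight(sparsity=0.5):
    # Importance = data-free magnitude term + data-driven low-rank learned term.
    score = W.abs() + A @ B
    keep = int(W.numel() * (1.0 - sparsity))
    # Threshold at the keep-th largest score; mask everything below it.
    thresh = score.flatten().kthvalue(W.numel() - keep + 1).values
    return W * (score >= thresh).float()
```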
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
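For context, the kNN-LM interpolation that such datastore-based models compute looks roughly like this; the distance metric, temperature, and interpolation weight are illustrative defaults, not this paper's tuned settings.

```python
import numpy as np

def knn_lm_probs(p_lm, query, keys, values, vocab_size, k=4, lam=0.25, temp=1.0):
    # Distances from the query context vector to all datastore keys.
    d = np.linalg.norm(keys - query, axis=1)
    idx = np.argsort(d)[:k]                      # the k nearest neighbors
    w = np.exp(-d[idx] / temp)
    w /= w.sum()
    p_knn = np.zeros(vocab_size)
    for weight, token_id in zip(w, values[idx]): # values store next-token ids
        p_knn[token_id] += weight
    return lam * p_knn + (1.0 - lam) * p_lm      # interpolate with the base LM
```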
- Efficient and Accurate In-Database Machine Learning with SQL Code Generation in Python [0.0]
We describe a novel method for In-Database Machine Learning (IDBML) in Python using template macros in Jinja2.
Our method was 2-3% less accurate than the best current state-of-the-art methods we found (decision trees and random forests) and 2-3 times slower for one in-memory dataset.
arXiv Detail & Related papers (2021-04-07T16:23:19Z)
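The template-macro idea can be illustrated with a small Jinja2 example; the template and column names below are invented for the illustration, not the authors' actual macros.

```python
from jinja2 import Template

# Invented template macro: build per-group aggregation SQL from a column list.
sql_template = Template(
    "SELECT {{ group_col }},\n"
    "{% for c in cols %}"
    "       AVG({{ c }}) AS avg_{{ c }}{{ ',' if not loop.last }}\n"
    "{% endfor %}"
    "FROM {{ table }}\n"
    "GROUP BY {{ group_col }};"
)

print(sql_template.render(table="measurements",
                          group_col="device_id",
                          cols=["temp", "pressure", "humidity"]))
```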
- Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning Approach [22.958342743597044]
We investigate the possibilities of utilizing deep learning for cardinality estimation of similarity selection.
We propose a novel and generic method that can be applied to any data type and distance function.
arXiv Detail & Related papers (2020-02-15T20:22:51Z)
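One generic way to make a learned cardinality estimate monotone in the distance threshold is to predict non-negative per-bin increments and cumulatively sum them; this is a standard construction shown for illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=16)               # learned embedding of the query object
W = rng.normal(size=(16, 8))          # one output per distance-threshold bin

increments = np.log1p(np.exp(q @ W))  # softplus keeps per-bin increments >= 0
estimate = np.cumsum(increments)      # estimate can only grow with the threshold
assert np.all(np.diff(estimate) >= 0)
```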
- Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.