Investigating Data Hierarchies in Multifidelity Machine Learning for Excitation Energies
- URL: http://arxiv.org/abs/2410.11392v1
- Date: Tue, 15 Oct 2024 08:35:00 GMT
- Title: Investigating Data Hierarchies in Multifidelity Machine Learning for Excitation Energies
- Authors: Vivin Vinod, Peter Zaspel
- Abstract summary: This study investigates the impact of modifying $\gamma$ on model efficiency and accuracy for the prediction of vertical excitation energies using the QeMFi benchmark dataset.
A novel error metric, error contours of MFML, is proposed to provide a comprehensive view of model error contributions from each fidelity.
The results indicate that high model accuracy can be achieved with just 2 training samples at the target fidelity when a larger number of samples from lower fidelities are used.
- Abstract: Recent progress in machine learning (ML) has made high-accuracy quantum chemistry (QC) calculations more accessible. Of particular interest are multifidelity machine learning (MFML) methods where training data from differing accuracies or fidelities are used. These methods usually employ a fixed scaling factor, $\gamma$, to relate the number of training samples across different fidelities, which reflects the cost and assumed sparsity of the data. This study investigates the impact of modifying $\gamma$ on model efficiency and accuracy for the prediction of vertical excitation energies using the QeMFi benchmark dataset. Further, this work introduces QC compute time informed scaling factors, denoted as $\theta$, that vary based on QC compute times at different fidelities. A novel error metric, error contours of MFML, is proposed to provide a comprehensive view of model error contributions from each fidelity. The results indicate that high model accuracy can be achieved with just 2 training samples at the target fidelity when a larger number of samples from lower fidelities are used. This is further illustrated through a novel concept, the $\Gamma$-curve, which compares model error against the time-cost of generating training samples, demonstrating that multifidelity models can achieve high accuracy while minimizing training data costs.
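To make the sampling hierarchy concrete, here is a minimal sketch assuming the common MFML convention that sample counts grow geometrically down the hierarchy, $N_f = \gamma^{F-f} N_F$; the number of fidelities and the per-sample compute times below are illustrative placeholders, not values from QeMFi.

```python
# Minimal sketch of an MFML training-data hierarchy and its generation cost.
# Assumes the common convention N_f = gamma**(F - f) * N_F; the 3-level
# hierarchy and per-sample times are illustrative placeholders.

def sample_counts(n_target: int, gamma: int, n_fidelities: int) -> list[int]:
    """Training samples at each fidelity, cheapest fidelity first."""
    top = n_fidelities - 1  # index of the target (highest) fidelity
    return [n_target * gamma ** (top - f) for f in range(n_fidelities)]

def data_cost(counts: list[int], sec_per_sample: list[float]) -> float:
    """Total time to generate the training data: one point on a Gamma-curve."""
    return sum(n * t for n, t in zip(counts, sec_per_sample))

if __name__ == "__main__":
    times = [0.5, 10.0, 600.0]  # hypothetical seconds/sample, cheap -> expensive
    for gamma in (2, 4, 8):
        counts = sample_counts(n_target=2, gamma=gamma, n_fidelities=3)
        print(gamma, counts, f"{data_cost(counts, times):.0f} s")
```

A QC-compute-time-informed $\theta$ would replace the fixed $\gamma$ with per-level ratios derived from measured timings; sweeping the scaling factor and plotting model error against `data_cost` traces out the $\Gamma$-curve described in the abstract.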
Related papers
- Transfer Learning on Multi-Dimensional Data: A Novel Approach to Neural Network-Based Surrogate Modeling [0.0]
Convolutional neural networks (CNNs) have gained popularity as the basis for such surrogate models.
We propose training a CNN surrogate model on a mixture of numerical solutions to both the $d$-dimensional problem and its ($d-1$)-dimensional approximation.
We demonstrate our approach on a multiphase flow test problem, using transfer learning to train a dense fully-convolutional encoder-decoder CNN on the two classes of data.
arXiv Detail & Related papers (2024-10-16T05:07:48Z) - Benchmarking Data Efficiency in $\Delta$-ML and Multifidelity Models for Quantum Chemistry [0.0]
This work compares the data costs associated with $\Delta$-ML, multifidelity machine learning (MFML), and optimized MFML (o-MFML).
The results indicate that the use of multifidelity methods surpasses the standard $Delta$-ML approaches in cases of a large number of predictions.
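For context, the $\Delta$-ML baseline in this comparison is simple to sketch: train a regressor on the difference between high- and low-fidelity labels, then add the cheap low-fidelity value back at prediction time. The snippet below uses scikit-learn kernel ridge regression on placeholder data and illustrates the general technique, not the paper's implementation.

```python
# Minimal Delta-ML sketch (not the paper's code): learn the correction
# E_high - E_low with kernel ridge regression, then add the cheap
# low-fidelity value back at prediction time. All data are placeholders.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))            # placeholder molecular descriptors
e_low = X.sum(axis=1)                    # placeholder low-fidelity energies
e_high = e_low + 0.1 * np.sin(X[:, 0])   # placeholder high-fidelity energies

# 'gamma' here is the RBF kernel width, unrelated to the MFML scaling factor.
delta_model = KernelRidge(kernel="rbf", alpha=1e-6, gamma=0.1)
delta_model.fit(X, e_high - e_low)       # train only on the difference

def predict(X_new, e_low_new):
    """Delta-ML prediction: cheap baseline plus learned correction."""
    return e_low_new + delta_model.predict(X_new)
```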
arXiv Detail & Related papers (2024-10-15T08:34:32Z) - Learning Augmentation Policies from A Model Zoo for Time Series Forecasting [58.66211334969299]
We introduce AutoTSAug, a learnable data augmentation method based on reinforcement learning.
By augmenting the marginal samples with a learnable policy, AutoTSAug substantially improves forecasting performance.
arXiv Detail & Related papers (2024-09-10T07:34:19Z) - Assessing Non-Nested Configurations of Multifidelity Machine Learning for Quantum-Chemical Properties [0.0]
Multifidelity machine learning (MFML) for quantum chemical (QC) properties has seen strong development in recent years.
This work assesses the use of non-nested training data for two of these multifidelity methods, namely MFML and optimized MFML.
arXiv Detail & Related papers (2024-07-24T08:34:08Z) - Multifidelity linear regression for scientific machine learning from scarce data [0.0]
We propose a new multifidelity training approach for scientific machine learning via linear regression.
We provide bias and variance analysis of our new estimators that guarantee the approach's accuracy and improved robustness to scarce high-fidelity data.
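One common construction of such an estimator, sketched below under the assumption of a shared feature space, mirrors the general low-fidelity-fit-plus-correction idea rather than the paper's exact estimator:

```python
# Hedged sketch of multifidelity linear regression: fit on abundant
# low-fidelity data, then fit a correction on the scarce high-fidelity
# residuals. Illustrative only; not the paper's exact estimator.
import numpy as np

def multifidelity_lstsq(X_lo, y_lo, X_hi, y_hi):
    w_lo, *_ = np.linalg.lstsq(X_lo, y_lo, rcond=None)    # cheap-data fit
    resid = y_hi - X_hi @ w_lo                            # high-fidelity residuals
    w_corr, *_ = np.linalg.lstsq(X_hi, resid, rcond=None)
    return w_lo + w_corr                                  # combined coefficients

rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5, 3.0])
X_lo, X_hi = rng.normal(size=(500, 4)), rng.normal(size=(10, 4))
y_lo = X_lo @ (w_true + 0.2) + 0.1 * rng.normal(size=500)  # biased, noisy surrogate
y_hi = X_hi @ w_true + 0.01 * rng.normal(size=10)          # scarce accurate data
print(multifidelity_lstsq(X_lo, y_lo, X_hi, y_hi))
```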
arXiv Detail & Related papers (2024-03-13T15:40:17Z) - The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Value function estimation using conditional diffusion models for control [62.27184818047923]
We propose a simple algorithm called Diffused Value Function (DVF).
It learns a joint multi-step model of the environment-robot interaction dynamics using a diffusion model.
We show how DVF can be used to efficiently capture the state visitation measure for multiple controllers.
arXiv Detail & Related papers (2023-06-09T18:40:55Z) - Machine Learning Force Fields with Data Cost Aware Training [94.78998399180519]
Machine learning force fields (MLFF) have been proposed to accelerate molecular dynamics (MD) simulation.
Even for the most data-efficient MLFFs, reaching chemical accuracy can require hundreds of frames of force and energy labels.
We propose a multi-stage computational framework -- ASTEROID, which lowers the data cost of MLFFs by leveraging a combination of cheap inaccurate data and expensive accurate data.
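The underlying two-stage idea can be sketched in a few lines: pretrain on plentiful but inaccurate labels, then fine-tune on the small accurate set starting from the pretrained weights. The example below uses scikit-learn's `SGDRegressor` on synthetic placeholder data; it illustrates the strategy, not ASTEROID's actual pipeline.

```python
# Two-stage, cost-aware training sketch (illustrative, not ASTEROID's code):
# stage 1 pretrains on cheap noisy labels, stage 2 fine-tunes on a small
# accurate set, reusing the stage-1 weights as a warm start.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(2)
w_true = rng.normal(size=16)
X_cheap = rng.normal(size=(5000, 16))
y_cheap = X_cheap @ w_true + 0.5 * rng.normal(size=5000)  # cheap, inaccurate labels
X_acc = rng.normal(size=(50, 16))
y_acc = X_acc @ w_true                                    # scarce, accurate labels

model = SGDRegressor(learning_rate="constant", eta0=1e-3)
model.partial_fit(X_cheap, y_cheap)        # stage 1: pretrain on cheap data
for _ in range(200):                       # stage 2: fine-tune on accurate data
    model.partial_fit(X_acc, y_acc)
print("coef error:", np.linalg.norm(model.coef_ - w_true))
```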
arXiv Detail & Related papers (2023-06-05T04:34:54Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We further examine the causes of multi-epoch degradation, finding that dataset size, model parameters, and training objectives are significant factors.
arXiv Detail & Related papers (2023-05-22T17:02:15Z) - Multi-fidelity Hierarchical Neural Processes [79.0284780825048]
Multi-fidelity surrogate modeling reduces the computational cost by fusing different simulation outputs.
We propose Multi-fidelity Hierarchical Neural Processes (MF-HNP), a unified neural latent variable model for multi-fidelity surrogate modeling.
We evaluate MF-HNP on epidemiology and climate modeling tasks, achieving competitive performance in terms of accuracy and uncertainty estimation.
arXiv Detail & Related papers (2022-06-10T04:54:13Z) - Adaptive Reliability Analysis for Multi-fidelity Models using a
Collective Learning Strategy [6.368679897630892]
This study presents a new approach called adaptive multi-fidelity Gaussian process for reliability analysis (AMGPRA).
It is shown that the proposed method achieves similar or higher accuracy with reduced computational costs compared to state-of-the-art single and multi-fidelity methods.
A key application of AMGPRA is high-fidelity fragility modeling using complex and costly physics-based computational models.
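A simplified single-fidelity analogue of this adaptive strategy is the standard U-function enrichment loop from adaptive-kriging reliability analysis: repeatedly fit a Gaussian process and add the candidate whose failure/safe classification is most uncertain. The sketch below uses a hypothetical limit-state function; AMGPRA itself extends this kind of loop across fidelities with a collective learning function.

```python
# Single-fidelity adaptive-GP reliability sketch (AK-MCS-style U-function);
# a simplified analogue of the adaptive idea, not AMGPRA itself.
# The limit-state function g is hypothetical: g(x) < 0 denotes failure.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def g(x):
    return 3.0 - x[:, 0] ** 2 - 0.5 * x[:, 1]

rng = np.random.default_rng(3)
pool = rng.normal(size=(2000, 2))            # Monte Carlo candidate pool
X, y = pool[:10].copy(), g(pool[:10])        # small initial design

for _ in range(20):                          # adaptive enrichment loop
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mu, sd = gp.predict(pool, return_std=True)
    k = int(np.argmin(np.abs(mu) / np.maximum(sd, 1e-12)))  # most uncertain sign
    X = np.vstack([X, pool[k]])
    y = np.append(y, g(pool[k:k + 1]))

print("estimated failure probability:", np.mean(mu < 0))
```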
arXiv Detail & Related papers (2021-09-21T14:42:58Z)