Multitask methods for predicting molecular properties from heterogeneous data
- URL: http://arxiv.org/abs/2401.17898v2
- Date: Sat, 25 May 2024 03:10:38 GMT
- Title: Multitask methods for predicting molecular properties from heterogeneous data
- Authors: Katharine Fisher, Michael Herbst, Youssef Marzouk
- Abstract summary: We demonstrate that multitask Gaussian process regression overcomes the data generation bottleneck in training surrogate models by leveraging both expensive and cheap data sources.
We report that multitask surrogates can predict at coupled-cluster (CC) level accuracy while reducing data generation cost by over an order of magnitude.
Multitask regression can be a tool for reducing data generation costs even further by opportunistically exploiting existing data sources.
- Score: 0.27309692684728615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data generation remains a bottleneck in training surrogate models to predict molecular properties. We demonstrate that multitask Gaussian process regression overcomes this limitation by leveraging both expensive and cheap data sources. In particular, we consider training sets constructed from coupled-cluster (CC) and density functional theory (DFT) data. We report that multitask surrogates can predict at CC-level accuracy with a reduction to data generation cost by over an order of magnitude. Of note, our approach allows the training set to include DFT data generated by a heterogeneous mix of exchange-correlation functionals without imposing any artificial hierarchy on functional accuracy. More generally, the multitask framework can accommodate a wider range of training set structures -- including full disparity between the different levels of fidelity -- than existing kernel approaches based on $\Delta$-learning, though we show that the accuracy of the two approaches can be similar. Consequently, multitask regression can be a tool for reducing data generation costs even further by opportunistically exploiting existing data sources.
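To make the idea in the abstract concrete, here is a minimal sketch of multitask Gaussian process regression on toy data. It is not the authors' implementation: the intrinsic coregionalization kernel, the fixed task covariance `B`, the toy functions `expensive` and `cheap`, and all hyperparameter values are assumptions chosen for illustration. The "expensive" task stands in for CC data (few points) and the "cheap" task for DFT data (many points).

```python
import numpy as np

def rbf(xa, xb, lengthscale=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def icm_kernel(xa, ta, xb, tb, B, lengthscale=1.0):
    """Multitask kernel K((x, t), (x', t')) = B[t, t'] * k(x, x')."""
    return B[np.ix_(ta, tb)] * rbf(xa, xb, lengthscale)

def gp_mean(x_tr, t_tr, y_tr, x_te, t_te, B, noise=1e-4):
    """Posterior mean of a zero-mean GP with the multitask kernel above."""
    K = icm_kernel(x_tr, t_tr, x_tr, t_tr, B) + noise * np.eye(len(x_tr))
    K_star = icm_kernel(x_te, t_te, x_tr, t_tr, B)
    return K_star @ np.linalg.solve(K, y_tr)

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Hypothetical toy stand-ins: task 0 is the expensive, accurate source,
# task 1 is a cheap, correlated but biased source.
def expensive(x):
    return np.sin(x)

def cheap(x):
    return 0.9 * np.sin(x) + 0.05

x_hi = np.array([0.5, 3.0, 5.5])       # only three expensive samples
x_lo = np.linspace(0.0, 6.0, 15)       # many cheap samples
x_test = np.linspace(0.0, 6.0, 50)

# Assumed task covariance (fixed here; in practice it would be learned).
B = np.array([[1.0, 0.9],
              [0.9, 1.0]])

x_tr = np.concatenate([x_hi, x_lo])
t_tr = np.concatenate([np.zeros(len(x_hi), dtype=int),
                       np.ones(len(x_lo), dtype=int)])
y_tr = np.concatenate([expensive(x_hi), cheap(x_lo)])
t_test = np.zeros(len(x_test), dtype=int)   # predict the expensive task

# Multitask prediction uses both sources; the single-task baseline
# sees only the three expensive samples.
pred_multi = gp_mean(x_tr, t_tr, y_tr, x_test, t_test, B)
pred_single = gp_mean(x_hi, np.zeros(len(x_hi), dtype=int),
                      expensive(x_hi), x_test, t_test, B)

rmse_multi = rmse(pred_multi, expensive(x_test))
rmse_single = rmse(pred_single, expensive(x_test))
print(f"multitask RMSE:   {rmse_multi:.3f}")
print(f"single-task RMSE: {rmse_single:.3f}")
```

Note that the cheap and expensive inputs do not need to coincide in this formulation, which is the kind of training set flexibility the abstract contrasts with $\Delta$-learning.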
Related papers
- Analysing Multi-Task Regression via Random Matrix Theory with Application to Time Series Forecasting [16.640336442849282]
We formulate a multi-task optimization problem as a regularization technique to enable single-task models to leverage multi-task learning information.
We derive a closed-form solution for multi-task optimization in the context of linear models.
arXiv Detail & Related papers (2024-06-14T17:59:25Z) - Multifidelity linear regression for scientific machine learning from scarce data [0.0]
We propose a new multifidelity training approach for scientific machine learning via linear regression.
We provide bias and variance analysis of our new estimators that guarantee the approach's accuracy and improved robustness to scarce high-fidelity data.
arXiv Detail & Related papers (2024-03-13T15:40:17Z) - Leveraging Diffusion Disentangled Representations to Mitigate Shortcuts in Underspecified Visual Tasks [92.32670915472099]
We propose an ensemble diversification framework exploiting the generation of synthetic counterfactuals using Diffusion Probabilistic Models (DPMs).
We show that diffusion-guided diversification can lead models to avert attention from shortcut cues, achieving ensemble diversity performance comparable to previous methods requiring additional data collection.
arXiv Detail & Related papers (2023-10-03T17:37:52Z) - Heterogeneous Multi-Task Gaussian Cox Processes [61.67344039414193]
We present a novel extension of multi-task Gaussian Cox processes for modeling heterogeneous correlated tasks jointly.
A MOGP prior over the parameters of the dedicated likelihoods for classification, regression and point process tasks can facilitate sharing of information between heterogeneous tasks.
We derive a mean-field approximation to realize closed-form iterative updates for estimating model parameters.
arXiv Detail & Related papers (2023-08-29T15:01:01Z) - Multi-fidelity Fourier Neural Operator for Fast Modeling of Large-Scale Geological Carbon Storage [0.0]
We propose to use a multi-fidelity Fourier neural operator (FNO) to solve large-scale carbon storage problems.
We first test the model's efficacy on a geological carbon storage (GCS) reservoir model discretized into 110k grid cells.
The multi-fidelity model can predict with accuracy comparable to a high-fidelity model trained with the same amount of high-fidelity data, at 81% lower data generation cost.
arXiv Detail & Related papers (2023-08-17T17:44:59Z) - Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z) - Multi-Task Model Personalization for Federated Supervised SVM in Heterogeneous Networks [10.169907307499916]
Federated systems enable collaborative training on highly heterogeneous data through model personalization.
To accelerate the learning procedure for diverse participants in a multi-task federated setting, more efficient and robust methods need to be developed.
In this paper, we design an efficient iterative distributed method based on the alternating direction method of multipliers (ADMM) for support vector machines (SVMs).
The proposed method utilizes efficient computations and model exchange in a network of heterogeneous nodes and allows personalization of the learning model in the presence of non-i.i.d. data.
arXiv Detail & Related papers (2023-03-17T21:36:01Z) - Robust Direct Learning for Causal Data Fusion [14.462235940634969]
We provide a framework for integrating multi-source data that separates the treatment effect from other nuisance functions.
We also propose a causal information-aware weighting function motivated by theoretical insights from the semiparametric efficiency theory.
arXiv Detail & Related papers (2022-11-01T03:33:22Z) - Differentiable Agent-based Epidemiology [71.81552021144589]
We introduce GradABM: a scalable, differentiable design for agent-based modeling that is amenable to gradient-based learning with automatic differentiation.
GradABM can quickly simulate million-size populations in a few seconds on commodity hardware, integrate with deep neural networks, and ingest heterogeneous data sources.
arXiv Detail & Related papers (2022-07-20T07:32:02Z) - Multi-fidelity Hierarchical Neural Processes [79.0284780825048]
Multi-fidelity surrogate modeling reduces the computational cost by fusing different simulation outputs.
We propose Multi-fidelity Hierarchical Neural Processes (MF-HNP), a unified neural latent variable model for multi-fidelity surrogate modeling.
We evaluate MF-HNP on epidemiology and climate modeling tasks, achieving competitive performance in terms of accuracy and uncertainty estimation.
arXiv Detail & Related papers (2022-06-10T04:54:13Z) - Brain Image Synthesis with Unsupervised Multivariate Canonical CSC$\ell_4$Net [122.8907826672382]
We propose to learn dedicated features that cross both inter- and intra-modal variations using a novel CSC$\ell_4$Net.
arXiv Detail & Related papers (2021-03-22T05:19:40Z)