Data Fusion of Deep Learned Molecular Embeddings for Property Prediction
- URL: http://arxiv.org/abs/2504.07297v1
- Date: Wed, 09 Apr 2025 21:40:15 GMT
- Title: Data Fusion of Deep Learned Molecular Embeddings for Property Prediction
- Authors: Robert J Appleton, Brian C Barnes, Alejandro Strachan,
- Abstract summary: We use data fusion techniques to combine the learned molecular embeddings of various single-task models and trained a multi-task model on this combined embedding.<n>We show that the fused, multi-task models outperform standard multi-task models for sparse datasets and can provide enhanced prediction on data-limited properties compared to single-task models.
- Score: 44.99833362998488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-driven approaches such as deep learning can result in predictive models for material properties with exceptional accuracy and efficiency. However, in many problems data is sparse, severely limiting their accuracy and applicability. To improve predictions, techniques such as transfer learning and multi-task learning have been used. The performance of multi-task learning models depends on the strength of the underlying correlations between tasks and the completeness of the dataset. We find that standard multi-task models tend to underperform when trained on sparse datasets with weakly correlated properties. To address this gap, we use data fusion techniques to combine the learned molecular embeddings of various single-task models and trained a multi-task model on this combined embedding. We apply this technique to a widely used benchmark dataset of quantum chemistry data for small molecules as well as a newly compiled sparse dataset of experimental data collected from literature and our own quantum chemistry and thermochemical calculations. The results show that the fused, multi-task models outperform standard multi-task models for sparse datasets and can provide enhanced prediction on data-limited properties compared to single-task models.
Related papers
- Exploring the Efficacy of Meta-Learning: Unveiling Superior Data Diversity Utilization of MAML Over Pre-training [1.3980986259786223]
We show that dataset diversity can impact the performance of vision models.<n>Our study shows positive correlations between test set accuracy and data diversity.<n>These findings support our hypothesis and demonstrate a promising way for a deeper exploration of how formal data diversity influences model performance.
arXiv Detail & Related papers (2025-01-15T00:56:59Z) - Physical Consistency Bridges Heterogeneous Data in Molecular Multi-Task Learning [79.75718786477638]
We exploit the specialty of molecular tasks that there are physical laws connecting them, and design consistency training approaches.
We demonstrate that the more accurate energy data can improve the accuracy of structure prediction.
We also find that consistency training can directly leverage force and off-equilibrium structure data to improve structure prediction.
arXiv Detail & Related papers (2024-10-14T03:11:33Z) - Analysing Multi-Task Regression via Random Matrix Theory with Application to Time Series Forecasting [16.640336442849282]
We formulate a multi-task optimization problem as a regularization technique to enable single-task models to leverage multi-task learning information.
We derive a closed-form solution for multi-task optimization in the context of linear models.
arXiv Detail & Related papers (2024-06-14T17:59:25Z) - Towards Precision Healthcare: Robust Fusion of Time Series and Image Data [8.579651833717763]
We introduce a new method that uses two separate encoders, one for each type of data, allowing the model to understand complex patterns in both visual and time-based information.
We also deal with imbalanced datasets and use an uncertainty loss function, yielding improved results.
Our experiments show that our method is effective in improving multimodal deep learning for clinical applications.
arXiv Detail & Related papers (2024-05-24T11:18:13Z) - Diffusion Model is an Effective Planner and Data Synthesizer for
Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (textscMTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find textscMTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z) - Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z) - CHALLENGER: Training with Attribution Maps [63.736435657236505]
We show that utilizing attribution maps for training neural networks can improve regularization of models and thus increase performance.
In particular, we show that our generic domain-independent approach yields state-of-the-art results in vision, natural language processing and on time series tasks.
arXiv Detail & Related papers (2022-05-30T13:34:46Z) - Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z) - Statistical learning for accurate and interpretable battery lifetime
prediction [1.738360170201861]
We develop simple, accurate, and interpretable data-driven models for battery lifetime prediction.
Our approaches can be used both to quickly train models for a new dataset and to benchmark the performance of more advanced machine learning methods.
arXiv Detail & Related papers (2021-01-06T06:05:24Z) - Polymer Informatics with Multi-Task Learning [0.06524460254566902]
We show the potency of multi-task learning approaches that exploit inherent correlations effectively.
Data pertaining to 36 different properties of over $13, 000$ polymers are coalesced and supplied to deep-learning multi-task architectures.
The multi-task approach is accurate, efficient, scalable, and amenable to transfer learning as more data on the same or different properties become available.
arXiv Detail & Related papers (2020-10-28T18:28:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.