Related papers: Machine learning with persistent homology and chemical word embeddings improves prediction accuracy and interpretability in metal-organic frameworks

URL: http://arxiv.org/abs/2010.00532v2
Date: Wed, 31 Mar 2021 17:56:48 GMT
Title: Machine learning with persistent homology and chemical word embeddings improves prediction accuracy and interpretability in metal-organic frameworks
Authors: Aditi S. Krishnapriyan, Joseph Montoya, Maciej Haranczyk, Jens Hummelshøj, Dmitriy Morozov
Abstract summary: We introduce an end-to-end machine learning model that automatically generates descriptors that capture a complex representation of a material's structure and chemistry. It automatically encapsulates geometric and chemical information directly from the material system. Our results show considerable improvement in both accuracy and transferability across targets compared to models constructed from the commonly-used, manually-curated features.
Score: 0.07874708385247352
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Machine learning has emerged as a powerful approach in materials discovery. Its major challenge is selecting features that create interpretable representations of materials, useful across multiple prediction tasks. We introduce an end-to-end machine learning model that automatically generates descriptors that capture a complex representation of a material's structure and chemistry. This approach builds on computational topology techniques (namely, persistent homology) and word embeddings from natural language processing. It automatically encapsulates geometric and chemical information directly from the material system. We demonstrate our approach on multiple nanoporous metal-organic framework datasets by predicting methane and carbon dioxide adsorption across different conditions. Our results show considerable improvement in both accuracy and transferability across targets compared to models constructed from the commonly-used, manually-curated features, consistently achieving an average 25-30% decrease in root-mean-squared-deviation and an average increase of 40-50% in R² scores. A key advantage of our approach is interpretability: our model identifies the pores that correlate best to adsorption at different pressures, which contributes to understanding atomic-level structure–property relationships for materials design.
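
The abstract names persistent homology as the source of its geometric descriptors. As a rough illustration of what a persistence computation produces, the sketch below computes only the 0-dimensional persistence pairs (connected components) of a point cloud under a Vietoris-Rips filtration, using a union-find structure. This is an assumption-laden toy, not the authors' pipeline: the function name `h0_persistence` and the sample points are hypothetical, and describing pores and channels would need higher-dimensional homology, which this sketch does not compute.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Return (birth, death) pairs for the connected components of a point cloud.

    Hypothetical helper for illustration only: every point is born at scale 0,
    and a component dies at the length of the edge that merges it into another.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the component representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, processed in order of increasing length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    pairs = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge here, so one of them dies at this scale.
            parent[root_i] = root_j
            pairs.append((0.0, length))

    # The last surviving component never dies.
    pairs.append((0.0, math.inf))
    return pairs


# Three collinear points: two close together, one far away.
diagram = h0_persistence([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)])
print(diagram)
```

Long-lived pairs (a large gap between birth and death) mark features that persist across many length scales. A vectorized summary of such pairs is the kind of quantity that could be fed to a regression model as a descriptor.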

Related papers

Chemistry Integrated Language Model using Hierarchical Molecular Representation for Polymer Informatics [0.0]
Machine learning has transformed material discovery for inorganic compounds and small molecules, yet polymers remain largely inaccessible to these methods. We introduce CI-LLM, a framework combining HAPPY, which encodes chemical substructures as tokens, with numerical descriptors within transformer architectures. For property prediction, De$3$BERTa achieves 3.5x faster inference than SMILES-based models with improved accuracy. For inverse design, our GPT-based generator produces polymers with targeted properties, achieving 100 percent scaffold retention and successful multi-property optimization.
arXiv Detail & Related papers (2025-12-06T05:07:11Z)
Foundation Models for Discovery and Exploration in Chemical Space [57.97784111110166]
MIST is a family of molecular foundation models trained on large unlabeled datasets. We demonstrate the ability of these models to solve real-world problems across chemical space.
arXiv Detail & Related papers (2025-10-20T17:56:01Z)
Benchmarking Multimodal LLMs on Recognition and Understanding over Chemical Tables [48.39080455781475]
ChemTable is a large-scale benchmark of real-world chemical tables curated from the experimental sections of literature. ChemTable includes expert-annotated cell polygons, logical layouts, and domain-specific labels, including reagents, catalysts, yields, and graphical components. We evaluated a range of representative multimodal models, including both open-source and closed-source models, on ChemTable and reported a series of findings with practical and conceptual insights.
arXiv Detail & Related papers (2025-06-13T00:45:41Z)
DARWIN 1.5: Large Language Models as Materials Science Adapted Learners [46.7259033847682]
We propose DARWIN 1.5, the largest open-source large language model tailored for materials science. DARWIN eliminates the need for task-specific descriptors and enables a flexible, unified approach to material property prediction and discovery. Our approach integrates 6M material domain papers and 21 experimental datasets from 49,256 materials across modalities while enabling cross-task knowledge transfer.
arXiv Detail & Related papers (2024-12-16T16:51:27Z)
Data-efficient and Interpretable Inverse Materials Design using a Disentangled Variational Autoencoder [2.563209727695243]
Inverse materials design has proven successful in accelerating novel material discovery. Many inverse materials design methods use unsupervised learning where a latent space is learned to offer a compact description of materials representations. Here, we present a semi-supervised learning approach based on a disentangled variational autoencoder to learn a probabilistic relationship between features, latent variables and target properties.
arXiv Detail & Related papers (2024-09-10T02:21:13Z)
Stacked ensemble-based mutagenicity prediction model using multiple modalities with graph attention network [0.9736758288065405]
Mutagenicity is a concern due to its association with genetic mutations which can result in a variety of negative consequences. In this work, we introduce a novel stacked ensemble based mutagenicity prediction model.
arXiv Detail & Related papers (2024-09-03T09:14:21Z)
A Large Encoder-Decoder Family of Foundation Models For Chemical Language [1.1073864511426255]
This paper introduces a family of large encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem. Our experiments across multiple benchmark datasets validate the capacity of the proposed models to provide state-of-the-art results on different tasks.
arXiv Detail & Related papers (2024-07-24T20:30:39Z)
Benchmark on Drug Target Interaction Modeling from a Structure Perspective [48.60648369785105]
Drug-target interaction prediction is crucial to drug discovery and design. Recent methods, such as those based on graph neural networks (GNNs) and Transformers, demonstrate exceptional performance across various datasets. We conduct a comprehensive survey and benchmark for drug-target interaction modeling from a structure perspective, via integrating tens of explicit (i.e., GNN-based) and implicit (i.e., Transformer-based) structure learning algorithms.
arXiv Detail & Related papers (2024-07-04T16:56:59Z)
Fine-Tuned Language Models Generate Stable Inorganic Materials as Text [57.01994216693825]
Fine-tuning large language models on text-encoded atomistic data is simple to implement yet reliable. We show that our strongest model can generate materials predicted to be metastable at about twice the rate of CDVAE. Because of text prompting's inherent flexibility, our models can simultaneously be used for unconditional generation of stable materials.
arXiv Detail & Related papers (2024-02-06T20:35:28Z)
On the importance of catalyst-adsorbate 3D interactions for relaxed energy predictions [98.70797778496366]
We investigate whether it is possible to predict a system's relaxed energy in the OC20 dataset while ignoring the relative position of the adsorbate. We find that while removing binding site information impairs accuracy as expected, modified models are able to predict relaxed energies with remarkably decent MAE.
arXiv Detail & Related papers (2023-10-10T14:57:04Z)
Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction. Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations. On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z)
A Machine Learning Method for Material Property Prediction: Example Polymer Compatibility [39.364776649251944]
We present a brand-new and general machine learning method for material property prediction. As a representative example, polymer compatibility is chosen to demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2022-02-28T05:48:05Z)
Improving VAE based molecular representations for compound property prediction [0.0]
We propose a simple method to improve chemical property prediction performance of machine learning models. We show the relation between the performance of property prediction models and the distance between property prediction dataset and the larger unlabeled dataset.
arXiv Detail & Related papers (2022-01-13T12:57:11Z)
Unassisted Noise Reduction of Chemical Reaction Data Sets [59.127921057012564]
We propose a machine learning-based, unassisted approach to remove chemically wrong entries from data sets. Our results show an improved prediction quality for models trained on the cleaned and balanced data sets.
arXiv Detail & Related papers (2021-02-02T09:34:34Z)
Graph Neural Network for Hamiltonian-Based Material Property Prediction [56.94118357003096]
We present and compare several different graph convolution networks that are able to predict the band gap for inorganic materials. The models are developed to incorporate two different features: the information of each orbital itself and the interaction between each other. The results show that our model can get a promising prediction accuracy with cross-validation.
arXiv Detail & Related papers (2020-05-27T13:32:10Z)
Explainable Deep Relational Networks for Predicting Compound-Protein Affinities and Contacts [80.69440684790925]
DeepRelations is a physics-inspired deep relational network with intrinsically explainable architecture. It shows superior interpretability to the state-of-the-art. It boosts the AUPRC of contact prediction 9.5, 16.9, 19.3 and 5.7-fold for the test, compound-unique, protein-unique, and both-unique sets.
arXiv Detail & Related papers (2019-12-29T00:14:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.