A smile is all you need: Predicting limiting activity coefficients from
SMILES with natural language processing
- URL: http://arxiv.org/abs/2206.07048v1
- Date: Wed, 15 Jun 2022 07:11:37 GMT
- Title: A smile is all you need: Predicting limiting activity coefficients from
SMILES with natural language processing
- Authors: Benedikt Winter, Clemens Winter, Johannes Schilling, André Bardow
- Abstract summary: We introduce the SMILES-to-Properties-Transformer (SPT), a natural language processing network to predict binary limiting activity coefficients from SMILES codes.
We train our network on a large dataset of synthetic data sampled from COSMO-RS and fine-tune the model on experimental data.
This training strategy enables SPT to accurately predict limiting activity coefficients even for unknown molecules, cutting the mean prediction error in half compared to state-of-the-art models.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Knowledge of mixtures' phase equilibria is crucial in nature and technical
chemistry. Phase equilibria calculations of mixtures require activity
coefficients. However, experimental data on activity coefficients is often
limited due to the high cost of experiments. For an accurate and efficient
prediction of activity coefficients, machine learning approaches have been
recently developed. However, current machine learning approaches still
extrapolate poorly for activity coefficients of unknown molecules. In this
work, we introduce the SMILES-to-Properties-Transformer (SPT), a natural
language processing network to predict binary limiting activity coefficients
from SMILES codes. To overcome the limitations of available experimental data,
we initially train our network on a large dataset of synthetic data sampled
from COSMO-RS (10 Million data points) and then fine-tune the model on
experimental data (20 870 data points). This training strategy enables SPT to
accurately predict limiting activity coefficients even for unknown molecules,
cutting the mean prediction error in half compared to state-of-the-art activity
coefficient models such as COSMO-RS and UNIFAC, and improving on recent machine
learning approaches.
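Since SPT is a natural language processing network, its input is the SMILES string itself, treated as a token sequence. As a minimal illustration of that idea, here is a character-level SMILES tokenizer; the vocabulary, special tokens, and corpus below are hypothetical and not taken from the paper:

```python
# Minimal character-level SMILES tokenizer: an illustrative sketch of how a
# transformer such as SPT might turn a SMILES string into input token IDs.
# The special tokens and example corpus are made up for illustration.

def build_vocab(smiles_list):
    """Map every character seen in the corpus to an integer ID."""
    vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2}
    for smi in smiles_list:
        for ch in smi:
            if ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab

def encode(smiles, vocab):
    """Encode one SMILES string as <bos> ... <eos> token IDs."""
    return [vocab["<bos>"]] + [vocab[ch] for ch in smiles] + [vocab["<eos>"]]

corpus = ["CCO", "c1ccccc1", "CC(=O)O"]  # ethanol, benzene, acetic acid
vocab = build_vocab(corpus)
ids = encode("CCO", vocab)
```

Real SMILES tokenizers usually work on chemically meaningful tokens (e.g. two-letter elements such as `Cl` and bracketed atoms) rather than raw characters, but the sequence-in, IDs-out interface is the same.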
Related papers
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance with respect to the mixture proportions of training data, expressed in functional forms.
We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law.
Our method effectively optimizes the training mixture of a 1B model trained for 100B tokens on RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z)
- Transition Role of Entangled Data in Quantum Machine Learning [51.6526011493678]
Entanglement serves as the resource to empower quantum computing.
Recent progress has highlighted its positive impact on learning quantum dynamics.
We establish a quantum no-free-lunch (NFL) theorem for learning quantum dynamics using entangled data.
arXiv Detail & Related papers (2023-06-06T08:06:43Z)
- Machine learning enabled experimental design and parameter estimation for ultrafast spin dynamics [54.172707311728885]
We introduce a methodology that combines machine learning with Bayesian optimal experimental design (BOED).
Our method employs a neural network surrogate of large-scale spin dynamics simulations to enable precise distribution and utility calculations in BOED.
Our numerical benchmarks demonstrate the superior performance of our method in guiding XPFS experiments, predicting model parameters, and yielding more informative measurements within limited experimental time.
arXiv Detail & Related papers (2023-06-03T06:19:20Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We further examine the key factors contributing to multi-epoch degradation, finding that dataset size, model parameters, and training objectives all play significant roles.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
- Development and Evaluation of Conformal Prediction Methods for QSAR [0.5161531917413706]
The quantitative structure-activity relationship (QSAR) regression model is a commonly used technique for predicting biological activities of compounds.
Most machine learning (ML) algorithms that achieve superior predictive performance require add-on methods for estimating the uncertainty of their predictions.
Conformal prediction (CP) is a promising approach. It is agnostic to the prediction algorithm and can produce valid prediction intervals under some weak assumptions on the data distribution.
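The core mechanism of split conformal prediction is simple enough to sketch directly: calibrate a quantile of nonconformity scores on held-out data, then pad any model's point prediction by that quantile. This is a generic illustration of the CP idea using absolute residuals as the score, not the specific methods evaluated in the paper:

```python
import math

# Split conformal prediction for regression: pad a point prediction by the
# finite-sample-corrected quantile of calibration-set absolute residuals.
# The residual values below are made-up illustrative numbers.

def conformal_interval(cal_residuals, y_pred, alpha=0.1):
    """Return a (lower, upper) prediction interval with ~(1 - alpha) coverage.

    cal_residuals: |y_true - y_hat| on a held-out calibration set.
    y_pred: point prediction for a new input from any regression model.
    """
    n = len(cal_residuals)
    # ceil((n + 1) * (1 - alpha))-th smallest score, the standard split-CP rank.
    k = math.ceil((n + 1) * (1 - alpha))
    q = sorted(cal_residuals)[min(k, n) - 1]
    return (y_pred - q, y_pred + q)

residuals = [0.1, 0.3, 0.2, 0.5, 0.4, 0.25, 0.15, 0.35, 0.45, 0.05]
lo, hi = conformal_interval(residuals, y_pred=2.0, alpha=0.2)
```

The validity guarantee only needs the calibration and test points to be exchangeable, which is why CP is agnostic to the underlying prediction algorithm.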
arXiv Detail & Related papers (2023-04-03T13:41:09Z)
- Predictive Scale-Bridging Simulations through Active Learning [43.48102250786867]
We use an active learning approach to optimize the use of local fine-scale simulations for informing coarse-scale hydrodynamics.
Our approach addresses three challenges: forecasting the continuum coarse-scale trajectory, dynamically updating the coarse scale from fine-scale calculations, and quantifying uncertainty in neural network models.
arXiv Detail & Related papers (2022-09-20T15:58:50Z)
- SPT-NRTL: A physics-guided machine learning model to predict thermodynamically consistent activity coefficients [0.12352483741564477]
We introduce SPT-NRTL, a machine learning model to predict thermodynamically consistent activity coefficients.
SPT-NRTL achieves higher accuracy than UNIFAC in the prediction of activity coefficients across all functional groups.
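The thermodynamic consistency comes from predicting parameters of the NRTL model rather than activity coefficients directly. The closed-form binary NRTL equations are standard and can be sketched as follows; the parameter values in the example are arbitrary illustrative numbers, not outputs of SPT-NRTL:

```python
import math

# Standard binary NRTL activity coefficient equations: the thermodynamic
# model whose parameters a network like SPT-NRTL predicts. The tau and
# alpha values used below are made up for illustration.

def nrtl_binary(x1, tau12, tau21, alpha=0.3):
    """Return (gamma1, gamma2) for a binary mixture from the NRTL model."""
    x2 = 1.0 - x1
    g12 = math.exp(-alpha * tau12)
    g21 = math.exp(-alpha * tau21)
    ln_g1 = x2 ** 2 * (tau21 * (g21 / (x1 + x2 * g21)) ** 2
                       + tau12 * g12 / (x2 + x1 * g12) ** 2)
    ln_g2 = x1 ** 2 * (tau12 * (g12 / (x2 + x1 * g12)) ** 2
                       + tau21 * g21 / (x1 + x2 * g21) ** 2)
    return math.exp(ln_g1), math.exp(ln_g2)

# Limiting (infinite-dilution) activity coefficient of component 1: x1 -> 0,
# where ln(gamma1_inf) reduces to tau21 + tau12 * exp(-alpha * tau12).
gamma1_inf, _ = nrtl_binary(x1=1e-12, tau12=1.5, tau21=0.8)
```

Because the predicted gammas always come from a valid NRTL parameterization, they satisfy the pure-component limit gamma -> 1 by construction, which is exactly the kind of consistency a direct gamma-prediction network cannot guarantee.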
arXiv Detail & Related papers (2022-09-09T06:21:05Z)
- Advancing Reacting Flow Simulations with Data-Driven Models [50.9598607067535]
Key to effective use of machine learning tools in multi-physics problems is to couple them to physical and computer models.
The present chapter reviews some of the open opportunities for the application of data-driven reduced-order modeling of combustion systems.
arXiv Detail & Related papers (2022-09-05T16:48:34Z)
- Building Robust Machine Learning Models for Small Chemical Science Data: The Case of Shear Viscosity [3.4761212729163313]
We train several machine learning models to predict the shear viscosity of a Lennard-Jones (LJ) fluid.
Specifically, we investigate issues related to model selection, performance estimation, and uncertainty quantification.
arXiv Detail & Related papers (2022-08-23T07:33:14Z)
- Machine Learning in Thermodynamics: Prediction of Activity Coefficients by Matrix Completion [34.7384528263504]
We propose a probabilistic matrix factorization model for predicting the activity coefficients in arbitrary binary mixtures.
Our method outperforms the state-of-the-art method that has been refined over three decades.
This opens perspectives to novel methods for predicting physico-chemical properties of binary mixtures.
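The mechanics behind matrix completion can be illustrated with a toy low-rank factorization: arrange binary-mixture properties as a solute-by-solvent matrix with missing entries, fit low-dimensional factors to the observed entries, and read off predictions for unobserved pairs. The sketch below uses plain SGD (the paper's model is probabilistic Bayesian matrix factorization, which this does not reproduce) and invented data:

```python
import random

# Toy matrix completion by low-rank factorization with SGD. Rows and columns
# would correspond to solutes and solvents, entries to a mixture property
# such as ln(gamma). Data and hyperparameters are made up for illustration.

def factorize(observed, n_rows, n_cols, rank=2, lr=0.05, reg=0.01,
              epochs=2000, seed=0):
    """observed: dict {(i, j): value} of known matrix entries."""
    rng = random.Random(seed)
    u = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    v = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(epochs):
        for (i, j), val in observed.items():
            pred = sum(u[i][k] * v[j][k] for k in range(rank))
            err = val - pred
            for k in range(rank):
                # Simultaneous regularized gradient step on both factors.
                u[i][k], v[j][k] = (u[i][k] + lr * (err * v[j][k] - reg * u[i][k]),
                                    v[j][k] + lr * (err * u[i][k] - reg * v[j][k]))
    return u, v

# Three known entries of a tiny 2x2 matrix; entry (1, 1) is unobserved.
obs = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0}
u, v = factorize(obs, n_rows=2, n_cols=2)
pred_11 = sum(u[1][k] * v[1][k] for k in range(2))  # completed entry
```

The completed entry is whatever the learned low-rank structure implies; this shared-factor structure is what lets information from observed mixtures transfer to unmeasured ones.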
arXiv Detail & Related papers (2020-01-29T03:16:23Z)
- Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond [69.83813153444115]
We consider an efficient estimating equation for the (local) quantile treatment effect ((L)QTE) in causal inference.
Debiased machine learning (DML) is a data-splitting approach to estimating high-dimensional nuisances.
We propose localized debiased machine learning (LDML), which avoids this burdensome step.
arXiv Detail & Related papers (2019-12-30T14:42:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.