Fast and Interpretable Machine Learning Modelling of Atmospheric Molecular Clusters
- URL: http://arxiv.org/abs/2509.11728v1
- Date: Mon, 15 Sep 2025 09:29:39 GMT
- Title: Fast and Interpretable Machine Learning Modelling of Atmospheric Molecular Clusters
- Authors: Lauri Seppäläinen, Jakub Kubečka, Jonas Elm, Kai Puolamäki
- Abstract summary: We show that simple $k$-NN models can rival more complex kernel ridge regression models in accuracy. Our $k$-NN models achieve near-chemical accuracy, scale seamlessly to datasets with over 250,000 entries, and even extrapolate to larger unseen clusters with minimal error.
- Score: 1.771601061061997
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Understanding how atmospheric molecular clusters form and grow is key to resolving one of the biggest uncertainties in climate modelling: the formation of new aerosol particles. While quantum chemistry offers accurate insights into these early-stage clusters, its steep computational cost limits large-scale exploration. In this work, we present a fast, interpretable, and surprisingly powerful alternative: the $k$-nearest neighbour ($k$-NN) regression model. By leveraging chemically informed distance metrics, including a kernel-induced metric and one learned via metric learning for kernel regression (MLKR), we show that simple $k$-NN models can rival more complex kernel ridge regression (KRR) models in accuracy, while reducing computational time by orders of magnitude. We perform this comparison with the well-established Faber-Christensen-Huang-Lilienfeld (FCHL19) molecular descriptor, but other descriptors (e.g., FCHL18, MBDF, and CM) can be shown to have similar performance. Applied to both simple organic molecules in the QM9 benchmark set and large datasets of atmospheric molecular clusters (sulphuric acid-water and sulphuric acid-multibase systems), our $k$-NN models achieve near-chemical accuracy, scale seamlessly to datasets with over 250,000 entries, and even appear to extrapolate to larger unseen clusters with minimal error (often nearing 1 kcal/mol). With built-in interpretability and straightforward uncertainty estimation, this work positions $k$-NN as a potent tool for accelerating discovery in atmospheric chemistry and beyond.
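The kernel-induced metric mentioned in the abstract is the distance a kernel implies on its feature space, $d(x,y) = \sqrt{k(x,x) + k(y,y) - 2k(x,y)}$, which $k$-NN can then use directly. A minimal, self-contained sketch of the idea, not the authors' implementation: it uses plain Python, a toy RBF kernel in place of an FCHL19-based kernel, and made-up descriptor vectors.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; stand-in for a molecular-descriptor kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def kernel_distance(x, y, gamma=0.5):
    """Kernel-induced metric: d(x, y)^2 = k(x,x) + k(y,y) - 2 k(x,y).
    For a normalised kernel k(x,x) = 1, so d = sqrt(2 - 2 k(x,y))."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * rbf_kernel(x, y, gamma)))

def knn_predict(X_train, y_train, x_query, k=3, gamma=0.5):
    """Inverse-distance-weighted k-NN regression under the kernel metric."""
    nearest = sorted(
        (kernel_distance(x, x_query, gamma), y)
        for x, y in zip(X_train, y_train)
    )[:k]
    weights = [1.0 / (d + 1e-12) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

def knn_uncertainty(X_train, y_train, x_query, k=3, gamma=0.5):
    """Cheap uncertainty proxy: standard deviation of the k neighbour labels."""
    nearest = sorted(
        (kernel_distance(x, x_query, gamma), y)
        for x, y in zip(X_train, y_train)
    )[:k]
    labels = [y for _, y in nearest]
    mean = sum(labels) / len(labels)
    return math.sqrt(sum((y - mean) ** 2 for y in labels) / len(labels))

# Toy usage: hypothetical descriptor vectors with binding energies (kcal/mol).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [-2.0, -3.0, -3.0, -4.0]
print(knn_predict(X, y, [0.5, 0.5], k=4))      # weighted neighbour average
print(knn_uncertainty(X, y, [0.5, 0.5], k=4))  # spread of neighbour labels
```

Because prediction is just a weighted average over neighbours, the model is interpretable for free: the $k$ nearest training clusters are the explanation, and the spread of their labels gives the uncertainty estimate the abstract alludes to.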
Related papers
- Coupled Cluster con MōLe: Molecular Orbital Learning for Neural Wavefunctions [34.53033480184221]
Density functional theory (DFT) is the most widely used method for calculating molecular properties. Coupled-cluster (CC) theory is the most successful method for achieving accuracy beyond DFT. We present the Molecular Orbital Learning (MōLe) architecture, an equivariant machine learning model that directly predicts CC's core mathematical objects.
arXiv Detail & Related papers (2026-02-23T18:54:46Z) - Foundation Models for Discovery and Exploration in Chemical Space [57.97784111110166]
MIST is a family of molecular foundation models trained on large unlabeled datasets. We demonstrate the ability of these models to solve real-world problems across chemical space.
arXiv Detail & Related papers (2025-10-20T17:56:01Z) - Aligned Manifold Property and Topology Point Clouds for Learning Molecular Properties [55.2480439325792]
This work introduces AMPTCR, a molecular surface representation that combines local quantum-derived scalar fields and custom topological descriptors within an aligned point cloud format. For molecular weight, results confirm that AMPTCR encodes physically meaningful data, with a validation R2 of 0.87. In the bacterial inhibition task, AMPTCR enables both classification and direct regression of E. coli inhibition values.
arXiv Detail & Related papers (2025-07-22T04:35:50Z) - DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra [60.39311767532607]
We present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs. Experiments on established benchmarks show that DiffMS outperforms existing models on de novo molecule generation.
arXiv Detail & Related papers (2025-02-13T18:29:48Z) - Leveraging Machine Learning to Overcome Limitations in Quantum Algorithms [0.0]
This work presents a hybrid framework combining Machine Learning (ML) techniques with quantum algorithms. Three datasets (chemical descriptors, Coulomb matrices, and a hybrid combination) were prepared using molecular features from PubChem. XGB achieved the lowest Relative Error (RE) of $4.41 \pm 11.18\%$ on chemical descriptors, outperforming RF ($5.56 \pm 11.66\%$) and LGBM ($5.32 \pm 12.87\%$).
arXiv Detail & Related papers (2024-12-16T03:14:14Z) - A Microstructure-based Graph Neural Network for Accelerating Multiscale Simulations [0.0]
We introduce an alternative surrogate modeling strategy that allows for keeping the multiscale nature of the problem.
We achieve this by predicting full-field microscopic strains using a graph neural network (GNN) while retaining the microscopic material model.
We demonstrate for several challenging scenarios that the surrogate can predict complex macroscopic stress-strain paths.
arXiv Detail & Related papers (2024-02-20T15:54:24Z) - Improving Molecular Properties Prediction Through Latent Space Fusion [9.912768918657354]
We present a multi-view approach that combines latent spaces derived from state-of-the-art chemical models.
Our approach relies on two pivotal elements: the embeddings derived from MHG-GNN, which represent molecular structures as graphs, and MoLFormer embeddings rooted in chemical language.
We demonstrate the superior performance of our proposed multi-view approach compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2023-10-20T20:29:32Z) - Gibbs-Helmholtz Graph Neural Network: capturing the temperature dependency of activity coefficients at infinite dilution [1.290382979353427]
We develop the Gibbs-Helmholtz Graph Neural Network (GH-GNN) model for predicting $\ln \gamma_{ij}^{\infty}$ of molecular systems at different temperatures.
We analyze the performance of GH-GNN for continuous and discrete inter/extrapolation and give indications for the model's applicability domain and expected accuracy.
arXiv Detail & Related papers (2022-12-02T14:25:58Z) - MolGraph: a Python package for the implementation of molecular graphs and graph neural networks with TensorFlow and Keras [51.92255321684027]
MolGraph is a graph neural network (GNN) package for molecular machine learning (ML).
MolGraph implements a chemistry module to accommodate the generation of small molecular graphs, which can be passed to a GNN algorithm to solve a molecular ML problem.
GNNs proved useful for molecular identification and improved interpretability of chromatographic retention time data.
arXiv Detail & Related papers (2022-08-21T18:37:41Z) - Accurate Machine Learned Quantum-Mechanical Force Fields for Biomolecular Simulations [51.68332623405432]
Molecular dynamics (MD) simulations allow atomistic insights into chemical and biological processes.
Recently, machine learned force fields (MLFFs) emerged as an alternative means to execute MD simulations.
This work proposes a general approach to constructing accurate MLFFs for large-scale molecular simulations.
arXiv Detail & Related papers (2022-05-17T13:08:28Z) - Chemical-Reaction-Aware Molecule Representation Learning [88.79052749877334]
We propose using chemical reactions to assist learning molecule representation.
Our approach is proven effective at 1) keeping the embedding space well-organized and 2) improving the generalization ability of molecule embeddings.
Experimental results demonstrate that our method achieves state-of-the-art performance in a variety of downstream tasks.
arXiv Detail & Related papers (2021-09-21T00:08:43Z) - Predicting molecular dipole moments by combining atomic partial charges and atomic dipoles [3.0980025155565376]
"MuML" models are fitted together to reproduce molecular $\boldsymbol{\mu}$ computed using high-level coupled-cluster theory.
We demonstrate that the uncertainty in the predictions can be estimated reliably using a calibrated committee model.
arXiv Detail & Related papers (2020-03-27T14:35:37Z) - Assessing Graph-based Deep Learning Models for Predicting Flash Point [52.931492216239995]
Graph-based deep learning (GBDL) models were implemented to predict flash point for the first time.
Average R2 and Mean Absolute Error (MAE) scores of MPNN are, respectively, 2.3% lower and 2.0 K higher than those of previous comparable studies.
arXiv Detail & Related papers (2020-02-26T06:10:12Z)
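The "calibrated committee model" mentioned in the dipole-moment entry above is, at heart, a bootstrap ensemble: refit the model on resampled training data and read the spread of the committee members' predictions as the uncertainty. A minimal sketch with a toy linear base model; the names `fit_linear` and `committee_predict` are illustrative, not from MuML.

```python
import random
import statistics

def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a*x + b to 1-D data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def committee_predict(xs, ys, x_query, n_members=10, seed=0):
    """Bootstrap committee: refit on resampled data; the spread of the
    members' predictions serves as the uncertainty estimate."""
    rng = random.Random(seed)
    preds = []
    while len(preds) < n_members:
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
        if len({xs[i] for i in idx}) < 2:
            continue  # skip degenerate resamples with zero x-variance
        a, b = fit_linear([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(a * x_query + b)
    return statistics.mean(preds), statistics.stdev(preds)

# Toy usage: exactly linear data yields a confident committee (near-zero spread).
mean, spread = committee_predict([0.0, 1.0, 2.0, 3.0, 4.0],
                                 [1.0, 3.0, 5.0, 7.0, 9.0], 2.5)
print(mean, spread)
```

Calibration, in the cited work's sense, then means scaling this raw spread against held-out errors so that the reported uncertainty matches the observed error distribution.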
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.