Why Physics Still Matters: Improving Machine Learning Prediction of Material Properties with Phonon-Informed Datasets
- URL: http://arxiv.org/abs/2511.15222v1
- Date: Wed, 19 Nov 2025 08:16:10 GMT
- Title: Why Physics Still Matters: Improving Machine Learning Prediction of Material Properties with Phonon-Informed Datasets
- Authors: Pol Benítez, Cibrán López, Edgardo Saucedo, Teruyasu Mizoguchi, Claudio Cazorla
- Abstract summary: We assess the effectiveness of graph neural network (GNN) models trained on two fundamentally different types of datasets. As a case study, we address the challenging task of predicting electronic and mechanical properties of a prototypical family of optoelectronic materials. We find that the phonon-informed model consistently outperforms the randomly trained counterpart, despite relying on fewer data points.
- Score: 0.32622301272834514
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning (ML) methods have become powerful tools for predicting material properties with near first-principles accuracy and vastly reduced computational cost. However, the performance of ML models critically depends on the quality, size, and diversity of the training dataset. In materials science, this dependence is particularly important for learning from low-symmetry atomistic configurations that capture thermal excitations, structural defects, and chemical disorder, features that are ubiquitous in real materials but underrepresented in most datasets. The absence of systematic strategies for generating representative training data may therefore limit the predictive power of ML models in technologically critical fields such as energy conversion and photonics. In this work, we assess the effectiveness of graph neural network (GNN) models trained on two fundamentally different types of datasets: one composed of randomly generated atomic configurations and another constructed using physically informed sampling based on lattice vibrations. As a case study, we address the challenging task of predicting electronic and mechanical properties of a prototypical family of optoelectronic materials under realistic finite-temperature conditions. We find that the phonon-informed model consistently outperforms the randomly trained counterpart, despite relying on fewer data points. Explainability analyses further reveal that high-performing models assign greater weight to chemically meaningful bonds that control property variations, underscoring the importance of physically guided data generation. Overall, this work demonstrates that larger datasets do not necessarily yield better GNN predictive models and introduces a simple and general strategy for efficiently constructing high-quality training data in materials informatics.
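The phonon-informed sampling idea described in the abstract can be illustrated with a minimal toy sketch (this is not the authors' code; the 1D chain, mode frequencies, and temperature below are illustrative assumptions). Random sampling displaces every atom independently, whereas phonon-informed sampling draws mode amplitudes from a harmonic thermal distribution, so soft low-frequency modes dominate, as they do in a real material at finite temperature:

```python
import numpy as np

# Toy comparison: phonon-informed vs. random displacement sampling for a
# 1D chain of N atoms with fixed ends (arbitrary units throughout).
rng = np.random.default_rng(0)
N = 8        # atoms in the toy chain (assumption)
kT = 0.05    # thermal energy, arbitrary units (assumption)

# Harmonic normal modes of the fixed-end chain:
# e_k(n) ∝ sin(pi*k*n/(N+1)),  omega_k = 2*sin(pi*k/(2*(N+1))).
k = np.arange(1, N + 1)
omega = 2.0 * np.sin(np.pi * k / (2 * (N + 1)))
modes = np.sin(np.pi * np.outer(np.arange(1, N + 1), k) / (N + 1))
modes /= np.linalg.norm(modes, axis=0)   # orthonormal mode vectors (columns)

def phonon_informed_sample():
    """Draw each mode amplitude from the classical harmonic distribution,
    <Q_k^2> = kT / omega_k^2, then sum modes into real-space displacements."""
    Q = rng.normal(0.0, np.sqrt(kT) / omega)
    return modes @ Q

def random_sample(scale=0.3):
    """Baseline: displace every atom independently and uniformly."""
    return rng.uniform(-scale, scale, size=N)

u_phys = phonon_informed_sample()   # thermally weighted, soft-mode dominated
u_rand = random_sample()            # structureless random disorder
```

In the phonon-informed draw, the standard deviation of each mode amplitude scales as 1/omega_k, so low-frequency (soft) modes contribute most of the displacement, mimicking physically realistic thermal disorder; the random baseline has no such structure.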
Related papers
- Foundation Models for Discovery and Exploration in Chemical Space [57.97784111110166]
MIST is a family of molecular foundation models trained on large unlabeled datasets. We demonstrate the ability of these models to solve real-world problems across chemical space.
arXiv Detail & Related papers (2025-10-20T17:56:01Z)
- From Physics to Machine Learning and Back: Part II - Learning and Observational Bias in PHM [52.64097278841485]
Review examines how incorporating learning and observational biases through physics-informed modeling and data strategies can guide models toward physically consistent and reliable predictions. Fast adaptation methods including meta-learning and few-shot learning are reviewed alongside domain generalization techniques.
arXiv Detail & Related papers (2025-09-25T14:15:43Z)
- Surface Stability Modeling with Universal Machine Learning Interatomic Potentials: A Comprehensive Cleavage Energy Benchmarking Study [0.0]
Machine learning interatomic potentials (MLIPs) have revolutionized computational materials science. No systematic evaluation has assessed how well these universal MLIPs can predict cleavage energies. We present a benchmark of 19 state-of-the-art uMLIPs for cleavage energy prediction.
arXiv Detail & Related papers (2025-08-29T14:24:47Z)
- Computational, Data-Driven, and Physics-Informed Machine Learning Approaches for Microstructure Modeling in Metal Additive Manufacturing [0.0]
Metal additive manufacturing enables unprecedented design freedom and the production of customized, complex components. The rapid melting and solidification dynamics inherent to metal AM processes generate heterogeneous, non-equilibrium microstructures. Predicting microstructure and its evolution across spatial and temporal scales remains a central challenge for process optimization and defect mitigation.
arXiv Detail & Related papers (2025-05-02T17:59:54Z)
- Data Fusion of Deep Learned Molecular Embeddings for Property Prediction [41.99844472131922]
Data-driven approaches such as deep learning can result in predictive models for material properties with exceptional accuracy and efficiency. To improve predictions, techniques such as transfer learning and multitask learning have been used. Standard multitask models tend to underperform when trained on sparse data sets with weakly correlated properties. We demonstrate this technique on a widely used benchmark data set of quantum chemistry data for small molecules and a newly compiled sparse data set of experimental data collected from literature and our own quantum chemistry and thermochemical calculations.
arXiv Detail & Related papers (2025-04-09T21:40:15Z)
- Machine learning surrogate models of many-body dispersion interactions in polymer melts [40.83978401377059]
We introduce a machine learning surrogate model specifically designed to predict MBD forces in polymer melts. Our model is based on a trimmed SchNet architecture that selectively retains the most relevant atomic connections. Characterized by high computational efficiency, our surrogate model enables practical incorporation of MBD effects into large-scale molecular simulations.
arXiv Detail & Related papers (2025-03-19T12:15:35Z)
- Foundation Model for Composite Microstructures: Reconstruction, Stiffness, and Nonlinear Behavior Prediction [0.0]
We present the Material Masked Autoencoder (MMAE), a self-supervised Vision Transformer pretrained on a large corpus of short-fiber composite images. We demonstrate two key applications: (i) predicting homogenized stiffness components through fine-tuning on limited data, and (ii) inferring physically interpretable parameters by coupling MMAE with an interaction-based material network.
arXiv Detail & Related papers (2024-11-10T19:06:25Z)
- Pretraining Graph Transformers with Atom-in-a-Molecule Quantum Properties for Improved ADMET Modeling [38.53065398127086]
We evaluate the impact of pretraining Graph Transformer architectures on atom-level quantum-mechanical features.
We find that models pretrained on atomic quantum mechanical properties capture more low-frequency Laplacian eigenmodes.
arXiv Detail & Related papers (2024-10-10T15:20:30Z)
- Self-supervised learning for crystal property prediction via denoising [43.148818844265236]
We propose a novel self-supervised learning (SSL) strategy for material property prediction.
Our approach, crystal denoising self-supervised learning (CDSSL), pretrains predictive models with a pretext task based on recovering valid material structures.
We demonstrate that CDSSL models outperform models trained without SSL, across material types, properties, and dataset sizes.
arXiv Detail & Related papers (2024-08-30T12:53:40Z)
- Estimation of Electronic Band Gap Energy From Material Properties Using Machine Learning [0.0]
We present a machine learning-driven model capable of swiftly predicting material band gap energy.
Our model does not require any preliminary DFT-based calculation or knowledge of the structure of the material.
A new evaluation scheme for comparing the performance of ML-based models in material sciences is introduced.
arXiv Detail & Related papers (2024-03-08T07:32:28Z)
- Electronic Structure Prediction of Multi-million Atom Systems Through Uncertainty Quantification Enabled Transfer Learning [5.4875371069660925]
Ground state electron density -- obtainable using Kohn-Sham Density Functional Theory (KS-DFT) simulations -- contains a wealth of material information.
However, the computational expense of KS-DFT scales cubically with system size which tends to stymie training data generation.
Here, we address this fundamental challenge by employing transfer learning to leverage the multi-scale nature of the training data.
arXiv Detail & Related papers (2023-08-24T21:41:29Z)
- Synthetic pre-training for neural-network interatomic potentials [0.0]
We show that synthetic atomistic data, themselves obtained at scale with an existing machine learning potential, constitute a useful pre-training task for neural-network interatomic potential models.
Once pre-trained with a large synthetic dataset, these models can be fine-tuned on a much smaller, quantum-mechanical one, improving numerical accuracy and stability in computational practice.
arXiv Detail & Related papers (2023-07-24T17:16:24Z)
- Quantum-tailored machine-learning characterization of a superconducting qubit [50.591267188664666]
We develop an approach to characterize the dynamics of a quantum device and learn device parameters.
This approach outperforms physics-agnostic recurrent neural networks trained on numerically generated and experimental data.
This demonstration shows how leveraging domain knowledge improves the accuracy and efficiency of this characterization task.
arXiv Detail & Related papers (2021-06-24T15:58:57Z)
- On Energy-Based Models with Overparametrized Shallow Neural Networks [44.74000986284978]
Energy-based models (EBMs) are a powerful framework for generative modeling.
In this work we focus on shallow neural networks.
We show that models trained in the so-called "active" regime provide a statistical advantage over their associated "lazy" or kernel regime.
arXiv Detail & Related papers (2021-04-15T15:34:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.