Active Learning Strategies for Efficient Machine-Learned Interatomic Potentials Across Diverse Material Systems
- URL: http://arxiv.org/abs/2601.06916v1
- Date: Sun, 11 Jan 2026 13:52:28 GMT
- Title: Active Learning Strategies for Efficient Machine-Learned Interatomic Potentials Across Diverse Material Systems
- Authors: Mohammed Azeez Khan, Aaron D'Souza, Vijay Choyal,
- Abstract summary: We develop an active learning framework that iteratively selects informative training structures for machine-learned interatomic potentials. We show that intelligent data selection strategies can achieve target accuracy with 5-13% fewer labeled samples compared to random baselines. The entire pipeline executes on Google Colab in under 4 hours per system using less than 8 GB of RAM.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficient discovery of new materials demands strategies to reduce the number of costly first-principles calculations required to train predictive machine learning models. We develop and validate an active learning framework that iteratively selects informative training structures for machine-learned interatomic potentials (MLIPs) from large, heterogeneous materials databases, specifically the Materials Project and OQMD. Our framework integrates compositional and property-based descriptors with a neural network ensemble model, enabling real-time uncertainty quantification via Query-by-Committee. We systematically compare four selection strategies: random sampling (baseline), uncertainty-based sampling, diversity-based sampling (k-means clustering with farthest-point refinement), and a hybrid approach balancing both objectives. Experiments across four representative material systems (elemental carbon, silicon, iron, and a titanium-oxide compound) with 5 random seeds per configuration demonstrate that diversity sampling consistently achieves competitive or superior performance, with particularly strong advantages on complex systems like titanium-oxide (10.9% improvement, p=0.008). Our results show that intelligent data selection strategies can achieve target accuracy with 5-13% fewer labeled samples compared to random baselines. The entire pipeline executes on Google Colab in under 4 hours per system using less than 8 GB of RAM, thereby democratizing MLIP development for researchers globally with limited computational resources. Our open-source code and detailed experimental configurations are available on GitHub. This multi-system evaluation establishes practical guidelines for data-efficient MLIP training and highlights promising future directions including integration with symmetry-aware neural network architectures.
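The abstract combines two selection ideas: Query-by-Committee uncertainty from a model ensemble, and diversity selection with farthest-point refinement. A minimal sketch of one such round is below; the toy data, the bootstrap linear committee, and all names are illustrative assumptions, not the authors' implementation (which uses neural network ensembles and k-means clustering).

```python
# Sketch of one active-learning round: a bootstrap ensemble supplies
# Query-by-Committee uncertainty, and a farthest-point pass adds diversity.
# Toy linear data stands in for descriptor/energy pairs; all sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Candidate pool: 200 structures described by 8 features, with a hidden
# linear "formation energy" target plus noise.
X_pool = rng.normal(size=(200, 8))
w_true = rng.normal(size=8)
y_pool = X_pool @ w_true + 0.1 * rng.normal(size=200)

def committee_predictions(X_train, y_train, X, n_members=5):
    """Query-by-Committee: bootstrap an ensemble of linear fits."""
    preds = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X_train), len(X_train))
        w, *_ = np.linalg.lstsq(X_train[idx], y_train[idx], rcond=None)
        preds.append(X @ w)
    return np.stack(preds)            # (n_members, n_candidates)

def select_batch(X_cand, uncertainty, k=10):
    """Hybrid pick: seed with the most uncertain candidate, then grow the
    batch by farthest-point sampling in descriptor space."""
    chosen = [int(np.argmax(uncertainty))]
    dists = np.linalg.norm(X_cand - X_cand[chosen[0]], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dists))   # farthest from everything chosen so far
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X_cand - X_cand[nxt], axis=1))
    return chosen

# Label 20 seed structures, score the rest, select 10 more for labeling.
labeled = list(range(20))
unlabeled = list(range(20, 200))
preds = committee_predictions(X_pool[labeled], y_pool[labeled], X_pool[unlabeled])
uncertainty = preds.std(axis=0)       # committee disagreement per candidate
batch = [unlabeled[i] for i in select_batch(X_pool[unlabeled], uncertainty)]
print(len(batch), len(set(batch)))
```

In the paper's actual pipeline the committee is a neural network ensemble and the diversity step runs k-means with farthest-point refinement; the farthest-point loop above is the simplified common core of that idea.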
Related papers
- Towards Sample Efficient Entanglement Classification for 3 and 4 Qubit Systems: A Tailored CNN-BiLSTM Approach
We propose a hybrid neural network architecture integrating Convolutional and Bidirectional Long Short-Term Memory networks (CNN-BiLSTM). This design leverages CNNs for local feature extraction and BiLSTMs for sequential dependency modeling, enabling robust feature learning from minimal training data. When trained on only 100 samples, Architecture 2 maintains classification accuracies exceeding 90% for both 3-qubit and 4-qubit systems, with the loss converging within tens of epochs.
arXiv Detail & Related papers (2026-01-30T04:59:44Z)
- Private Training & Data Generation by Clustering Embeddings
Differential privacy (DP) provides a robust framework for protecting individual data. We introduce a novel principled method for DP synthetic image embedding generation. Empirically, a simple two-layer neural network trained on synthetically generated embeddings achieves state-of-the-art (SOTA) classification accuracy.
arXiv Detail & Related papers (2025-06-20T00:17:14Z) - MiniCPM4: Ultra-Efficient LLMs on End Devices [126.22958722174583]
MiniCPM4 is a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
arXiv Detail & Related papers (2025-06-09T16:16:50Z) - Heterogenous Memory Augmented Neural Networks [84.29338268789684]
We introduce a novel heterogeneous memory augmentation approach for neural networks.
By introducing learnable memory tokens with attention mechanism, we can effectively boost performance without huge computational overhead.
We show our approach on various image and graph-based tasks under both in-distribution (ID) and out-of-distribution (OOD) conditions.
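The core mechanism described in this entry, appending learnable memory tokens that the input attends to, can be sketched briefly. Everything below (shapes, names, the single attention head) is an illustrative assumption, not the paper's architecture.

```python
# Sketch of memory-augmented attention: a small bank of learnable memory
# tokens is concatenated to the keys/values so every input token can read
# from shared memory at modest extra cost. Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(1)
d, n_tokens, n_mem = 16, 10, 4

X = rng.normal(size=(n_tokens, d))        # input tokens
M = rng.normal(size=(n_mem, d)) * 0.02    # memory tokens (trainable parameters)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Queries come from the inputs; keys/values come from inputs plus memory,
# so attention cost grows only from O(n^2) to O(n*(n+m)).
KV = np.concatenate([X, M], axis=0)       # (n_tokens + n_mem, d)
attn = softmax(X @ KV.T / np.sqrt(d))     # (n_tokens, n_tokens + n_mem)
out = attn @ KV                           # (n_tokens, d)
print(out.shape)
```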
arXiv Detail & Related papers (2023-10-17T01:05:28Z)
- A robust synthetic data generation framework for machine learning in High-Resolution Transmission Electron Microscopy (HRTEM)
Construction Zone is a Python package for rapidly generating complex nanoscale atomic structures.
We develop an end-to-end workflow for creating large simulated databases for training neural networks.
Using our results, we are able to achieve state-of-the-art segmentation performance on experimental HRTEM images of nanoparticles.
arXiv Detail & Related papers (2023-09-12T10:44:15Z)
- Stochastic Configuration Machines for Industrial Artificial Intelligence
Stochastic configuration networks (SCNs) play a key role in industrial artificial intelligence (IAI).
This paper proposes a new randomized learner model, termed stochastic configuration machines (SCMs), to stress effective modelling and data size saving.
Experimental studies are carried out over some benchmark datasets and three industrial applications.
arXiv Detail & Related papers (2023-08-25T05:52:41Z)
- On the Interplay of Subset Selection and Informed Graph Neural Networks
This work focuses on predicting the atomization energy of molecules in the QM9 dataset.
We show how maximizing molecular diversity in the training set selection process increases the robustness of linear and nonlinear regression techniques.
We also check the reliability of the predictions made by the graph neural network with a model-agnostic explainer.
arXiv Detail & Related papers (2023-06-15T09:09:27Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
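The over-sampling idea this entry builds on can be shown compactly. The sketch below is plain SMOTE-style interpolation between minority-class neighbours, offered only as background for the technique; it is not AutoSMOTE's learned policy, and all names and sizes are assumptions.

```python
# SMOTE-style over-sampling sketch: synthesize minority-class points by
# interpolating between a minority sample and its nearest minority
# neighbour. Toy 2-D data; not the AutoSMOTE algorithm itself.
import numpy as np

rng = np.random.default_rng(2)

# Minority class: only 8 points, versus (say) 50 in the majority class.
X_min = rng.normal(loc=3.0, size=(8, 2))

def smote_like(X_minority, n_new):
    """Generate n_new synthetic minority samples by line interpolation."""
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        # nearest neighbour of point i among the other minority points
        d = np.linalg.norm(X_minority - X_minority[i], axis=1)
        d[i] = np.inf
        j = int(np.argmin(d))
        lam = rng.random()             # random position along the segment
        synth.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(synth)

X_new = smote_like(X_min, n_new=42)    # rebalance the class to 50 samples
print(X_new.shape)
```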
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
- Batch-Ensemble Stochastic Neural Networks for Out-of-Distribution Detection
Out-of-distribution (OOD) detection has recently received much attention from the machine learning community due to its importance in deploying machine learning models in real-world applications.
In this paper we propose an uncertainty quantification approach by modelling the distribution of features.
We incorporate an efficient ensemble mechanism, namely batch-ensemble, to construct the batch-ensemble neural networks (BE-SNNs) and overcome the feature collapse problem.
We show that BE-SNNs yield superior performance on several OOD benchmarks, such as the Two-Moons dataset and the FashionMNIST vs MNIST dataset.
arXiv Detail & Related papers (2022-06-26T16:00:22Z)
- Solving Mixed Integer Programs Using Neural Networks
This paper applies learning to two key sub-tasks of a MIP solver: generating a high-quality joint variable assignment, and bounding the gap in objective value between that assignment and an optimal one.
Our approach constructs two corresponding neural network-based components, Neural Diving and Neural Branching, to use in a base MIP solver such as SCIP.
We evaluate our approach on six diverse real-world datasets, including two Google production datasets and MIPLIB, by training separate neural networks on each.
arXiv Detail & Related papers (2020-12-23T09:33:11Z)
- Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation
We propose a novel machine learning architecture that allows us to infuse a deep neural network with human-powered abstraction at the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.