Related papers: Cutting Through the Noise: On-the-fly Outlier Detection for Robust Training of Machine Learning Interatomic Potentials

URL: http://arxiv.org/abs/2602.08849v1
Date: Mon, 09 Feb 2026 16:16:22 GMT
Title: Cutting Through the Noise: On-the-fly Outlier Detection for Robust Training of Machine Learning Interatomic Potentials
Authors: Terry C. W. Lam, Niamh O'Neill, Christoph Schran, Lars L. Schaaf,
Abstract summary: We introduce an on-the-fly outlier detection scheme that automatically down-weights noisy samples, without requiring additional reference calculations. We show that this approach prevents overfitting and matches the performance of iterative refinement baselines with significantly reduced overhead. We validate its scalability by training a foundation model for organic chemistry on the SPICE dataset, where it reduces energy errors by a factor of three.
Score: 0.6999740786886536
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The accuracy of machine learning interatomic potentials suffers from reference data that contains numerical noise. Often originating from unconverged or inconsistent electronic-structure calculations, this noise is challenging to identify. Existing mitigation strategies, such as manual filtering or iterative refinement of outliers, require either substantial expert effort or multiple expensive retraining cycles, making them difficult to scale to large datasets. Here, we introduce an on-the-fly outlier detection scheme that automatically down-weights noisy samples, without requiring additional reference calculations. By tracking the loss distribution via an exponential moving average, this unsupervised method identifies outliers throughout a single training run. We show that this approach prevents overfitting and matches the performance of iterative refinement baselines with significantly reduced overhead. The method's effectiveness is demonstrated by recovering accurate physical observables for liquid water from unconverged reference data, including diffusion coefficients. Furthermore, we validate its scalability by training a foundation model for organic chemistry on the SPICE dataset, where it reduces energy errors by a factor of three. This framework provides a simple, automated solution for training robust models on imperfect datasets across dataset sizes.
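
The idea of tracking the loss distribution with an exponential moving average and down-weighting outliers can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state the exact detection criterion, so the class name `EmaOutlierWeighter`, the mean-plus-`threshold`-standard-deviations cutoff, the `decay` value, and the hard 0/1 weighting are all assumptions chosen for illustration.

```python
import math


class EmaOutlierWeighter:
    """Hypothetical helper: tracks running loss statistics with an
    exponential moving average (EMA) and down-weights large losses."""

    def __init__(self, decay=0.9, threshold=3.0, outlier_weight=0.0):
        self.decay = decay                    # EMA decay factor (assumed value)
        self.threshold = threshold            # cutoff in standard deviations (assumed)
        self.outlier_weight = outlier_weight  # weight given to flagged samples
        self.mean = None                      # EMA of the inlier loss mean
        self.var = 0.0                        # EMA of the inlier loss variance

    def weights(self, losses):
        """Return one weight per sample loss, then update the statistics."""
        if self.mean is None:
            # No statistics yet: treat the whole first batch as inliers.
            is_outlier = [False] * len(losses)
        else:
            cutoff = self.mean + self.threshold * math.sqrt(self.var)
            is_outlier = [loss > cutoff for loss in losses]
        self._update([l for l, out in zip(losses, is_outlier) if not out])
        return [self.outlier_weight if out else 1.0 for out in is_outlier]

    def _update(self, inlier_losses):
        # Only samples currently treated as inliers feed the running statistics,
        # so a burst of noisy samples does not inflate the cutoff.
        if not inlier_losses:
            return
        n = len(inlier_losses)
        batch_mean = sum(inlier_losses) / n
        batch_var = sum((l - batch_mean) ** 2 for l in inlier_losses) / n
        if self.mean is None:
            self.mean, self.var = batch_mean, batch_var
        else:
            d = self.decay
            self.mean = d * self.mean + (1 - d) * batch_mean
            self.var = d * self.var + (1 - d) * batch_var


weighter = EmaOutlierWeighter()
first = weighter.weights([1.0, 1.1, 0.9, 1.0])   # warm-up batch
second = weighter.weights([1.0, 50.0, 1.05])     # 50.0 is far above the cutoff
```

In a training loop, the returned weights would multiply the per-sample losses before averaging, so flagged samples contribute little or nothing to the gradient. A practical version would likely need a warm-up period and a softer weighting function, neither of which is specified in the abstract.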

Related papers

Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles [16.678827833121602]
Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling. We formalize data curation as reweighting the sampling distribution and map its effect onto the eigenstructure of the data-induced operator.
arXiv Detail & Related papers (2025-12-02T04:36:13Z)
Z-Error Loss for Training Neural Networks [0.0]
Outliers introduce significant training challenges in neural networks by propagating erroneous gradients, which can degrade model performance and generalization. We propose the Z-Error Loss, a statistically principled approach that minimizes outlier influence during training by masking the contribution of data points identified as out-of-distribution within each batch.
arXiv Detail & Related papers (2025-06-02T18:35:30Z)
Time Series Similarity Score Functions to Monitor and Interact with the Training and Denoising Process of a Time Series Diffusion Model applied to a Human Activity Recognition Dataset based on IMUs [0.0]
Diffusion probabilistic models are able to generate synthetic sensor signals. The training process is controlled by a loss function which measures the difference between the noise that was added in the forward process and the noise that was predicted by the diffusion model. We examine multiple similarity metrics and adapt an existing metric to overcome this issue by monitoring the training and synthetisation process.
arXiv Detail & Related papers (2025-05-20T06:38:17Z)
DispFormer: A Pretrained Transformer Incorporating Physical Constraints for Dispersion Curve Inversion [56.64622091009756]
This study introduces DispFormer, a transformer-based neural network for $v_s$ profile inversion from Rayleigh-wave phase and group dispersion curves. DispFormer processes dispersion data independently at each period, allowing it to handle varying lengths without requiring network modifications or strict alignment between training and testing datasets.
arXiv Detail & Related papers (2025-01-08T09:08:24Z)
SINDER: Repairing the Singular Defects of DINOv2 [61.98878352956125]
Vision Transformer models trained on large-scale datasets often exhibit artifacts in the patch token they extract. We propose a novel fine-tuning smooth regularization that rectifies structural deficiencies using only a small dataset.
arXiv Detail & Related papers (2024-07-23T20:34:23Z)
EntropyStop: Unsupervised Deep Outlier Detection with Loss Entropy [19.154826741973277]
We propose a zero-label entropy metric named Loss Entropy for loss distribution, enabling us to infer optimal stopping points for training without labels. We also develop an automated early-stopping algorithm, EntropyStop, which halts training when loss entropy suggests the maximum model detection capability.
arXiv Detail & Related papers (2024-05-21T05:17:43Z)
Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution [62.71425232332837]
We show that training amortized models with noisy labels is inexpensive and surprisingly effective. This approach significantly accelerates several feature attribution and data valuation methods, often yielding an order of magnitude speedup over existing approaches.
arXiv Detail & Related papers (2024-01-29T03:42:37Z)
Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice. We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data. We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation [151.70234052015948]
We propose a novel approach that encourages the optimization algorithm to seek a flat trajectory. We show that the weights trained on synthetic data are robust against accumulated error perturbations when regularized towards the flat trajectory. Our method, called Flat Trajectory Distillation (FTD), is shown to boost the performance of gradient-matching methods by up to 4.7%.
arXiv Detail & Related papers (2022-11-20T15:49:11Z)
Self-Supervised Training with Autoencoders for Visual Anomaly Detection [61.62861063776813]
We focus on a specific use case in anomaly detection where the distribution of normal samples is supported by a lower-dimensional manifold. We adapt a self-supervised learning regime that exploits discriminative information during training but focuses on the submanifold of normal examples. We achieve a new state-of-the-art result on the MVTec AD dataset -- a challenging benchmark for visual anomaly detection in the manufacturing domain.
arXiv Detail & Related papers (2022-06-23T14:16:30Z)
Robust Training under Label Noise by Over-parameterization [41.03008228953627]
We propose a principled approach for robust training of over-parameterized deep networks in classification tasks where a proportion of training labels are corrupted. The main idea is yet very simple: label noise is sparse and incoherent with the network learned from clean data, so we model the noise and learn to separate it from the data. Remarkably, when trained using such a simple method in practice, we demonstrate state-of-the-art test accuracy against label noise on a variety of real datasets.
arXiv Detail & Related papers (2022-02-28T18:50:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.