An Active Learning-Based Streaming Pipeline for Reduced Data Training of Structure Finding Models in Neutron Diffractometry
- URL: http://arxiv.org/abs/2506.11100v1
- Date: Fri, 06 Jun 2025 15:48:22 GMT
- Title: An Active Learning-Based Streaming Pipeline for Reduced Data Training of Structure Finding Models in Neutron Diffractometry
- Authors: Tianle Wang, Jorge Ramirez, Cristina Garcia-Cardona, Thomas Proffen, Shantenu Jha, Sudip K. Seal,
- Abstract summary: We introduce a novel batch-mode active learning (AL) policy that uses uncertainty sampling to simulate training data drawn from a probability distribution.<n>We confirm its efficacy in training the same models with about 75% less training data while improving the accuracy.<n>We then discuss the design of an efficient stream-based training workflow that uses this AL policy and present a performance study on two heterogeneous platforms.
- Score: 1.3083205962260995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Structure determination workloads in neutron diffractometry are computationally expensive and routinely require several hours to many days to determine the structure of a material from its neutron diffraction patterns. The potential for machine learning models trained on simulated neutron scattering patterns to significantly speed up these tasks have been reported recently. However, the amount of simulated data needed to train these models grows exponentially with the number of structural parameters to be predicted and poses a significant computational challenge. To overcome this challenge, we introduce a novel batch-mode active learning (AL) policy that uses uncertainty sampling to simulate training data drawn from a probability distribution that prefers labelled examples about which the model is least certain. We confirm its efficacy in training the same models with about 75% less training data while improving the accuracy. We then discuss the design of an efficient stream-based training workflow that uses this AL policy and present a performance study on two heterogeneous platforms to demonstrate that, compared with a conventional training workflow, the streaming workflow delivers about 20% shorter training time without any loss of accuracy.
Related papers
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z) - Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z) - Diffusion-Model-Assisted Supervised Learning of Generative Models for
Density Estimation [10.793646707711442]
We present a framework for training generative models for density estimation.
We use the score-based diffusion model to generate labeled data.
Once the labeled data are generated, we can train a simple fully connected neural network to learn the generative model in the supervised manner.
arXiv Detail & Related papers (2023-10-22T23:56:19Z) - Stabilizing Subject Transfer in EEG Classification with Divergence
Estimation [17.924276728038304]
We propose several graphical models to describe an EEG classification task.
We identify statistical relationships that should hold true in an idealized training scenario.
We design regularization penalties to enforce these relationships in two stages.
arXiv Detail & Related papers (2023-10-12T23:06:52Z) - Alleviating the Effect of Data Imbalance on Adversarial Training [26.36714114672729]
We study adversarial training on datasets that obey the long-tailed distribution.
We propose a new adversarial training framework -- Re-balancing Adversarial Training (REAT)
arXiv Detail & Related papers (2023-07-14T07:01:48Z) - BOOT: Data-free Distillation of Denoising Diffusion Models with
Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT, that overcomes limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z) - A Physics-informed Diffusion Model for High-fidelity Flow Field
Reconstruction [0.0]
We propose a diffusion model which only uses high-fidelity data at training.
With different configurations, our model is able to reconstruct high-fidelity data from either a regular low-fidelity sample or a sparsely measured sample.
Our model can produce accurate reconstruction results for 2d turbulent flows based on different input sources without retraining.
arXiv Detail & Related papers (2022-11-26T23:14:18Z) - Towards Robust Dataset Learning [90.2590325441068]
We propose a principled, tri-level optimization to formulate the robust dataset learning problem.
Under an abstraction model that characterizes robust vs. non-robust features, the proposed method provably learns a robust dataset.
arXiv Detail & Related papers (2022-11-19T17:06:10Z) - Stabilizing Machine Learning Prediction of Dynamics: Noise and
Noise-inspired Regularization [58.720142291102135]
Recent has shown that machine learning (ML) models can be trained to accurately forecast the dynamics of chaotic dynamical systems.
In the absence of mitigating techniques, this technique can result in artificially rapid error growth, leading to inaccurate predictions and/or climate instability.
We introduce Linearized Multi-Noise Training (LMNT), a regularization technique that deterministically approximates the effect of many small, independent noise realizations added to the model input during training.
arXiv Detail & Related papers (2022-11-09T23:40:52Z) - Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low churn training against a number of recent baselines.
arXiv Detail & Related papers (2021-06-04T18:03:31Z) - Online Tensor-Based Learning for Multi-Way Data [1.0953917735844645]
A new efficient tensor-based feature extraction, named NeSGD, is proposed for online $CANDECOMP/PARAFAC$ decomposition.
Results show that the proposed methods significantly improved the classification error rates, were able to assimilate the changes in the positive data distribution over time, and maintained a high predictive accuracy in all case studies.
arXiv Detail & Related papers (2020-03-10T02:04:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.