FLOP-Efficient Training: Early Stopping Based on Test-Time Compute Awareness
- URL: http://arxiv.org/abs/2601.01332v1
- Date: Sun, 04 Jan 2026 02:33:30 GMT
- Title: FLOP-Efficient Training: Early Stopping Based on Test-Time Compute Awareness
- Authors: Hossam Amer, Maryam Dialameh, Hossein Rajabzadeh, Walid Ahmed, Weiwei Zhang, Yang Liu,
- Abstract summary: Scaling training compute, measured in FLOPs, has long been shown to improve the accuracy of large language models. We introduce TTC-aware training, where an intermediate checkpoint and a corresponding TTC configuration can together match or exceed the accuracy of a fully trained model. Building on this insight, we propose an early stopping algorithm that jointly selects a checkpoint and TTC configuration to minimize training compute without sacrificing accuracy.
- Score: 5.2612663135589175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling training compute, measured in FLOPs, has long been shown to improve the accuracy of large language models, yet training remains resource-intensive. Prior work shows that increasing test-time compute (TTC), for example through iterative sampling, can allow smaller models to rival or surpass much larger ones at lower overall cost. We introduce TTC-aware training, where an intermediate checkpoint and a corresponding TTC configuration can together match or exceed the accuracy of a fully trained model while requiring substantially fewer training FLOPs. Building on this insight, we propose an early stopping algorithm that jointly selects a checkpoint and TTC configuration to minimize training compute without sacrificing accuracy. To make this practical, we develop an efficient TTC evaluation method that avoids exhaustive search, and we formalize a break-even bound that identifies when increased inference compute compensates for reduced training compute. Experiments demonstrate up to 92% reductions in training FLOPs while maintaining, and in some cases notably improving, accuracy. These results highlight a new perspective on balancing training and inference compute in model development, enabling faster deployment cycles and more frequent model refreshes. Code will be publicly released.
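The abstract outlines, but does not spell out, the joint selection procedure or the break-even bound. Below is a minimal sketch of how such a selection could look, assuming a finite set of already-evaluated (checkpoint, TTC configuration) candidates, per-query TTC FLOP costs, and an expected deployment query volume; all names and the form of the bound are illustrative assumptions, not the paper's algorithm.

```python
# Hypothetical sketch of TTC-aware early stopping: pick the cheapest
# (checkpoint, TTC config) pair that matches full-training accuracy and
# passes an amortized break-even check.
from dataclasses import dataclass

@dataclass
class Candidate:
    train_flops: float   # cumulative training FLOPs at this checkpoint
    ttc_flops: float     # extra inference FLOPs per query under this TTC config
    accuracy: float      # measured accuracy of the (checkpoint, TTC) pair

def break_even(saved_train_flops, extra_ttc_flops, expected_queries):
    """Assumed bound: early stopping pays off only if the inference overhead,
    amortized over the expected query volume, stays below the FLOPs saved."""
    return extra_ttc_flops * expected_queries <= saved_train_flops

def select(candidates, full_model, expected_queries):
    """Cheapest candidate matching the fully trained model's accuracy and
    satisfying the break-even bound; falls back to full training."""
    viable = [
        c for c in candidates
        if c.accuracy >= full_model.accuracy
        and break_even(full_model.train_flops - c.train_flops,
                       c.ttc_flops, expected_queries)
    ]
    return min(viable, key=lambda c: c.train_flops, default=full_model)
```

Under this reading, the break-even bound simply amortizes per-query inference overhead over the deployment lifetime; the paper's formal bound may account for further factors.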
Related papers
- When to Stop Federated Learning: Zero-Shot Generation of Synthetic Validation Data with Generative AI for Early Stopping [5.0740578889286105]
Federated Learning (FL) enables collaborative model training across decentralized devices. We introduce a zero-shot synthetic validation framework that leverages generative AI to monitor model performance. Our approach adaptively stops training near the optimal round, thereby conserving computational resources.
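As a rough illustration of the mechanism summarized above (the round-training and evaluation callables, synthetic validation set, and patience value are placeholders, not the paper's setup):

```python
# Sketch: stop federated training once accuracy on a synthetic validation
# set stops improving, so no real client data is needed for the decision.
def federated_early_stopping(train_round, evaluate, synth_val,
                             patience=5, max_rounds=200):
    best_acc, best_round, stale = 0.0, 0, 0
    for rnd in range(1, max_rounds + 1):
        model = train_round(rnd)             # one FL aggregation round
        acc = evaluate(model, synth_val)     # validated on synthetic data only
        if acc > best_acc:
            best_acc, best_round, stale = acc, rnd, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_round, best_acc
```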
arXiv Detail & Related papers (2025-11-14T12:07:32Z)
- Understanding the Role of Training Data in Test-Time Scaling [56.12341509545198]
We study the performance of test-time scaling for transformers trained on an in-context weight prediction task for linear regression. We show that training on a diverse, relevant, and hard set of tasks yields the best test-time scaling performance.
arXiv Detail & Related papers (2025-10-04T01:38:48Z)
- Instance-dependent Early Stopping [57.912273923450726]
We propose an Instance-dependent Early Stopping (IES) method that adapts the early stopping mechanism from the entire training set to the instance level. IES considers an instance as mastered if the second-order differences of its loss value remain within a small range around zero. IES can reduce the number of backpropagated instances by 10%-50% while maintaining or even slightly improving the test accuracy and transfer learning performance of a model.
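The mastery criterion above is concrete enough to sketch; a minimal version follows, with an illustrative history window and tolerance (both assumptions, not the paper's values):

```python
# Sketch of the IES mastery test: an instance whose per-instance loss curve
# has flattened (second-order differences near zero) is treated as mastered.
from collections import defaultdict, deque

loss_history = defaultdict(lambda: deque(maxlen=4))  # recent losses per instance

def is_mastered(instance_id, loss, eps=1e-3):
    h = loss_history[instance_id]
    h.append(loss)
    if len(h) < 3:
        return False
    # second-order differences over the recorded losses
    second_diffs = [h[i + 2] - 2 * h[i + 1] + h[i] for i in range(len(h) - 2)]
    return all(abs(d) < eps for d in second_diffs)
```

In a training loop, instances flagged as mastered would simply be excluded from the backward pass, which is where the 10%-50% backpropagation savings would come from.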
arXiv Detail & Related papers (2025-02-11T13:34:09Z)
- Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations [62.132347451049455]
Scale has become a main ingredient in obtaining strong machine learning models.
In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule.
We show that weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales.
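For reference, this kind of trajectory averaging is easy to bolt onto an existing training loop; a sketch using PyTorch's built-in SWA utilities (the paper's exact averaging scheme may differ):

```python
# Running uniform average of checkpoints along the training trajectory.
import torch
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Linear(10, 2)        # stand-in for the real network
avg_model = AveragedModel(model)      # maintains the average of snapshots

for step in range(1, 1001):
    ...                               # regular optimizer step on `model`
    if step % 100 == 0:               # fold in a snapshot every 100 steps
        avg_model.update_parameters(model)
```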
arXiv Detail & Related papers (2024-05-28T17:33:54Z)
- Better Schedules for Low Precision Training of Deep Neural Networks [13.88763215392452]
cyclic precision training (CPT) dynamically adjusts precision throughout training according to a cyclic schedule.
CPT achieves particularly impressive improvements in training efficiency, while actually improving DNN performance.
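A cyclic precision schedule itself is straightforward to write down; an illustrative cosine-shaped cycle between a low and a high bit-width (the schedule shapes compared in the paper may differ):

```python
# Illustrative cyclic precision schedule: the bit-width sweeps from min_bits
# up to max_bits and back over each cycle.
import math

def cyclic_precision(step, cycle_len=2000, min_bits=4, max_bits=8):
    phase = (step % cycle_len) / cycle_len            # position within the cycle
    cos = 0.5 * (1 - math.cos(2 * math.pi * phase))   # 0 -> 1 -> 0
    return round(min_bits + (max_bits - min_bits) * cos)
```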
arXiv Detail & Related papers (2024-03-04T17:33:39Z)
- Always-Sparse Training by Growing Connections with Guided Stochastic Exploration [43.26615926465987]
We propose an efficient always-sparse training algorithm with excellent scaling to larger and sparser models. We evaluate our method on CIFAR-10/100 and ImageNet using VGG and ViT models, and compare it against a range of sparsification methods.
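One way to read "growing connections with guided stochastic exploration" is a prune-and-grow step in which new connections are chosen greedily, by gradient magnitude, from a randomly sampled subset of inactive positions; the sketch below assumes this reading and is not the paper's implementation:

```python
# Assumed prune-and-grow step for a single sparse weight matrix.
import torch

@torch.no_grad()
def prune_and_grow(weight, grad, mask, k, sample_size):
    # prune: deactivate the k smallest-magnitude active weights
    active = mask.nonzero(as_tuple=False)
    scores = weight[mask.bool()].abs()
    drop = active[scores.topk(k, largest=False).indices]
    mask[drop[:, 0], drop[:, 1]] = 0
    weight[drop[:, 0], drop[:, 1]] = 0.0
    # grow: sample inactive positions, keep the k with largest gradients
    inactive = (mask == 0).nonzero(as_tuple=False)
    cand = inactive[torch.randperm(len(inactive))[:sample_size]]
    g = grad[cand[:, 0], cand[:, 1]].abs()
    grown = cand[g.topk(min(k, len(cand))).indices]
    mask[grown[:, 0], grown[:, 1]] = 1
    weight[grown[:, 0], grown[:, 1]] = 0.0   # new connections start at zero
    return mask
```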
arXiv Detail & Related papers (2024-01-12T21:32:04Z)
- Self-Supervised Pretraining Improves Self-Supervised Pretraining [83.1423204498361]
Self-supervised pretraining requires expensive and lengthy computation, large amounts of data, and is sensitive to data augmentation.
This paper explores Hierarchical PreTraining (HPT), which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model.
We show HPT converges up to 80x faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data.
arXiv Detail & Related papers (2021-03-23T17:37:51Z)
- Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
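A toy instantiation of the linearization idea: under linearized (NTK-style) dynamics with squared-error loss, each kernel eigenmode decays geometrically, so the loss curve has a closed form that can be scanned for a target value without running the fine-tuning. This is a textbook approximation, not the paper's estimator:

```python
# Predict steps-to-target-loss from linearized fine-tuning dynamics:
# residual components decay as (1 - lr * eigval)^t, so
# loss(t) = sum_i r_i^2 * (1 - lr * lambda_i)^(2t).
import numpy as np

def predict_steps(eigvals, residuals, lr, target_loss, max_steps=100_000):
    """eigvals: kernel eigenvalues; residuals: initial error components."""
    for t in range(max_steps):
        loss = np.sum(residuals**2 * (1 - lr * eigvals) ** (2 * t))
        if loss <= target_loss:
            return t
    return max_steps
```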
arXiv Detail & Related papers (2020-08-28T04:29:54Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)