Revisiting Checkpoint Averaging for Neural Machine Translation
- URL: http://arxiv.org/abs/2210.11803v1
- Date: Fri, 21 Oct 2022 08:29:23 GMT
- Title: Revisiting Checkpoint Averaging for Neural Machine Translation
- Authors: Yingbo Gao, Christian Herold, Zijian Yang, Hermann Ney
- Abstract summary: Checkpoint averaging is a simple and effective method to boost the performance of converged neural machine translation models.
In this work, we revisit the concept of checkpoint averaging and consider several extensions.
- Score: 44.37101354412253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Checkpoint averaging is a simple and effective method to boost the
performance of converged neural machine translation models. The calculation is
cheap to perform, and the fact that the translation improvement comes almost
for free makes it widely adopted in neural machine translation research.
Despite its popularity, the method simply takes the mean of the model
parameters from several checkpoints, the selection of which is mostly based on
empirical recipes without much justification. In this work, we revisit the
concept of checkpoint averaging and consider several extensions. Specifically,
we experiment with ideas such as using different checkpoint selection
strategies, calculating a weighted average instead of a simple mean, making use
of gradient information, and fine-tuning the interpolation weights on
development data. Our results confirm the necessity of applying checkpoint
averaging for optimal performance, but also suggest that the loss landscape
between the converged checkpoints is rather flat, and that little further
improvement over simple averaging is to be obtained.
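The core operation the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's code: checkpoints are stood in for by plain dicts mapping hypothetical parameter names to lists of floats (a real implementation would average framework tensors from saved state dicts), and the uniform-weight case corresponds to the standard simple mean while custom weights correspond to the weighted-average variant the paper experiments with.

```python
def average_checkpoints(checkpoints, weights=None):
    """Return the (weighted) mean of several parameter dicts.

    With weights=None this is the plain mean used in standard checkpoint
    averaging; passing explicit weights gives the weighted variant.
    """
    n = len(checkpoints)
    if weights is None:
        weights = [1.0 / n] * n  # simple mean: uniform interpolation weights
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    averaged = {}
    for name in checkpoints[0]:
        # Elementwise weighted sum across all checkpoints for this parameter.
        averaged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(len(checkpoints[0][name]))
        ]
    return averaged


# Three toy "checkpoints" of a single two-element parameter (hypothetical data).
ckpts = [{"layer.w": [1.0, 2.0]},
         {"layer.w": [3.0, 4.0]},
         {"layer.w": [5.0, 6.0]}]

simple = average_checkpoints(ckpts)
weighted = average_checkpoints(ckpts, weights=[0.5, 0.25, 0.25])
print(simple)
print(weighted)  # -> {'layer.w': [2.5, 3.5]}
```

Fine-tuning the interpolation weights on development data, as the paper explores, amounts to treating the `weights` vector as a small set of learnable parameters instead of fixing it by recipe.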
Related papers
- Parameter-Efficient Checkpoint Merging via Metrics-Weighted Averaging [2.9761595094633435]
Checkpoint merging is a technique for combining multiple model snapshots into a single superior model.
This paper explores checkpoint merging in the context of parameter-efficient fine-tuning.
We propose Metrics-Weighted Averaging (MWA) to merge model checkpoints by weighting their parameters according to performance metrics.
arXiv Detail & Related papers (2025-04-23T05:11:21Z)
- SeWA: Selective Weight Average via Probabilistic Masking [51.015724517293236]
We show that only a few points are needed to achieve better and faster convergence.
We transform the discrete selection problem into a continuous subset optimization framework.
We derive SeWA's stability bounds, which are sharper than those of prior approaches under both convex and non-convex settings.
arXiv Detail & Related papers (2025-02-14T12:35:21Z)
- FLOPS: Forward Learning with OPtimal Sampling [1.694989793927645]
Perturbation-based gradient computation methods have recently gained attention for learning with only forward passes, also referred to as queries.
Conventional forward learning consumes enormous queries on each data point for accurate gradient estimation through Monte Carlo sampling.
We propose to allocate the optimal number of queries over each data in one batch during training to achieve a good balance between estimation accuracy and computational efficiency.
arXiv Detail & Related papers (2024-10-08T12:16:12Z)
- Using Low-Discrepancy Points for Data Compression in Machine Learning: An Experimental Comparison [0.0]
We explore two methods based on low-discrepancy points to reduce large data sets in order to train neural networks.
The first is the method of Dick and Feischl, which relies on digital nets and an averaging procedure.
We construct a second method, which again uses digital nets, but Voronoi clustering instead of averaging.
arXiv Detail & Related papers (2024-07-10T08:07:55Z)
- Graspness Discovery in Clutters for Fast and Accurate Grasp Detection [57.81325062171676]
"graspness" is a quality based on geometry cues that distinguishes graspable areas in cluttered scenes.
We develop a neural network named cascaded graspness model to approximate the searching process.
Experiments on a large-scale benchmark, GraspNet-1Billion, show that our method outperforms previous arts by a large margin.
arXiv Detail & Related papers (2024-06-17T02:06:47Z)
- Boost Neural Networks by Checkpoints [9.411567653599358]
We propose a novel method to ensemble the checkpoints of deep neural networks (DNNs).
With the same training budget, our method achieves 4.16% lower error on CIFAR-100 and 6.96% lower error on Tiny-ImageNet with the ResNet-110 architecture.
arXiv Detail & Related papers (2021-10-03T09:14:15Z)
- Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework [59.578339075658995]
We propose a purely point-based framework for joint crowd counting and individual localization.
We design an intuitive solution under this framework, called the Point to Point Network (P2PNet).
arXiv Detail & Related papers (2021-07-27T11:41:50Z)
- Ranking Neural Checkpoints [57.27352551718646]
This paper is concerned with ranking pre-trained deep neural networks (DNNs) for the transfer learning to a downstream task.
We establish a neural checkpoint ranking benchmark (NeuCRaB) and study some intuitive ranking measures.
Our results suggest that the linear separability of the features extracted by the checkpoints is a strong indicator of transferability.
arXiv Detail & Related papers (2020-11-23T04:05:46Z)
- Sequential Changepoint Detection in Neural Networks with Checkpoints [11.763229353978321]
We introduce a framework for online changepoint detection and simultaneous model learning.
It is based on detecting changepoints across time by sequentially performing generalized likelihood ratio tests.
We show improved performance compared to online Bayesian changepoint detection.
arXiv Detail & Related papers (2020-10-06T21:49:54Z)
- Making Affine Correspondences Work in Camera Geometry Computation [62.7633180470428]
Local features provide region-to-region rather than point-to-point correspondences.
We propose guidelines for effective use of region-to-region matches in the course of a full model estimation pipeline.
Experiments show that affine solvers can achieve accuracy comparable to point-based solvers at faster run-times.
arXiv Detail & Related papers (2020-07-20T12:07:48Z)
- Learning a Unified Sample Weighting Network for Object Detection [113.98404690619982]
Region sampling or weighting is significantly important to the success of modern region-based object detectors.
We argue that sample weighting should be data-dependent and task-dependent.
We propose a unified sample weighting network to predict a sample's task weights.
arXiv Detail & Related papers (2020-06-11T16:19:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.