The Quest of Finding the Antidote to Sparse Double Descent
- URL: http://arxiv.org/abs/2308.16596v1
- Date: Thu, 31 Aug 2023 09:56:40 GMT
- Title: The Quest of Finding the Antidote to Sparse Double Descent
- Authors: Victor Quétu and Marta Milovanović
- Abstract summary: As the model's sparsity increases, the performance first worsens, then improves, and finally deteriorates.
Such a non-monotonic behavior raises serious questions about the optimal model's size to maintain high performance.
We show that a simple $\ell_2$ regularization method can help to mitigate this phenomenon but sacrifices the performance/sparsity compromise.
- Score: 1.336445018915526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In energy-efficient schemes, finding the optimal size of deep learning models
is very important and has a broad impact. Meanwhile, recent studies have
reported an unexpected phenomenon, the sparse double descent: as the model's
sparsity increases, the performance first worsens, then improves, and finally
deteriorates. Such a non-monotonic behavior raises serious questions about the
optimal model's size to maintain high performance: the model needs to be
sufficiently over-parametrized, but having too many parameters wastes training
resources.
In this paper, we aim to find the best trade-off efficiently. More precisely,
we tackle the occurrence of the sparse double descent and present some
solutions to avoid it. Firstly, we show that a simple $\ell_2$ regularization
method can help to mitigate this phenomenon but sacrifices the
performance/sparsity compromise. To overcome this problem, we then introduce a
learning scheme in which distilling knowledge regularizes the student model.
Supported by experimental results achieved using typical image classification
setups, we show that this approach leads to the avoidance of such a phenomenon.
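To make the two ingredients above concrete, here is a minimal sketch, assuming a standard PyTorch image-classification setup, of a student loss that combines cross-entropy, an $\ell_2$ penalty, and a knowledge-distillation term from a frozen dense teacher. The function name, temperature, and weighting are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn.functional as F

def distillation_regularized_loss(student_logits, teacher_logits, targets,
                                  student_params, temperature=4.0,
                                  alpha=0.5, weight_decay=1e-4):
    """Cross-entropy + l2 penalty + knowledge-distillation term (illustrative)."""
    # Hard-label loss on the ground-truth classes.
    ce = F.cross_entropy(student_logits, targets)

    # l2 regularization over the (remaining) student weights.
    l2 = sum(p.pow(2).sum() for p in student_params)

    # Soft-label distillation: match the student to the teacher's softened
    # outputs; the T**2 factor keeps the soft-target gradients on a
    # comparable scale across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2

    return (1 - alpha) * ce + alpha * kd + weight_decay * l2
```

In an iterative magnitude-pruning experiment, a loss of this form would replace plain cross-entropy at each retraining round while the student is pruned to progressively higher sparsity, with the dense teacher kept frozen throughout.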
Related papers
- The Epochal Sawtooth Effect: Unveiling Training Loss Oscillations in Adam and Other Optimizers [8.770864706004472]
We identify and analyze a recurring training loss pattern, which we term the Epochal Sawtooth Effect (ESE).
This pattern is characterized by a sharp drop in loss at the beginning of each epoch, followed by a gradual increase, resulting in a sawtooth-shaped loss curve.
We provide an in-depth explanation of the underlying mechanisms that lead to the Epochal Sawtooth Effect.
arXiv Detail & Related papers (2024-10-14T00:51:21Z)
- Can we avoid Double Descent in Deep Neural Networks? [3.1473798197405944]
Double descent has caught the attention of the deep learning community.
It raises serious questions about the optimal model's size to maintain high generalization.
Our work shows that the double descent phenomenon is potentially avoidable with proper conditioning of the learning problem.
arXiv Detail & Related papers (2023-02-26T08:12:28Z)
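As a rough illustration of how claims like the one above are probed, the following toy sketch sweeps sparsity levels with the usual train-prune-retrain protocol and compares an unconditioned run against an $\ell_2$-conditioned one. It uses synthetic regression data and PyTorch's built-in magnitude pruning purely to show the structure of the experiment; it is not the setup of either paper.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Synthetic regression stand-in; real sparse-double-descent studies use
# CIFAR/ImageNet-style classifiers and far longer schedules.
torch.manual_seed(0)
x_train, y_train = torch.randn(512, 32), torch.randn(512, 1)
x_test, y_test = torch.randn(512, 32), torch.randn(512, 1)

def sweep(weight_decay):
    results = []
    for sparsity in (0.0, 0.5, 0.8, 0.9, 0.95, 0.99):
        model = nn.Sequential(nn.Linear(32, 512), nn.ReLU(), nn.Linear(512, 1))
        opt = torch.optim.SGD(model.parameters(), lr=1e-2,
                              weight_decay=weight_decay)

        def fit(steps):
            for _ in range(steps):
                opt.zero_grad()
                nn.functional.mse_loss(model(x_train), y_train).backward()
                opt.step()

        fit(200)                                    # dense pre-training
        layers = [(m, "weight") for m in model if isinstance(m, nn.Linear)]
        prune.global_unstructured(layers, pruning_method=prune.L1Unstructured,
                                  amount=sparsity)  # one-shot magnitude pruning
        fit(300)                                    # retrain surviving weights

        with torch.no_grad():
            results.append((sparsity,
                            nn.functional.mse_loss(model(x_test), y_test).item()))
    return results

# On image benchmarks the unconditioned curve can be non-monotonic in sparsity
# (sparse double descent); an l2 term tends to flatten that bump.
print(sweep(weight_decay=0.0))
print(sweep(weight_decay=1e-3))
```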
- Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints [81.46143788046892]
We focus on the task of controlling the level of sparsity when performing sparse learning.
Existing methods based on sparsity-inducing penalties involve expensive trial-and-error tuning of the penalty factor.
We propose a constrained formulation where sparsification is guided by the training objective and the desired sparsity target in an end-to-end fashion.
arXiv Detail & Related papers (2022-08-08T21:24:20Z)
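For the constrained formulation described in the entry above, the sketch below shows the general pattern of trading a hand-tuned penalty coefficient for a sparsity constraint enforced through a Lagrange multiplier updated by gradient ascent. The soft sigmoid gates and all hyperparameters are simplifications chosen for illustration, not the paper's exact stochastic-gate machinery.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy layer with learnable per-weight gate logits; expected density is the
# mean of sigmoid(gate_logits).
torch.manual_seed(0)
weight = nn.Parameter(torch.randn(256, 32) * 0.1)
gate_logits = nn.Parameter(torch.zeros(256, 32))
head = nn.Linear(256, 10)
lmbda = torch.zeros(())                  # Lagrange multiplier, kept >= 0

target_density = 0.10                    # i.e. 90% sparsity, set up front
x, y = torch.randn(512, 32), torch.randint(0, 10, (512,))

opt = torch.optim.Adam([weight, gate_logits, *head.parameters()], lr=1e-2)

for step in range(500):
    gates = torch.sigmoid(gate_logits)               # soft gates in [0, 1]
    features = F.relu(x @ (weight * gates).t())
    task_loss = F.cross_entropy(head(features), y)

    # Constraint: expected density <= target_density.
    violation = gates.mean() - target_density

    # Min-max game: descend on the Lagrangian in the model parameters...
    lagrangian = task_loss + lmbda * violation
    opt.zero_grad()
    lagrangian.backward()
    opt.step()

    # ...and ascend in the multiplier (projected to stay non-negative),
    # so no penalty coefficient has to be tuned by hand.
    with torch.no_grad():
        lmbda = torch.clamp(lmbda + 0.1 * violation.detach(), min=0.0)
```

The practitioner only specifies the sparsity target; the multiplier grows automatically while the constraint is violated and shrinks once it is satisfied.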
- Towards Bidirectional Arbitrary Image Rescaling: Joint Optimization and Cycle Idempotence [76.93002743194974]
We propose a method to treat arbitrary rescaling, both upscaling and downscaling, as one unified process.
The proposed model is able to learn upscaling and downscaling simultaneously and achieve bidirectional arbitrary image rescaling.
It is shown to be robust in cycle idempotence test, free of severe degradations in reconstruction accuracy when the downscaling-to-upscaling cycle is applied repetitively.
arXiv Detail & Related papers (2022-03-02T07:42:15Z)
- When in Doubt, Summon the Titans: Efficient Inference with Large Models [80.2673230098021]
We propose a two-stage framework based on distillation that realizes the modelling benefits of large models.
We use the large teacher models to guide the lightweight student models to only make correct predictions on a subset of "easy" examples.
Our proposed use of distillation to only handle easy instances allows for a more aggressive trade-off in the student size, thereby reducing the amortized cost of inference.
arXiv Detail & Related papers (2021-10-19T22:56:49Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- On the Role of Optimization in Double Descent: A Least Squares Study [30.44215064390409]
We show an excess risk bound for the gradient descent solution of the least squares objective.
We find that in case of noiseless regression, double descent is explained solely by optimization-related quantities.
We empirically explore if our predictions hold for neural networks.
arXiv Detail & Related papers (2021-07-27T09:13:11Z)
- Knowledge distillation: A good teacher is patient and consistent [71.14922743774864]
There is a growing discrepancy in computer vision between large-scale models that achieve state-of-the-art performance and models that are affordable in practical applications.
We identify certain implicit design choices, which may drastically affect the effectiveness of distillation.
We obtain a state-of-the-art ResNet-50 model for ImageNet, which achieves 82.8% top-1 accuracy.
arXiv Detail & Related papers (2021-06-09T17:20:40Z)
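One of the implicit design choices the entry above points to is consistency: the teacher should be queried on exactly the same augmented view the student trains on, combined with a long ("patient") schedule. Below is a minimal per-step sketch assuming a PyTorch pipeline; model, optimizer, and batch names are placeholders.

```python
import torch
import torch.nn.functional as F

def consistent_kd_step(teacher, student, optimizer, images, temperature=1.0):
    """One distillation step where teacher and student share the same view.

    `images` must already be the augmented batch (e.g. random crop + flip),
    so the teacher is queried on the identical pixels the student trains on,
    rather than on clean or separately augmented inputs.
    """
    teacher.eval()
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(images) / temperature, dim=1)

    student_log_probs = F.log_softmax(student(images) / temperature, dim=1)

    # Pure soft-label matching: no ground-truth labels are needed, which is
    # what makes very long schedules with heavy augmentation workable.
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the paper's terms, the 82.8% top-1 ResNet-50 result is obtained by running this kind of consistent function matching over a very long training schedule; the sketch only shows the per-step structure.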
- Efficient Iterative Amortized Inference for Learning Symmetric and Disentangled Multi-Object Representations [8.163697683448811]
We introduce EfficientMORL, an efficient framework for the unsupervised learning of object-centric representations.
We show that optimization challenges caused by requiring both symmetry and disentanglement can be addressed by high-cost iterative amortized inference.
We demonstrate strong object decomposition and disentanglement on the standard multi-object benchmark while achieving nearly an order of magnitude faster training and test time inference.
arXiv Detail & Related papers (2021-06-07T14:02:49Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)