A Mean Field Ansatz for Zero-Shot Weight Transfer
- URL: http://arxiv.org/abs/2408.08681v1
- Date: Fri, 16 Aug 2024 11:53:52 GMT
- Title: A Mean Field Ansatz for Zero-Shot Weight Transfer
- Authors: Xingyuan Chen, Wenwei Kuang, Lei Deng, Wei Han, Bo Bai, Goncalo dos Reis
- Abstract summary: We introduce a mean field ansatz to provide a theoretical explanation for weight transfer.
We empirically validate the RC ansatz on simple MLP examples and on LLMs such as GPT-3 and Llama-3.1.
We show that the mean-field point of view is adequate under suitable assumptions, which provides theoretical support for zero-shot weight transfer.
- Score: 9.910243630243079
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The pre-training cost of large language models (LLMs) is prohibitive. One cutting-edge approach to reducing the cost is zero-shot weight transfer, also known as model growth in some cases, which transfers the weights trained in a small model to a large model. However, there are still some theoretical mysteries behind weight transfer. In this paper, inspired by prior applications of mean field theory to neural network dynamics, we introduce a mean field ansatz to provide a theoretical explanation for weight transfer. Specifically, we propose the row-column (RC) ansatz under the mean field point of view, which describes the measure structure of the weights in the neural network (NN) and admits a closed measure dynamic. Thus, the weights of NNs of different sizes admit a common distribution under proper assumptions, and weight transfer methods can be viewed as sampling methods. We empirically validate the RC ansatz on simple MLP examples and on LLMs such as GPT-3 and Llama-3.1. We show that the mean-field point of view is adequate under suitable assumptions, which provides theoretical support for zero-shot weight transfer.
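Under the RC ansatz, rows and columns of a weight matrix behave like samples from a common distribution, so growing a network amounts to drawing more samples from it. The sketch below illustrates this sampling view only; the `grow_weight` helper and its 1/sqrt(fan-in) rescaling are our own illustrative assumptions, not the authors' algorithm.

```python
# A minimal sketch (not the paper's method): if rows and columns of a trained
# weight matrix behave like samples from a common distribution, a larger
# matrix can be "grown" by resampling those rows/columns.
import numpy as np

rng = np.random.default_rng(0)

def grow_weight(W_small: np.ndarray, d_out: int, d_in: int) -> np.ndarray:
    """Grow an (m, n) weight matrix to (d_out, d_in) by resampling row and
    column indices with replacement (a sampling view of weight transfer)."""
    m, n = W_small.shape
    rows = rng.integers(0, m, size=d_out)   # resample row indices
    cols = rng.integers(0, n, size=d_in)    # resample column indices
    # Rescale so pre-activation variance is preserved as fan-in grows,
    # consistent with a mean-field 1/n scaling (an assumption here).
    return W_small[np.ix_(rows, cols)] * np.sqrt(n / d_in)

W_small = rng.normal(0, 1 / np.sqrt(64), size=(64, 64))  # stand-in for trained weights
W_large = grow_weight(W_small, 256, 256)
print(W_small.std(), W_large.std() * np.sqrt(256 / 64))  # comparable scales
```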
Related papers
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [95.32315448601241]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE), which combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.
Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
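To make the rotation idea concrete, here is a hedged, self-contained illustration (not RoSTE itself): quantizing a matrix after an orthogonal rotation spreads an outlier across coordinates, and the rotation folds back exactly because Q Q^T = I.

```python
# Rotating by an orthogonal Q spreads outliers across coordinates, which can
# shrink uniform-quantization error; Q is exactly invertible, so nothing is lost.
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits=4):
    """Symmetric uniform quantization with a per-tensor scale."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

d = 256
W = rng.normal(size=(d, d))
W[0, 0] = 50.0                                # inject a single large outlier
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix

err_plain = np.linalg.norm(W - quantize(W))
err_rot = np.linalg.norm(W - quantize(W @ Q) @ Q.T)  # rotate, quantize, rotate back
print(f"plain: {err_plain:.1f}  rotated: {err_rot:.1f}")  # rotated is typically smaller
```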
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
- Unveiling the Mystery of Weight in Large Foundation Models: Gaussian Distribution Never Fades [14.113021234825084]
This paper presents a pioneering exploration of the mechanisms underlying large foundation models' weights.
We find that their weights predominantly follow a Gaussian distribution, with occasional sharp, inverted T-shaped, or linear patterns.
We conclude that optimal weights should exhibit zero mean, symmetry, and sparsity, with the sparse values following a truncated Gaussian distribution plus a few outliers.
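As a hedged illustration of how such claims can be checked, the sketch below computes simple moment-based diagnostics (zero mean, symmetry via skewness, outliers via excess kurtosis) on a synthetic weight tensor; it is our own diagnostic, not the paper's code.

```python
# Moment-based checks for zero mean, symmetry, and outlier-carrying tails
# relative to a Gaussian, on a synthetic stand-in for a weight tensor.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=100_000)   # flattened "weights"
w[rng.integers(0, w.size, 50)] *= 20      # a few injected outliers

z = (w - w.mean()) / w.std()
skew = np.mean(z ** 3)                    # ~0 for a symmetric distribution
excess_kurt = np.mean(z ** 4) - 3.0       # ~0 for a Gaussian, >0 with outliers
outliers = np.mean(np.abs(z) > 4)         # Gaussian tail mass beyond 4 sigma ~ 6e-5
print(f"mean={w.mean():.2e} skew={skew:.3f} "
      f"excess_kurtosis={excess_kurt:.2f} frac>|4sigma|={outliers:.2e}")
```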
arXiv Detail & Related papers (2025-01-18T05:43:17Z)
- IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models [68.55148272295916]
We propose IntLoRA, to push the efficiency limits by using integer type (INT) low-rank parameters to adapt the quantized diffusion models.
IntLoRA offers three key advantages: (i) for fine-tuning, the pre-trained weights are quantized, reducing memory usage; (ii) for storage, both pre-trained and low-rank weights are in INT which consumes less disk space; (iii) for inference, IntLoRA weights can be naturally merged into quantized pre-trained weights through efficient integer multiplication or bit-shifting.
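A hedged sketch of advantage (iii): if the base weights and the low-rank update share an integer grid up to a power of two, merging needs only integer multiply-accumulate and a bit shift. The scale `s` and `shift` below are illustrative assumptions, not IntLoRA's actual kernels.

```python
# Merging an integer low-rank update into quantized base weights with pure
# integer arithmetic plus a bit shift; dequantize once at the end.
import numpy as np

rng = np.random.default_rng(0)
s = 0.01                                 # base quantization scale (assumption)
W_q = rng.integers(-128, 128, size=(64, 64), dtype=np.int32)  # INT8-range base
A = rng.integers(-8, 8, size=(64, 4), dtype=np.int32)         # INT low-rank factors
B = rng.integers(-8, 8, size=(4, 64), dtype=np.int32)
shift = 6                                # low-rank scale = s / 2**shift (assumption)

merged_q = W_q + ((A @ B) >> shift)      # integer multiply-accumulate + bit shift
W_merged = merged_q * s                  # single dequantization at the end
print(W_merged.shape, merged_q.dtype)    # (64, 64) int32: no float math in the merge
```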
arXiv Detail & Related papers (2024-10-29T05:50:17Z)
- Data-free Weight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.
We prune 80% of the parameters while retaining 93.43% of the original performance without any calibration data.
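The generic building block here is a rank-k SVD approximation, sketched below as a plain per-matrix truncation; the paper's joint scheme across matrices is not reproduced.

```python
# Data-free compression of a weight matrix via its best rank-k approximation.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))

def rank_k(W: np.ndarray, k: int) -> np.ndarray:
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k]    # best rank-k in Frobenius norm

W_k = rank_k(W, 64)
params_ratio = 64 * (512 + 512) / W.size         # stored factors vs. dense matrix
rel_err = np.linalg.norm(W - W_k) / np.linalg.norm(W)
print(f"params kept: {params_ratio:.0%}, relative error: {rel_err:.2f}")
```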
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
- Deep Out-of-Distribution Uncertainty Quantification via Weight Entropy Maximization [7.182234028830364]
This paper deals with uncertainty quantification and out-of-distribution detection in deep learning using Bayesian and ensemble methods.
For neural networks, a practical optimization problem is derived to build such a weight distribution, defined as a trade-off between the average empirical risk and the entropy of the weight distribution.
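A minimal sketch of that trade-off, assuming a mean-field Gaussian over weights so the entropy is closed-form, a toy quadratic risk, and a made-up coefficient `lam`:

```python
# Objective = average empirical risk - lam * entropy of the weight distribution,
# with a diagonal Gaussian over weights (entropy is closed-form).
import numpy as np

rng = np.random.default_rng(0)
mu = rng.normal(size=100)           # weight means
log_sigma = np.full(100, -2.0)      # weight log-std devs

def objective(mu, log_sigma, lam=1e-3, n_samples=8):
    # Average empirical risk over weights sampled from N(mu, sigma^2)...
    eps = rng.normal(size=(n_samples, mu.size))
    w = mu + np.exp(log_sigma) * eps
    risk = np.mean(np.sum(w ** 2, axis=1))        # toy quadratic risk
    # ...minus the entropy of the Gaussian, which the method maximizes:
    entropy = np.sum(0.5 * np.log(2 * np.pi * np.e) + log_sigma)
    return risk - lam * entropy

print(objective(mu, log_sigma))
```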
arXiv Detail & Related papers (2023-09-27T14:46:10Z)
- Probabilistic Weight Fixing: Large-scale training of neural network weight uncertainties for quantization [7.2282857478457805]
Weight-sharing quantization has emerged as a technique to reduce energy expenditure during inference in large neural networks.
This paper proposes a probabilistic framework based on Bayesian neural networks (BNNs) and a variational relaxation to identify which weights can be moved to which cluster centre.
Our method outperforms the state-of-the-art quantization method by 1.6% top-1 accuracy on ImageNet using DeiT-Tiny.
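The deterministic core of weight-sharing quantization can be sketched as 1-D k-means over the weights, with each weight snapped to its cluster centre; the paper's Bayesian machinery for deciding which weights may move is omitted here.

```python
# Weight-sharing quantization: cluster weights with 1-D k-means, then replace
# every weight with its cluster centre so only the small codebook is stored.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10_000)

def kmeans_1d(x, k=16, iters=25):
    centres = np.quantile(x, np.linspace(0, 1, k))  # spread initial centres
    for _ in range(iters):
        assign = np.argmin(np.abs(x[:, None] - centres[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centres[j] = x[assign == j].mean()
    return centres, assign

centres, assign = kmeans_1d(w)
w_quant = centres[assign]           # every weight shares one of 16 values
print(f"mse: {np.mean((w - w_quant) ** 2):.2e}, codebook size: {centres.size}")
```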
arXiv Detail & Related papers (2023-09-24T08:04:28Z)
- A Theoretical Explanation of Activation Sparsity through Flat Minima and Adversarial Robustness [29.87592869483743]
A recent empirical observation of activation sparsity in MLP blocks offers an opportunity to drastically reduce computation costs for free.
We propose the notion of gradient sparsity as one source of activation sparsity and give a theoretical explanation based on it.
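The phenomenon itself is easy to measure; the sketch below counts exact zeros after a ReLU in a randomly initialized MLP block (trained networks, per the paper, are far sparser than this ~50% baseline).

```python
# Measure activation sparsity: the fraction of post-ReLU activations that are
# exactly zero, which is the compute that sparsity-aware kernels can skip.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 768))                     # a batch of token embeddings
W1 = rng.normal(0, 768 ** -0.5, size=(768, 3072))  # MLP up-projection
h = np.maximum(x @ W1, 0.0)                        # ReLU hidden activations
print(f"fraction of zero activations: {np.mean(h == 0):.2%}")  # ~50% at init
```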
arXiv Detail & Related papers (2023-09-06T13:48:40Z)
- Adaptive Distribution Calibration for Few-Shot Learning with Hierarchical Optimal Transport [78.9167477093745]
We propose a novel distribution calibration method that learns an adaptive weight matrix between novel samples and base classes.
Experimental results on standard benchmarks demonstrate that our proposed plug-and-play model outperforms competing approaches.
arXiv Detail & Related papers (2022-10-09T02:32:57Z)
- Learning to Re-weight Examples with Optimal Transport for Imbalanced Classification [74.62203971625173]
Imbalanced data pose challenges for deep learning based classification models.
One of the most widely-used approaches for tackling imbalanced data is re-weighting.
We propose a novel re-weighting method based on optimal transport (OT) from a distributional point of view.
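As a toy, one-sided simplification of the OT view (not the paper's method): spread balanced mass over class prototypes, transport it to examples through a Gibbs kernel, and use the mass each example receives as its weight.

```python
# Distributional re-weighting sketch: examples near rare-class prototypes
# receive more of the balanced class mass, hence larger training weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))   # example features (imbalanced set)
P = rng.normal(size=(3, 16))     # class prototypes (illustrative)

C = ((X[:, None, :] - P[None, :, :]) ** 2).sum(-1)  # squared-distance cost
K = np.exp(-C / C.mean())                           # Gibbs kernel (entropic smoothing)
b = np.full(3, 1 / 3)                               # balanced mass per class
plan = K * (b / K.sum(axis=0))[None, :]             # scale columns to sum to b
weights = plan.sum(axis=1)                          # mass each example receives
weights /= weights.mean()                           # normalize to mean 1
print(weights.min(), weights.max())                 # rare-class neighbourhoods weigh more
```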
arXiv Detail & Related papers (2022-08-05T01:23:54Z)
- Neural networks with late-phase weights [66.72777753269658]
We show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning.
At the end of learning, we recover a single model by taking a spatial average in weight space.
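A minimal sketch of that final step, assuming K late-phase copies of the weights already exist (the training loop is elided): average them in weight space to obtain one model.

```python
# Average K late-phase weight copies in weight space to recover a single model.
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(128, 128))                        # shared early-phase weights
late_copies = [base + 0.01 * rng.normal(size=base.shape)  # K late-phase members,
               for _ in range(10)]                        # perturbed per member

final = np.mean(late_copies, axis=0)                      # spatial average in weight space
print(np.linalg.norm(final - base))                       # close to base, variance reduced
```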
arXiv Detail & Related papers (2020-07-25T13:23:37Z)