Unveiling the Mystery of Weight in Large Foundation Models: Gaussian Distribution Never Fades
- URL: http://arxiv.org/abs/2501.10661v1
- Date: Sat, 18 Jan 2025 05:43:17 GMT
- Title: Unveiling the Mystery of Weight in Large Foundation Models: Gaussian Distribution Never Fades
- Authors: Chongjie Si, Jingjing Jiang, Wei Shen
- Abstract summary: This paper presents a pioneering exploration of the mechanisms underlying large foundation models' weights.
We find that their weights predominantly follow a Gaussian distribution, with occasional sharp, inverted T-shaped, or linear patterns.
We conclude that optimal weights should exhibit zero-mean, symmetry, and sparsity, with the sparse values following a truncated Gaussian distribution plus a few outliers.
- Score: 14.113021234825084
- License:
- Abstract: This paper presents a pioneering exploration of the mechanisms underlying large foundation models' (LFMs) weights, aiming to simplify AI research. Through extensive observation and analysis of prevailing LFMs, we find that regardless of initialization strategies, their weights predominantly follow a Gaussian distribution, with occasional sharp, inverted T-shaped, or linear patterns. We further discover that the weights share the i.i.d. properties of Gaussian noise, and explore their direct relationship. We find that transformation weights can be derived from Gaussian noise, and that they primarily serve to increase the standard deviation of pre-trained weights, with their standard deviation growing with layer depth. In other words, transformation weights broaden the acceptable deviation from the optimal weights, facilitating adaptation to downstream tasks. Building upon these conclusions, we thoroughly discuss the nature of optimal weights, ultimately concluding that they should exhibit zero-mean, symmetry, and sparsity, with the sparse values following a truncated Gaussian distribution plus a few outliers. Our experiments in LFM adaptation and editing demonstrate the effectiveness of these insights. We hope these findings can provide a foundational understanding to pave the way for future advancements in the LFM community.
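To make the abstract's central claim concrete, the following is a minimal sketch (not taken from the paper) of the kind of check it describes: fit a Gaussian to each weight matrix of a publicly available pretrained model and inspect how closely the empirical distribution matches a zero-mean, symmetric Gaussian. The use of the `transformers` and `scipy` packages and the public `gpt2` checkpoint is an assumption made for illustration.

```python
# Minimal sketch: inspect whether each 2-D weight matrix of a pretrained model
# looks roughly zero-mean Gaussian (skew ~ 0, excess kurtosis ~ 0).
import torch
from scipy import stats
from transformers import AutoModel

model = AutoModel.from_pretrained("gpt2")  # small public checkpoint (assumption)

for name, param in model.named_parameters():
    if param.ndim != 2:              # restrict to weight matrices
        continue
    w = param.detach().flatten().float().numpy()
    mu, sigma = w.mean(), w.std()
    skew = stats.skew(w)             # ~0 for a symmetric (Gaussian-like) distribution
    kurt = stats.kurtosis(w)         # excess kurtosis; ~0 for Gaussian, >0 for heavy tails
    print(f"{name}: mean={mu:+.4f} std={sigma:.4f} skew={skew:+.2f} ex.kurtosis={kurt:+.2f}")
```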
Related papers
- Revisiting Essential and Nonessential Settings of Evidential Deep Learning [70.82728812001807]
Evidential Deep Learning (EDL) is an emerging method for uncertainty estimation.
We propose Re-EDL, a simplified yet more effective variant of EDL.
arXiv Detail & Related papers (2024-10-01T04:27:07Z) - GIFT-SW: Gaussian noise Injected Fine-Tuning of Salient Weights for LLMs [51.02233412547456]
We introduce a novel PEFT method, Gaussian noise Injected Fine-Tuning of Salient Weights (GIFT-SW).
Our method updates only salient columns, while injecting Gaussian noise into non-salient ones.
Experiments with LLaMA models demonstrate that GIFT-SW outperforms full fine-tuning and modern PEFT methods under the same computational budget.
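As an illustration of the column-selection idea summarized above, here is a toy sketch (my own reading, not the GIFT-SW reference implementation): add Gaussian noise to the non-salient columns of a weight matrix and return the indices of the salient columns, which a fine-tuning loop would keep trainable. Using the per-column L2 norm as the saliency score is an assumption made for brevity.

```python
import torch

def gift_sw_like_split(weight: torch.Tensor, k: int, noise_std: float = 1e-3):
    """Return a copy of `weight` with Gaussian noise added to the non-salient columns,
    plus the indices of the k salient columns (which a fine-tuning loop would update)."""
    saliency = weight.norm(dim=0)                  # per-column L2 norm as a proxy score
    salient_idx = torch.topk(saliency, k).indices  # columns to keep trainable
    noisy = weight.clone()
    mask = torch.ones(weight.shape[1], dtype=torch.bool)
    mask[salient_idx] = False                      # mark non-salient columns
    noisy[:, mask] += noise_std * torch.randn_like(noisy[:, mask])
    return noisy, salient_idx

W = torch.randn(64, 128)
W_noisy, cols = gift_sw_like_split(W, k=16)
print(cols.shape, W_noisy.shape)
```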
arXiv Detail & Related papers (2024-08-27T14:41:14Z) - A Mean Field Ansatz for Zero-Shot Weight Transfer [9.910243630243079]
We introduce a mean field ansatz to provide a theoretical explanation for weight transfer.
We empirically validate the RC ansatz by exploring simple examples and LLMs such as GPT-3 and Llama-3.1.
We show the mean-field point of view is adequate under suitable assumptions which can provide theoretical support for zero-shot weight transfer.
arXiv Detail & Related papers (2024-08-16T11:53:52Z) - Theoretical Insights for Diffusion Guidance: A Case Study for Gaussian Mixture Models [59.331993845831946]
Diffusion models benefit from instillation of task-specific information into the score function to steer the sample generation towards desired properties.
This paper provides the first theoretical study towards understanding the influence of guidance on diffusion models in the context of Gaussian mixture models.
arXiv Detail & Related papers (2024-03-03T23:15:48Z) - Deep Out-of-Distribution Uncertainty Quantification via Weight Entropy Maximization [7.182234028830364]
This paper deals with uncertainty quantification and out-of-distribution detection in deep learning using Bayesian and ensemble methods.
Considering neural networks, a practical optimization is derived to build such a distribution, defined as a trade-off between the average empirical risk and the weight distribution entropy.
arXiv Detail & Related papers (2023-09-27T14:46:10Z) - Learning to Re-weight Examples with Optimal Transport for Imbalanced Classification [74.62203971625173]
Imbalanced data pose challenges for deep learning based classification models.
One of the most widely-used approaches for tackling imbalanced data is re-weighting.
We propose a novel re-weighting method based on optimal transport (OT) from a distributional point of view.
arXiv Detail & Related papers (2022-08-05T01:23:54Z) - Nonparametric mixture MLEs under Gaussian-smoothed optimal transport distance [0.39373541926236766]
We adapt the GOT framework instead of its unsmoothed counterpart to approximate the true data generating distribution.
A key step in our analysis is the establishment of a new Jackson-type approximation bound of Gaussian-convoluted Lipschitz functions.
This insight bridges existing techniques of analyzing the nonparametric MLEs and the new GOT framework.
arXiv Detail & Related papers (2021-12-04T20:05:58Z) - Deep Speaker Vector Normalization with Maximum Gaussianality Training [13.310988353839237]
A key problem with deep speaker embedding is that the resulting deep speaker vectors tend to be irregularly distributed.
In previous research, we proposed a deep normalization approach based on a new discriminative normalization flow (DNF) model.
Despite this remarkable success, we empirically found that the latent codes produced by the DNF model are generally neither homogeneous nor Gaussian.
We propose a new Maximum Gaussianality (MG) training approach that directly maximizes the Gaussianality of the latent codes.
arXiv Detail & Related papers (2020-10-30T09:42:06Z) - Neural networks with late-phase weights [66.72777753269658]
We show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning.
At the end of learning, we obtain back a single model by taking a spatial average in weight space.
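The final step described above, collapsing several late-phase weight copies back into one model via a spatial average in weight space, can be sketched as follows; this is an illustrative assumption of the mechanics, not the authors' code.

```python
import copy
import torch
import torch.nn as nn

def average_state_dicts(models):
    """Average the parameters of several models with identical architecture."""
    avg = copy.deepcopy(models[0].state_dict())
    for key in avg:
        avg[key] = torch.stack([m.state_dict()[key].float() for m in models]).mean(dim=0)
    return avg

members = [nn.Linear(10, 2) for _ in range(4)]   # stand-ins for late-phase ensemble members
merged = nn.Linear(10, 2)
merged.load_state_dict(average_state_dicts(members))
```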
arXiv Detail & Related papers (2020-07-25T13:23:37Z) - Bayesian Deep Learning and a Probabilistic Perspective of Generalization [56.69671152009899]
We show that deep ensembles provide an effective mechanism for approximate Bayesian marginalization.
We also propose a related approach that further improves the predictive distribution by marginalizing within basins of attraction.
arXiv Detail & Related papers (2020-02-20T15:13:27Z)
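A small sketch of the ensemble-as-marginalization view mentioned in the entry above (an illustrative assumption, not the paper's code): the approximate Bayesian predictive distribution is obtained by averaging the members' softmax outputs.

```python
import torch
import torch.nn as nn

# Five independently initialized members stand in for a trained deep ensemble.
ensemble = [nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3)) for _ in range(5)]
x = torch.randn(4, 8)                            # a batch of inputs
with torch.no_grad():
    probs = torch.stack([torch.softmax(m(x), dim=-1) for m in ensemble]).mean(dim=0)
print(probs.sum(dim=-1))                         # each row sums to 1
```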
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.