Physics in Next-token Prediction
- URL: http://arxiv.org/abs/2411.00660v2
- Date: Sat, 16 Nov 2024 06:17:37 GMT
- Title: Physics in Next-token Prediction
- Authors: Hongjun An, Yiliang Song, Xuelong Li
- Abstract summary: We identified the law of information conservation within Next-token Prediction (NTP).
We proposed the First Law of Information Capacity (IC-1), demonstrating that the essence of intelligence emergence in auto-regressive models is a process of information transfer.
We also introduced Landauer's Principle into NTP, formulating the Second Law of Information Capacity (IC-2), which establishes the relationship between auto-regressive model training and energy consumption.
- Score: 46.953251684423165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We discovered the underlying physics in Next-token Prediction (NTP). We identified the law of information conservation within NTP and proposed the First Law of Information Capacity (IC-1), demonstrating that the essence of intelligence emergence in auto-regressive models is fundamentally a process of information transfer. We also introduced Landauer's Principle into NTP, formulating the Second Law of Information Capacity (IC-2), which establishes the relationship between auto-regressive model training and energy consumption. Additionally, we presented several corollaries, which hold practical significance for production practices. Finally, we demonstrate the consistency between the Law of Information Capacity and the Scaling Law for Neural Language Models, the Knowledge Capacity Scaling Laws, and the Scaling Laws for Precision.
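As background for IC-2 (the paper's exact formulation is not reproduced here), Landauer's Principle gives the textbook thermodynamic lower bound on the energy needed to erase, or irreversibly write, one bit of information; applied to training, it implies a temperature-dependent floor on the energy cost of transferring bits into model weights. A minimal statement, assuming only the standard form of the bound and treating the amount of transferred information I as an illustrative quantity:

```latex
% Landauer's principle: erasing (or irreversibly writing) one bit of
% information at temperature T dissipates at least k_B T ln 2 of energy.
E_{\mathrm{bit}} \;\ge\; k_B T \ln 2
% If training transfers I bits of information into the model (illustrative
% assumption), the dissipated energy is then bounded below by
E_{\mathrm{train}} \;\ge\; I \, k_B T \ln 2 .
```

At T ≈ 300 K this bound is about 2.9 × 10⁻²¹ J per bit, many orders of magnitude below what current hardware actually spends, so bounds of this kind are best read as fundamental limits rather than engineering estimates.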
Related papers
- Bayesian Neural Scaling Law Extrapolation with Prior-Data Fitted Networks [100.13335639780415]
Scaling laws often follow a power law, and several variants of power-law functions have been proposed to predict scaling behavior at larger scales.
Existing methods mostly rely on point estimation and do not quantify uncertainty, which is crucial for real-world applications.
In this work, we explore a Bayesian framework based on Prior-data Fitted Networks (PFNs) for neural scaling law extrapolation.
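To make the extrapolation setting concrete, here is a minimal sketch (not the Bayesian PFN method from the paper, just the classical point-estimate baseline it contrasts with) that fits a saturating power law L(N) = c + a·N^(-b) to loss-vs-scale measurements and reads parameter uncertainty off the fit covariance; the data points are invented for illustration.

```python
# Minimal sketch: fit a saturating power law to (scale, loss) points and
# extrapolate. Data are invented; this is a point-estimate fit, NOT the
# Bayesian PFN approach described in the paper.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # L(N) = c + a * N^(-b): irreducible loss c plus a power-law decay.
    return c + a * np.power(n, -b)

# Hypothetical measurements at small scales (model size N, eval loss L).
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
L = np.array([4.31, 3.83, 3.38, 3.09, 2.80])

popt, pcov = curve_fit(power_law, N, L, p0=[30.0, 0.25, 2.0], maxfev=10000)
perr = np.sqrt(np.diag(pcov))  # 1-sigma uncertainty of each fitted parameter

for name, val, err in zip("abc", popt, perr):
    print(f"{name} = {val:.3f} +/- {err:.3f}")

# Extrapolate to a scale far beyond any measured point.
print("predicted loss at N=1e10:", power_law(1e10, *popt))
```

The point estimate says nothing about how trustworthy the extrapolation is, which is exactly the gap the Bayesian treatment above targets.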
arXiv Detail & Related papers (2025-05-29T03:19:17Z)
- Scaling Laws for Precision [73.24325358259753]
We devise "precision-aware" scaling laws for both training and inference.
For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data.
For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions.
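The paper's fitted functional forms are not reproduced here; as a purely illustrative sketch, one way the literature reasons about this is to let low training precision shrink the effective parameter count inside a Chinchilla-style loss. The gating factor and all constants below are placeholders, not the paper's fitted values.

```python
# Illustrative only: a Chinchilla-style loss in which training precision
# scales the effective parameter count. Every constant is a placeholder;
# these are NOT the fitted values from "Scaling Laws for Precision".
import math

def loss(n_params, n_tokens, precision_bits,
         E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28, gamma=4.0):
    # Assumed form: effective params saturate as precision rises past ~gamma bits.
    n_eff = n_params * (1.0 - math.exp(-precision_bits / gamma))
    return E + A / n_eff**alpha + B / n_tokens**beta

for bits in (16, 8, 4, 2):
    print(bits, "bit:", round(loss(1e9, 2e10, bits), 4))
```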
arXiv Detail & Related papers (2024-11-07T00:10:10Z)
- The Foundations of Tokenization: Statistical and Computational Concerns [51.370165245628975]
Tokenization is a critical step in the NLP pipeline.
Despite its recognized importance as a standard representation method in NLP, the theoretical underpinnings of tokenization are not yet fully understood.
The present paper contributes to addressing this theoretical gap by proposing a unified formal framework for representing and analyzing tokenizer models.
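The paper's formal framework is not reproduced here; as a toy illustration of the objects such a framework has to model, the sketch below treats a tokenizer as a pair of maps (encode: text → token IDs, decode: token IDs → text) over an invented vocabulary and checks the round-trip property decode(encode(s)) == s that analyses of tokenizers typically care about.

```python
# Toy tokenizer: greedy longest-match over an invented vocabulary, with
# single characters as a fallback so encoding is total. Not the paper's
# formalism, just an illustration of encode/decode as a pair of maps.

VOCAB = ["next", "token", "pre", "diction", " ", "n", "e", "x", "t",
         "o", "k", "p", "r", "d", "i", "c"]
ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode(text: str) -> list[int]:
    ids, pos = [], 0
    while pos < len(text):
        # Greedy longest match against the vocabulary at this position.
        match = max((t for t in VOCAB if text.startswith(t, pos)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"untokenizable character: {text[pos]!r}")
        ids.append(ID[match])
        pos += len(match)
    return ids

def decode(ids: list[int]) -> str:
    return "".join(VOCAB[i] for i in ids)

s = "next token prediction"
ids = encode(s)
print(ids)
assert decode(ids) == s  # round-trip consistency: decode inverts encode
```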
arXiv Detail & Related papers (2024-07-16T11:12:28Z)
- The Information of Large Language Model Geometry [3.4003124816653143]
We conduct simulations to analyze the representation entropy and discover a power law relationship with model sizes.
We propose a theory based on (conditional) entropy to elucidate the scaling law phenomenon.
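The paper's exact estimator is not specified in the summary above; as a rough illustration, the sketch below measures representation entropy under a Gaussian approximation, H = ½·log det(2πe·Σ) for the covariance Σ of sampled hidden states, across hypothetical widths. Random vectors stand in for real model activations.

```python
# Rough illustration: estimate representation entropy under a Gaussian
# approximation, H = 0.5 * logdet(2*pi*e*Sigma), for hidden states of
# different widths. Random vectors stand in for real activations.
import numpy as np

def gaussian_entropy(hidden_states: np.ndarray) -> float:
    # hidden_states: (num_samples, width). Add a small ridge so the logdet
    # stays finite even when samples are fewer than dimensions.
    h = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    d = h.shape[1]
    cov = (h.T @ h) / max(h.shape[0] - 1, 1) + 1e-6 * np.eye(d)
    sign, logdet = np.linalg.slogdet(2 * np.pi * np.e * cov)
    return 0.5 * logdet

rng = np.random.default_rng(0)
for width in (64, 256, 1024):          # stand-ins for growing model size
    states = rng.standard_normal((4096, width))
    print(width, round(gaussian_entropy(states), 1))
```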
arXiv Detail & Related papers (2024-02-01T12:50:43Z)
- Harnessing the Power of Neural Operators with Automatically Encoded Conservation Laws [14.210553163356131]
We introduce conservation law-encoded neural operators (clawNOs).
ClawNOs are compliant with the most fundamental and ubiquitous conservation laws essential for correct physical consistency.
They significantly outperform the state-of-the-art NOs in learning efficacy, especially in small-data regimes.
arXiv Detail & Related papers (2023-12-18T13:21:49Z)
- Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets.
We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z) - Knowledge is Power: Understanding Causality Makes Legal judgment
Prediction Models More Generalizable and Robust [3.555105847974074]
Legal Judgment Prediction (LJP) serves as legal assistance to mitigate the great work burden of limited legal practitioners.
Most existing methods apply various large-scale pre-trained language models finetuned in LJP tasks to obtain consistent improvements.
We discover that the state-of-the-art (SOTA) model makes judgment predictions according to irrelevant (or non-causal) information.
arXiv Detail & Related papers (2022-11-06T07:03:31Z) - A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on near internet-scale quantities of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
Key findings are the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps.
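To make the random-feature picture concrete, here is a small numpy sketch (an illustration in the spirit of such models, not the paper's exact construction): data with a power-law feature spectrum are passed through a random nonlinear feature map, ridge regression is fit on top, and test error is reported as the number of random features grows.

```python
# Small illustration of a random-feature model trained on data with a
# power-law spectrum; test error typically falls roughly as a power law
# in the number of features. Not the paper's exact construction.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 200, 2000, 2000

# Latent data whose covariance eigenvalues decay as a power law.
spectrum = np.arange(1, d + 1, dtype=float) ** -1.5
X = rng.standard_normal((n_train + n_test, d)) * np.sqrt(spectrum)
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n_train + n_test)

def test_mse(n_features: int, ridge: float = 1e-3) -> float:
    W = rng.standard_normal((d, n_features)) / np.sqrt(d)
    Phi = np.maximum(X @ W, 0.0)                     # ReLU random features
    Ptr, Pte = Phi[:n_train], Phi[n_train:]
    ytr, yte = y[:n_train], y[n_train:]
    A = Ptr.T @ Ptr + ridge * np.eye(n_features)
    coef = np.linalg.solve(A, Ptr.T @ ytr)           # ridge regression
    return float(np.mean((Pte @ coef - yte) ** 2))

for m in (32, 128, 512, 2048):
    print(m, "features -> test MSE", round(test_mse(m), 4))
```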
arXiv Detail & Related papers (2022-10-30T15:13:18Z)
- InteL-VAEs: Adding Inductive Biases to Variational Auto-Encoders via Intermediary Latents [60.785317191131284]
We introduce a simple and effective method for learning VAEs with controllable biases by using an intermediary set of latent variables.
In particular, it allows us to impose desired properties like sparsity or clustering on learned representations.
We show that this, in turn, allows InteL-VAEs to learn both better generative models and representations.
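As a rough sketch of the idea (not the authors' implementation), the snippet below inserts a deterministic intermediary map between the sampled Gaussian latent and the decoder, here an invented soft-threshold gate that pushes small coordinates toward zero, so the decoder only ever sees latents with the desired (approximately sparse) structure. Layer sizes are placeholders.

```python
# Rough sketch of a VAE with an intermediary latent: z ~ q(z|x) is passed
# through a deterministic map g before decoding, so structure (here, an
# invented sparsity-style gate) is imposed on what the decoder sees.
# Layer sizes are placeholders; this is not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntermediaryVAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def intermediary(self, z):
        # g(z): soft-threshold gate yielding approximately sparse latents.
        return torch.sign(z) * F.relu(z.abs() - 0.5)

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparam.
        w = self.intermediary(z)
        x_hat = self.dec(w)
        recon = F.binary_cross_entropy_with_logits(x_hat, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return (recon + kl) / x.shape[0]

model = IntermediaryVAE()
x = torch.rand(32, 784)          # placeholder batch with values in [0, 1]
loss = model(x)
loss.backward()
print(float(loss))
```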
arXiv Detail & Related papers (2021-06-25T16:34:05Z)
- Parsimonious neural networks learn interpretable physical laws [77.34726150561087]
We propose parsimonious neural networks (PNNs) that combine neural networks with evolutionary optimization to find models that balance accuracy with parsimony.
The power and versatility of the approach are demonstrated by developing models for classical mechanics and for predicting the melting temperature of materials from fundamental properties.
arXiv Detail & Related papers (2020-05-08T16:15:47Z)