Understanding LLM Behaviors via Compression: Data Generation, Knowledge Acquisition and Scaling Laws
- URL: http://arxiv.org/abs/2504.09597v5
- Date: Sat, 17 May 2025 15:36:54 GMT
- Title: Understanding LLM Behaviors via Compression: Data Generation, Knowledge Acquisition and Scaling Laws
- Authors: Zhixuan Pan, Shaowen Wang, Jian Li
- Abstract summary: We offer a detailed view of how Large Language Models acquire and store information across increasing model and data scales. Motivated by this theoretical perspective and natural assumptions inspired by Heaps' and Zipf's laws, we introduce a simplified yet representative hierarchical data-generation framework. Under the Bayesian setting, we show that prediction and compression within this model naturally lead to diverse learning and scaling behaviors.
- Score: 5.685201910521295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet principled explanations for their underlying mechanisms and for phenomena such as scaling laws and hallucinations remain elusive. In this work, we revisit the classical relationship between compression and prediction, grounded in Kolmogorov complexity and Shannon information theory, to provide deeper insights into LLM behaviors. By leveraging the Kolmogorov Structure Function and interpreting LLM compression as a two-part coding process, we offer a detailed view of how LLMs acquire and store information across increasing model and data scales -- from pervasive syntactic patterns to progressively rarer knowledge elements. Motivated by this theoretical perspective and natural assumptions inspired by Heaps' and Zipf's laws, we introduce a simplified yet representative hierarchical data-generation framework called the Syntax-Knowledge model. Under the Bayesian setting, we show that prediction and compression within this model naturally lead to diverse learning and scaling behaviors of LLMs. In particular, our theoretical analysis offers intuitive and principled explanations for both data and model scaling laws, the dynamics of knowledge acquisition during training and fine-tuning, and factual knowledge hallucinations in LLMs. The experimental results validate our theoretical predictions.
Related papers
- How and Why LLMs Generalize: A Fine-Grained Analysis of LLM Reasoning from Cognitive Behaviors to Low-Level Patterns [51.02752099869218]
Large Language Models (LLMs) display strikingly different generalization behaviors. We introduce a novel benchmark that decomposes reasoning into atomic core skills. We show that RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, whereas SFT models exhibit sharper drift and overfit to surface patterns.
arXiv Detail & Related papers (2025-12-30T08:16:20Z)
- Detailed balance in large language model-driven agents [1.2687030176231846]
Large language model (LLM)-driven agents are emerging as a powerful new paradigm for solving complex problems. This Letter proposes a method to estimate the underlying generative directionality of LLMs embedded within agents.
arXiv Detail & Related papers (2025-12-10T20:04:23Z)
- Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs [7.26032677670473]
Large language models (LLMs) have demonstrated remarkable capabilities in numerous real-world applications. How to open the black-box of LLMs from a theoretical standpoint has become a critical challenge. This paper takes the theory of rate-distortion function, directed information, and Granger causality as its starting point.
arXiv Detail & Related papers (2025-11-03T03:56:34Z)
- Large Language Models as Computable Approximations to Solomonoff Induction [11.811838796672369]
We establish the first formal connection between large language models (LLMs) and Algorithmic Information Theory (AIT). We leverage AIT to provide a unified theoretical explanation for in-context learning, few-shot learning, and scaling laws. Our framework bridges the gap between theoretical foundations and practical LLM behaviors, providing both explanatory power and actionable insights for future model development.
arXiv Detail & Related papers (2025-05-21T17:35:08Z)
- I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [79.01538178959726]
Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. We introduce a novel generative model that generates tokens on the basis of human interpretable concepts represented as latent discrete variables.
arXiv Detail & Related papers (2025-03-12T01:21:17Z)
- Thinking with Knowledge Graphs: Enhancing LLM Reasoning Through Structured Data [0.9284740716447338]
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. Recent research has shown promising results in leveraging knowledge graphs (KGs) to enhance LLM performance. We have developed different techniques that tightly integrate KG structures and semantics into LLM representations.
arXiv Detail & Related papers (2024-12-14T02:51:47Z)
- Large Language Models as Markov Chains [7.078696932669912]
We draw an equivalence between autoregressive transformer-based language models and Markov chains defined on a finite state space.
We relate the obtained results to the pathological behavior observed with LLMs.
Experiments with the most recent Llama and Gemma herds of models show that our theory correctly captures their behavior in practice.
arXiv Detail & Related papers (2024-10-03T17:45:31Z)
- Unveiling LLMs: The Evolution of Latent Representations in a Dynamic Knowledge Graph [15.129079475322637]
This work unveils the factual information a Large Language Model represents internally for sentence-level claim verification.
We propose an end-to-end framework to decode factual knowledge embedded in token representations from a vector space to a set of ground predicates.
Our framework employs activation patching, a vector-level technique that alters a token representation during inference, to extract encoded knowledge.
arXiv Detail & Related papers (2024-04-04T17:45:59Z)
- Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated by large language models (LLMs).
We suggest investigating internal activations and quantifying an LLM's truthfulness using the local intrinsic dimension (LID) of model activations.
arXiv Detail & Related papers (2024-02-28T04:56:21Z)
- Learning to Generate Explainable Stock Predictions using Self-Reflective Large Language Models [54.21695754082441]
We propose a framework to teach Large Language Models (LLMs) to generate explainable stock predictions.
A reflective agent learns how to explain past stock movements through self-reasoning, while the PPO trainer trains the model to generate the most likely explanations.
Our framework can outperform both traditional deep-learning and LLM methods in prediction accuracy and Matthews correlation coefficient.
arXiv Detail & Related papers (2024-02-06T03:18:58Z)
- From Understanding to Utilization: A Survey on Explainability for Large Language Models [27.295767173801426]
This survey underscores the imperative for increased explainability in Large Language Models (LLMs).
Our focus is primarily on pre-trained Transformer-based LLMs, which pose distinctive interpretability challenges due to their scale and complexity.
When considering the utilization of explainability, we explore several compelling methods that concentrate on model editing, control generation, and model enhancement.
arXiv Detail & Related papers (2024-01-23T16:09:53Z)
- Explanation-aware Soft Ensemble Empowers Large Language Model In-context Learning [50.00090601424348]
Large language models (LLMs) have shown remarkable capabilities in various natural language understanding tasks.
We propose EASE, an Explanation-Aware Soft Ensemble framework to empower in-context learning with LLMs.
arXiv Detail & Related papers (2023-11-13T06:13:38Z)
- Explainability for Large Language Models: A Survey [59.67574757137078]
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing.
This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
arXiv Detail & Related papers (2023-09-02T22:14:26Z)
- An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning [70.48605869773814]
Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information.
This study empirically evaluates the forgetting phenomenon in large language models during continual instruction tuning.
arXiv Detail & Related papers (2023-08-17T02:53:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.