Related papers: A Markov Categorical Framework for Language Modeling

A Markov Categorical Framework for Language Modeling

URL: http://arxiv.org/abs/2507.19247v3
Date: Mon, 29 Sep 2025 15:08:06 GMT
Title: A Markov Categorical Framework for Language Modeling
Authors: Yifan Zhang,
Abstract summary: Autoregressive language models achieve remarkable performance, yet a unified theory explaining their internal mechanisms, how training shapes their representations, and enables complex behaviors, remains elusive.<n>We introduce a new analytical framework that models the single-step generation process as a composition of information-processing stages using the language of Markov categories.<n>This work presents a powerful new lens for understanding how information flows through a model and how the training objective shapes its internal geometry.
Score: 9.910562011343009
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive language models achieve remarkable performance, yet a unified theory explaining their internal mechanisms, how training shapes their representations, and enables complex behaviors, remains elusive. We introduce a new analytical framework that models the single-step generation process as a composition of information-processing stages using the language of Markov categories. This compositional perspective provides a unified mathematical language to connect three critical aspects of language modeling that are typically studied in isolation: the training objective, the geometry of the learned representation space, and practical model capabilities. First, our framework provides a precise information-theoretic rationale for the success of multi-token prediction methods like speculative decoding, quantifying the information surplus a model's hidden state contains about tokens beyond the immediate next one. Second, we clarify how the standard negative log-likelihood (NLL) objective compels the model to learn not just the next word, but also the data's intrinsic conditional uncertainty, a process we formalize using categorical entropy. Our central result shows that, under a linear-softmax head with bounded features, minimizing NLL induces spectral alignment: the learned representation space aligns with the eigenspectrum of a predictive similarity operator. This work presents a powerful new lens for understanding how information flows through a model and how the training objective shapes its internal geometry.

Related papers

Concept-Based Mechanistic Interpretability Using Structured Knowledge Graphs [3.429783703166407]
Our framework enables a global dissection of model behavior by analyzing how high-level semantic attributes emerge, interact, and propagate through internal model components.<n>A key innovation is our visualization platform that we named BAGEL, which presents these insights in a structured knowledge graph.<n>Our framework is model-agnostic, scalable, and contributes to a deeper understanding of how deep learning models generalize (or fail to) in the presence of dataset biases.
arXiv Detail & Related papers (2025-07-08T09:30:20Z)
Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures [49.19753720526998]
We derive theoretical scaling laws for neural network performance on synthetic datasets.<n>We validate that convolutional networks, whose structure aligns with that of the generative process through locality and weight sharing, enjoy a faster scaling of performance.<n>This finding clarifies the architectural biases underlying neural scaling laws and highlights how representation learning is shaped by the interaction between model architecture and the statistical properties of data.
arXiv Detail & Related papers (2025-05-11T17:44:14Z)
Hierarchical Insights: Exploiting Structural Similarities for Reliable 3D Semantic Segmentation [4.480310276450028]
We propose a training strategy for a 3D LiDAR semantic segmentation model that learns structural relationships between classes through abstraction. This is achieved by implicitly modeling these relationships using a learning rule for hierarchical multi-label classification (HMC) Our detailed analysis demonstrates that this training strategy not only improves the model's confidence calibration but also retains additional information useful for downstream tasks such as fusion, prediction, and planning.
arXiv Detail & Related papers (2024-04-09T08:49:01Z)
A Probabilistic Model Behind Self-Supervised Learning [53.64989127914936]
In self-supervised learning (SSL), representations are learned via an auxiliary task without annotated labels. We present a generative latent variable model for self-supervised learning. We show that several families of discriminative SSL, including contrastive methods, induce a comparable distribution over representations.
arXiv Detail & Related papers (2024-02-02T13:31:17Z)
Surprisal Driven $k$-NN for Robust and Interpretable Nonparametric Learning [1.4293924404819704]
We shed new light on the traditional nearest neighbors algorithm from the perspective of information theory. We propose a robust and interpretable framework for tasks such as classification, regression, density estimation, and anomaly detection using a single model. Our work showcases the architecture's versatility by achieving state-of-the-art results in classification and anomaly detection.
arXiv Detail & Related papers (2023-11-17T00:35:38Z)
Faithful Explanations of Black-box NLP Models Using LLM-generated Counterfactuals [67.64770842323966]
Causal explanations of predictions of NLP systems are essential to ensure safety and establish trust. Existing methods often fall short of explaining model predictions effectively or efficiently. We propose two approaches for counterfactual (CF) approximation.
arXiv Detail & Related papers (2023-10-01T07:31:04Z)
Explainability for Large Language Models: A Survey [59.67574757137078]
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing. This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
arXiv Detail & Related papers (2023-09-02T22:14:26Z)
Evaluating and Explaining Large Language Models for Code Using Syntactic Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code. At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes. We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z)
A Recursive Bateson-Inspired Model for the Generation of Semantic Formal Concepts from Spatial Sensory Data [77.34726150561087]
This paper presents a new symbolic-only method for the generation of hierarchical concept structures from complex sensory data. The approach is based on Bateson's notion of difference as the key to the genesis of an idea or a concept. The model is able to produce fairly rich yet human-readable conceptual representations without training.
arXiv Detail & Related papers (2023-07-16T15:59:13Z)
Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space. We demonstrate the broad applicability of this approach by adding it to both basic data-re (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z)
Constructing Word-Context-Coupled Space Aligned with Associative Knowledge Relations for Interpretable Language Modeling [0.0]
The black-box structure of the deep neural network in pre-trained language models seriously limits the interpretability of the language modeling process. A Word-Context-Coupled Space (W2CSpace) is proposed by introducing the alignment processing between uninterpretable neural representation and interpretable statistical logic. Our language model can achieve better performance and highly credible interpretable ability compared to related state-of-the-art methods.
arXiv Detail & Related papers (2023-05-19T09:26:02Z)
Learning with Explanation Constraints [91.23736536228485]
We provide a learning theoretic framework to analyze how explanations can improve the learning of our models. We demonstrate the benefits of our approach over a large array of synthetic and real-world experiments.
arXiv Detail & Related papers (2023-03-25T15:06:47Z)
Principled and Efficient Motif Finding for Structure Learning of Lifted Graphical Models [5.317624228510748]
Structure learning is a core problem in AI central to the fields of neuro-symbolic AI and statistical relational learning. We present the first principled approach for mining structural motifs in lifted graphical models. We show that we outperform state-of-the-art structure learning approaches by up to 6% in terms of accuracy and up to 80% in terms of runtime.
arXiv Detail & Related papers (2023-02-09T12:21:55Z)
Experimental Observations of the Topology of Convolutional Neural Network Activations [2.4235626091331737]
Topological data analysis provides compact, noise-robust representations of complex structures. Deep neural networks (DNNs) learn millions of parameters associated with a series of transformations defined by the model architecture. In this paper, we apply cutting edge techniques from TDA with the goal of gaining insight into the interpretability of convolutional neural networks used for image classification.
arXiv Detail & Related papers (2022-12-01T02:05:44Z)
Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP) What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining. How the model's world knowledge interacts with the factual information presented in the context remains under explored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models [39.479912987123214]
Self-supervised learning has exhibited a notable ability to solve a wide range of visual or language understanding tasks. We introduce Fusioner, with a lightweight, transformer-based fusion module, that pairs the frozen visual representation with language concept. We show that, the proposed fusion approach is effective to any pair of visual and language models, even those pre-trained on a corpus of uni-modal data.
arXiv Detail & Related papers (2022-10-27T02:57:26Z)
Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems. Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored. We present a formal treatment of retrieval-based models to characterize their generalization ability.
arXiv Detail & Related papers (2022-10-06T00:33:01Z)
Geometric and Topological Inference for Deep Representations of Complex Networks [13.173307471333619]
We present a class of statistics that emphasize the topology as well as the geometry of representations. We evaluate these statistics in terms of the sensitivity and specificity that they afford when used for model selection. These new methods enable brain and computer scientists to visualize the dynamic representational transformations learned by brains and models.
arXiv Detail & Related papers (2022-03-10T17:14:14Z)
Layer-wise Analysis of a Self-supervised Speech Representation Model [26.727775920272205]
Self-supervised learning approaches have been successful for pre-training speech representation models. Not much has been studied about the type or extent of information encoded in the pre-trained representations themselves.
arXiv Detail & Related papers (2021-07-10T02:13:25Z)
Model-agnostic multi-objective approach for the evolutionary discovery of mathematical models [55.41644538483948]
In modern data science, it is more interesting to understand the properties of the model, which parts could be replaced to obtain better results. We use multi-objective evolutionary optimization for composite data-driven model learning to obtain the algorithm's desired properties.
arXiv Detail & Related papers (2021-07-07T11:17:09Z)
From Canonical Correlation Analysis to Self-supervised Graph Neural Networks [99.44881722969046]
We introduce a conceptually simple yet effective model for self-supervised representation learning with graph data. We optimize an innovative feature-level objective inspired by classical Canonical Correlation Analysis. Our method performs competitively on seven public graph datasets.
arXiv Detail & Related papers (2021-06-23T15:55:47Z)
Understanding the Mechanics of SPIGOT: Surrogate Gradients for Latent Structure Learning [20.506232306308977]
Latent structure models are a powerful tool for modeling language data. One challenge with end-to-end training of these models is the argmax operation, which has null gradient. We explore latent structure learning through the angle of pulling back the downstream learning objective.
arXiv Detail & Related papers (2020-10-05T21:56:00Z)
Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking [63.49779304362376]
Graph neural networks (GNNs) have become a popular approach to integrating structural inductive biases into NLP models. We introduce a post-hoc method for interpreting the predictions of GNNs which identifies unnecessary edges. We show that we can drop a large proportion of edges without deteriorating the performance of the model.
arXiv Detail & Related papers (2020-10-01T17:51:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.