Probability Signature: Bridging Data Semantics and Embedding Structure in Language Models
- URL: http://arxiv.org/abs/2509.20124v1
- Date: Wed, 24 Sep 2025 13:49:44 GMT
- Title: Probability Signature: Bridging Data Semantics and Embedding Structure in Language Models
- Authors: Junjie Yao, Zhi-Qin John Xu
- Abstract summary: We propose a set of probability signatures that reflect the semantic relationships among tokens. We generalize our work to large language models (LLMs) by training the Qwen2.5 architecture on subsets of the Pile corpus.
- Score: 8.87728727154868
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The embedding space of language models is widely believed to capture semantic relationships; for instance, embeddings of digits often exhibit an ordered structure that corresponds to their natural sequence. However, the mechanisms driving the formation of such structures remain poorly understood. In this work, we interpret the embedding structures via the data distribution. We propose a set of probability signatures that reflect the semantic relationships among tokens. Through experiments on composite addition tasks using a linear model and a feedforward network, combined with theoretical analysis of gradient flow dynamics, we reveal that these probability signatures significantly influence the embedding structures. We further generalize our analysis to large language models (LLMs) by training the Qwen2.5 architecture on subsets of the Pile corpus. Our results show that the probability signatures are faithfully aligned with the embedding structures, particularly in capturing strong pairwise similarities among embeddings. Our work uncovers the mechanism by which the data distribution guides the formation of embedding structures, establishing a novel understanding of the relationship between embedding organization and semantic patterns.
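The abstract's central claim, that pairwise similarities among embeddings track probability-based signatures, can be illustrated with a minimal sketch. The signature definition used below (cosine similarity of next-token conditional distributions) and the toy `embeddings`/`cond_probs` arrays are assumptions for illustration only; the paper defines its own probability signatures, which this listing does not specify.

```python
# Hypothetical sketch: compare embedding similarity structure with a
# probability-based signature. The signature here (cosine similarity of
# next-token conditional distributions) is an illustrative assumption,
# not the paper's definition.
import numpy as np

def cosine_sim_matrix(M):
    """Pairwise cosine similarities between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    unit = M / np.clip(norms, 1e-12, None)
    return unit @ unit.T

# Toy inputs: embeddings for 10 tokens (dim 32) and their next-token
# conditional distributions over a 50-token vocabulary.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 32))            # placeholder embedding table
cond_probs = rng.dirichlet(np.ones(50), size=10)  # placeholder p(next | token)

emb_sim = cosine_sim_matrix(embeddings)  # embedding structure
sig_sim = cosine_sim_matrix(cond_probs)  # probability-signature structure

# Correlate the off-diagonal entries: a high correlation would indicate
# that the embedding geometry mirrors the probability signature.
mask = ~np.eye(10, dtype=bool)
alignment = np.corrcoef(emb_sim[mask], sig_sim[mask])[0, 1]
print(f"embedding/signature alignment: {alignment:.3f}")
```

On real data, `embeddings` would be a trained model's token embedding table and `cond_probs` would be estimated from corpus statistics; the comparison itself is just a correlation between two similarity structures.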
Related papers
- Deep networks learn to parse uniform-depth context-free languages from local statistics [12.183764229746926]
Understanding how the structure of language can be learned from sentences alone is a central question in both cognitive science and machine learning. We introduce a class of probabilistic context-free grammars (PCFGs) in which both the degree of ambiguity and the correlation structure across scales can be controlled. We propose a unifying framework where correlations at different scales lift local ambiguities, enabling the emergence of hierarchical representations of the data.
arXiv Detail & Related papers (2026-01-31T17:35:06Z) - From Static Structures to Ensembles: Studying and Harnessing Protein Structure Tokenization [15.864659611818661]
Protein structure tokenization converts 3D structures into discrete or vectorized representations. Despite many recent works on structure tokenization, the properties of the underlying discrete representations are not well understood. We show that the successful utilization of structural tokens in a language model for structure prediction depends on using rich, pre-trained sequence embeddings.
arXiv Detail & Related papers (2025-11-13T07:58:24Z) - Information Structure in Mappings: An Approach to Learning, Representation, and Generalisation [3.8073142980733]
This thesis introduces quantitative methods for identifying systematic structure in a mapping between spaces. I identify structural primitives present in a mapping, along with information-theoretic measures of each. I also introduce a novel, performant approach to estimating the entropy of a vector space, which allows this analysis to be applied to models ranging in size from 1 million to 12 billion parameters.
arXiv Detail & Related papers (2025-05-29T19:27:50Z) - Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures [49.19753720526998]
We derive theoretical scaling laws for neural network performance on synthetic datasets. We validate that convolutional networks, whose structure aligns with that of the generative process through locality and weight sharing, enjoy a faster scaling of performance. This finding clarifies the architectural biases underlying neural scaling laws and highlights how representation learning is shaped by the interaction between model architecture and the statistical properties of data.
arXiv Detail & Related papers (2025-05-11T17:44:14Z) - How Compositional Generalization and Creativity Improve as Diffusion Models are Trained [82.08869888944324]
How many samples do generative models need in order to learn composition rules? What signal in the data is exploited to learn those rules? We discuss connections between the hierarchical clustering mechanism we introduce here and the renormalization group in physics.
arXiv Detail & Related papers (2025-02-17T18:06:33Z) - Learning with Hidden Factorial Structure [2.474908349649168]
Recent advances suggest that text and image data contain hidden structures that help mitigate the curse of dimensionality. We present a controlled experimental framework to test whether neural networks can indeed exploit such "hidden factorial structures".
arXiv Detail & Related papers (2024-11-02T22:32:53Z) - Compositional Structures in Neural Embedding and Interaction Decompositions [101.40245125955306]
We describe a basic correspondence between linear algebraic structures within vector embeddings in artificial neural networks.
We introduce a characterization of compositional structures in terms of "interaction decompositions".
We establish necessary and sufficient conditions for the presence of such structures within the representations of a model.
arXiv Detail & Related papers (2024-07-12T02:39:50Z) - On Linearizing Structured Data in Encoder-Decoder Language Models: Insights from Text-to-SQL [8.57550491437633]
This work investigates the linear handling of structured data in encoder-decoder language models, specifically T5.
Our findings reveal the model's ability to mimic human-designed processes such as schema linking and syntax prediction.
We also uncover insights into the model's internal mechanisms, including the ego-centric nature of structure node encodings.
arXiv Detail & Related papers (2024-04-03T01:16:20Z) - Linear Spaces of Meanings: Compositional Structures in Vision-Language Models [110.00434385712786]
We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs).
We first present a framework for understanding compositional structures from a geometric perspective.
We then explain what these structures entail probabilistically in the case of VLM embeddings, providing intuitions for why they arise in practice.
arXiv Detail & Related papers (2023-02-28T08:11:56Z) - Autoregressive Structured Prediction with Language Models [73.11519625765301]
We describe an approach to model structures as sequences of actions in an autoregressive manner with PLMs.
Our approach achieves a new state of the art on all the structured prediction tasks we evaluated.
arXiv Detail & Related papers (2022-10-26T13:27:26Z) - On Neural Architecture Inductive Biases for Relational Tasks [76.18938462270503]
We introduce a simple architecture based on similarity-distribution scores, which we name CoRelNet (see the sketch after this list).
We find that simple architectural choices can outperform existing models in out-of-distribution generalization.
arXiv Detail & Related papers (2022-06-09T16:24:01Z)
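For the CoRelNet entry above, a minimal sketch of the similarity-distribution idea is given below: relations between objects are read off a softmax-normalized inner-product similarity matrix. The shapes and the absence of any downstream decision layer are illustrative assumptions, not the published architecture.

```python
# Minimal sketch of a similarity-distribution score, in the spirit of
# CoRelNet: relational structure is read off a softmax-normalized
# inner-product similarity matrix. Sizes are illustrative assumptions.
import numpy as np

def similarity_distribution(objects):
    """Row-wise softmax over the pairwise inner-product similarity matrix."""
    sims = objects @ objects.T                   # (n, n) pairwise similarities
    sims -= sims.max(axis=1, keepdims=True)      # numerical stability
    exp = np.exp(sims)
    return exp / exp.sum(axis=1, keepdims=True)  # each row is a distribution

rng = np.random.default_rng(1)
objects = rng.normal(size=(4, 16))  # four objects, 16-dim representations
A = similarity_distribution(objects)
print(A.round(3))  # relational structure a downstream layer would consume
```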
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.