Vocabulary embeddings organize linguistic structure early in language model training
- URL: http://arxiv.org/abs/2510.07613v1
- Date: Wed, 08 Oct 2025 23:26:22 GMT
- Title: Vocabulary embeddings organize linguistic structure early in language model training
- Authors: Isabel Papadimitriou, Jacob Prince
- Abstract summary: Large language models (LLMs) work by manipulating the geometry of input embedding vectors over multiple layers. Here, we ask: how are the input vocabulary representations of language models structured, and how does this structure evolve over training? We run a suite of experiments that correlate the geometric structure of the input embeddings and output embeddings of two open-source models with semantic, syntactic, and frequency-based metrics over the course of training.
- Score: 3.2661767443292646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) work by manipulating the geometry of input embedding vectors over multiple layers. Here, we ask: how are the input vocabulary representations of language models structured, and how and when does this structure evolve over training? To answer this question, we use representational similarity analysis, running a suite of experiments that correlate the geometric structure of the input embeddings and output embeddings of two open-source models (Pythia 12B and OLMo 7B) with semantic, syntactic, and frequency-based metrics over the course of training. Our key findings are as follows: 1) During training, the vocabulary embedding geometry quickly converges to high correlations with a suite of semantic and syntactic features; 2) Embeddings of high-frequency and function words (e.g., "the," "of") converge to their final vectors faster than lexical and low-frequency words, which retain some alignment with the bias in their random initializations. These findings help map the dynamic trajectory by which input embeddings organize around linguistic structure, revealing distinct roles for word frequency and function. Our findings motivate a deeper study of how the evolution of vocabulary geometry may facilitate specific capability gains during model training.
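The core measurement in the abstract is representational similarity analysis (RSA) between embedding geometry and external linguistic metrics. The snippet below is a minimal sketch of that kind of analysis, not the authors' code: it assumes you already have a checkpoint's vocabulary embedding matrix and a precomputed feature-based similarity matrix (semantic, syntactic, or frequency-derived), and the checkpoint loader named in the trailing comment is a hypothetical placeholder.

```python
# Minimal RSA sketch (illustrative, not the paper's implementation): correlate the
# geometry of a checkpoint's vocabulary embeddings with a feature-based similarity matrix.
import numpy as np
from scipy.stats import spearmanr


def representational_similarity(emb: np.ndarray, feature_sim: np.ndarray) -> float:
    """Spearman correlation between two representational similarity matrices.

    emb:         (V, d) input- or output-embedding matrix for V vocabulary items.
    feature_sim: (V, V) similarity matrix from an external metric
                 (semantic, syntactic, or frequency-based).
    """
    # Cosine similarity between all pairs of embedding vectors.
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    emb_sim = normed @ normed.T

    # Compare only the upper triangles (the diagonal is trivially 1).
    iu = np.triu_indices(emb.shape[0], k=1)
    rho, _ = spearmanr(emb_sim[iu], feature_sim[iu])
    return float(rho)


# Hypothetical usage over training checkpoints (loader name is a placeholder):
# for step, emb in load_checkpoint_embeddings("pythia-12b"):
#     print(step, representational_similarity(emb, semantic_sim))
```

Tracking this correlation across checkpoints is how one would observe the early convergence of embedding geometry onto semantic and syntactic structure described in the findings.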
Related papers
- From Linear Input to Hierarchical Structure: Function Words as Statistical Cues for Language Learning [2.893006778402251]
We argue that function words play a crucial role in language acquisition due to their distinctive distributional properties. We show that language variants preserving all three properties are more easily acquired by neural learners.
arXiv Detail & Related papers (2026-01-29T02:42:12Z) - Evolution of Concepts in Language Model Pre-Training [53.994470178155105]
We track linear interpretable feature evolution across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point, while more complex patterns emerge in later training stages.
arXiv Detail & Related papers (2025-09-21T18:53:12Z) - Probing Internal Representations of Multi-Word Verbs in Large Language Models [0.0]
This study investigates the internal representations of verb-particle combinations, called multi-word verbs, within large language models (LLMs). We analyze the model's layer-wise representations for two different verb-particle constructions: phrasal verbs like 'give up' and prepositional verbs like 'look at'.
arXiv Detail & Related papers (2025-02-07T09:49:13Z) - Unsupervised Morphological Tree Tokenizer [36.584680344291556]
We introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words. Our method is capable of inducing character-level structures that align with morphological rules without annotated training data. Empirical results indicate that the proposed method effectively retains complete morphemes and outperforms widely adopted methods such as BPE and WordPiece.
arXiv Detail & Related papers (2024-06-21T15:35:49Z) - How to Plant Trees in Language Models: Data and Architectural Effects on
the Emergence of Syntactic Inductive Biases [28.58785395946639]
We show that pre-training can teach language models to rely on hierarchical syntactic features when performing tasks after fine-tuning.
We focus on architectural features (depth, width, and number of parameters), as well as the genre and size of the pre-training corpus.
arXiv Detail & Related papers (2023-05-31T14:38:14Z) - How Do Transformers Learn Topic Structure: Towards a Mechanistic
Understanding [56.222097640468306]
We provide a mechanistic understanding of how transformers learn "semantic structure".
We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
arXiv Detail & Related papers (2023-03-07T21:42:17Z) - Feature-rich multiplex lexical networks reveal mental strategies of
early language learning [0.7111443975103329]
We introduce FEature-Rich MUltiplex LEXical (FERMULEX) networks.
Similarities model heterogeneous word associations across semantic/syntactic/phonological aspects of knowledge.
Words are enriched with multi-dimensional feature embeddings including frequency, age of acquisition, length and polysemy.
arXiv Detail & Related papers (2022-01-13T16:44:51Z) - Syntactic Perturbations Reveal Representational Correlates of
Hierarchical Phrase Structure in Pretrained Language Models [22.43510769150502]
It is not entirely clear what aspects of sentence-level syntax are captured by vector-based language representations.
We show that Transformers build sensitivity to larger parts of the sentence along their layers, and that hierarchical phrase structure plays a role in this process.
arXiv Detail & Related papers (2021-04-15T16:30:31Z) - Prototypical Representation Learning for Relation Extraction [56.501332067073065]
This paper aims to learn predictive, interpretable, and robust relation representations from distantly-labeled data.
We learn prototypes for each relation from contextual information to best explore the intrinsic semantics of relations (an illustrative sketch of the prototype idea appears after this list).
Results on several relation learning tasks show that our model significantly outperforms the previous state-of-the-art relational models.
arXiv Detail & Related papers (2021-03-22T08:11:43Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z) - A Comparative Study on Structural and Semantic Properties of Sentence
Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z) - Exploiting Syntactic Structure for Better Language Modeling: A Syntactic
Distance Approach [78.77265671634454]
We make use of a multi-task objective, i.e., the models simultaneously predict words as well as ground truth parse trees in a form called "syntactic distances" (a minimal sketch of such an objective appears after this list).
Experimental results on the Penn Treebank and Chinese Treebank datasets show that when ground truth parse trees are provided as additional training signals, the model is able to achieve lower perplexity and induce trees with better quality.
arXiv Detail & Related papers (2020-05-12T15:35:00Z)
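The "Prototypical Representation Learning for Relation Extraction" entry above mentions learning a prototype per relation from contextual information, but the one-sentence summary does not specify the construction. The sketch below is therefore only a plausible illustration, under the assumption that a prototype is the normalized mean of instance embeddings and that classification picks the nearest prototype; it is not the cited paper's actual method.

```python
# Illustrative only: prototype-per-relation representations and nearest-prototype
# classification (an assumed reading of the idea, not the cited paper's implementation).
import numpy as np


def build_prototypes(instance_embs: dict) -> dict:
    """Map each relation to the normalized mean of its instance embeddings.

    instance_embs: {relation_name: (n_instances, d) array of contextual embeddings}
    """
    prototypes = {}
    for relation, embs in instance_embs.items():
        mean = embs.mean(axis=0)
        prototypes[relation] = mean / np.linalg.norm(mean)
    return prototypes


def nearest_prototype(query_emb: np.ndarray, prototypes: dict) -> str:
    """Return the relation whose prototype has the highest cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    return max(prototypes, key=lambda rel: float(q @ prototypes[rel]))
```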
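Similarly, the last entry describes a multi-task objective in which the model predicts the next word and also regresses onto parse-tree-derived "syntactic distances". A minimal PyTorch-style sketch of such a combined loss is given below; the weighting term alpha and the tensor shapes are assumptions for illustration, not the cited paper's exact formulation.

```python
# Illustrative multi-task objective: next-word cross-entropy plus regression onto
# gold "syntactic distances" (an assumed formulation, not the cited paper's exact loss).
import torch
import torch.nn.functional as F


def multitask_lm_loss(lm_logits: torch.Tensor,        # (batch, seq, vocab)
                      target_tokens: torch.Tensor,    # (batch, seq)
                      pred_distances: torch.Tensor,   # (batch, seq - 1)
                      gold_distances: torch.Tensor,   # (batch, seq - 1)
                      alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of language-modeling loss and syntactic-distance regression."""
    lm_loss = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                              target_tokens.reshape(-1))
    syntax_loss = F.mse_loss(pred_distances, gold_distances)
    return lm_loss + alpha * syntax_loss
```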
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.