Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations
- URL: http://arxiv.org/abs/2507.04886v3
- Date: Thu, 31 Jul 2025 21:36:02 GMT
- Title: Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations
- Authors: A. Bochkov
- Abstract summary: We construct Transformer models where the embedding layer is entirely frozen. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational "meaning vectors." This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to "representational interference" in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer's compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.
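As a rough illustration of the frozen visual-embedding idea (not the released code: the glyph resolution, ASCII-only vocabulary, and PIL default font below are placeholder choices), the sketch renders each character to a small bitmap, flattens it into a fixed vector, and loads the result into a frozen PyTorch embedding table so that only the Transformer layers above it are trained.

```python
# Minimal sketch (not the paper's released code): a frozen embedding table
# whose rows are flattened bitmaps of character glyphs.
import torch
import torch.nn as nn
from PIL import Image, ImageDraw, ImageFont

GLYPH_SIZE = 16                                  # toy resolution (assumption)
VOCAB = [chr(cp) for cp in range(32, 127)]       # printable ASCII only, for illustration

def render_glyph(ch: str, size: int = GLYPH_SIZE) -> torch.Tensor:
    """Rasterize one character with PIL's default font and flatten it to a vector."""
    img = Image.new("L", (size, size), color=0)
    ImageDraw.Draw(img).text((0, 0), ch, fill=255, font=ImageFont.load_default())
    pixels = torch.tensor(list(img.getdata()), dtype=torch.float32)
    return pixels / 255.0                        # shape: (size * size,)

# Precompute non-semantic visual vectors, one row per token in the toy vocabulary.
visual_vectors = torch.stack([render_glyph(ch) for ch in VOCAB])

# Freeze them: the embedding layer receives no gradient updates during training.
embedding = nn.Embedding.from_pretrained(visual_vectors, freeze=True)

# A trainable Transformer stack would sit on top of these fixed inputs, e.g.:
# hidden = transformer_blocks(embedding(token_ids))
```

In a full model, the Unicode-centric tokenizer would map arbitrary text onto code points covered by such a table, and the Transformer blocks on top would carry all trainable capacity.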
Related papers
- Harnessing the Universal Geometry of Embeddings [8.566825612032359]
We introduce the first method for translating text embeddings from one vector space to another without any paired data, encoders, or predefined sets of matches.
Our translations achieve high cosine similarity across model pairs with different architectures, parameter counts, and training datasets.
arXiv Detail & Related papers (2025-05-18T20:37:07Z)
- Masked Completion via Structured Diffusion with White-Box Transformers [23.07048591213815]
We provide the first instantiation of the white-box design paradigm that can be applied to large-scale unsupervised representation learning.
We do this by exploiting a fundamental connection between diffusion, compression, and (masked) completion, deriving a deep transformer-like masked autoencoder architecture.
CRATE-MAE demonstrates highly promising performance on large-scale imagery datasets.
arXiv Detail & Related papers (2024-04-03T04:23:01Z)
- Pushdown Layers: Encoding Recursive Structure in Transformer Language Models [86.75729087623259]
Recursion is a prominent feature of human language, and fundamentally challenging for self-attention.
This work introduces Pushdown Layers, a new self-attention layer.
Transformers equipped with Pushdown Layers achieve dramatically better syntactic generalization with 3-5x greater sample efficiency.
arXiv Detail & Related papers (2023-10-29T17:27:18Z)
- Meaning Representations from Trajectories in Autoregressive Models [106.63181745054571]
We propose to extract meaning representations from autoregressive language models by considering the distribution of all possible trajectories extending an input text.
This strategy is prompt-free, does not require fine-tuning, and is applicable to any pre-trained autoregressive model.
We empirically show that the representations obtained from large models align well with human annotations, outperform other zero-shot and prompt-free methods on semantic similarity tasks, and can be used to solve more complex entailment and containment tasks that standard embeddings cannot handle.
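The summary leaves the implementation open; purely as an illustration of the trajectory idea (the model name, sample counts, and scoring rule below are assumptions, not the paper's recipe), one can sample continuations of one text and score them under another, treating mutual likelihood of trajectories as a similarity signal.

```python
# Illustrative sketch only: compare two texts by the likelihood each assigns
# to sampled continuations ("trajectories") of the other. Model choice and
# hyperparameters are placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sample_trajectories(text, n=8, new_tokens=20):
    ids = tok(text, return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=True, max_new_tokens=new_tokens,
                         num_return_sequences=n, pad_token_id=tok.eos_token_id)
    return [seq[ids.shape[1]:] for seq in out]     # keep only the continuations

def log_prob(text, continuation_ids):
    ids = tok(text, return_tensors="pt").input_ids
    full = torch.cat([ids, continuation_ids.unsqueeze(0)], dim=1)
    with torch.no_grad():
        logits = model(full).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    cont_positions = range(ids.shape[1] - 1, full.shape[1] - 1)
    return sum(logp[t, full[0, t + 1]].item() for t in cont_positions)

# Two texts are "close" if each assigns high likelihood to the other's trajectories.
a, b = "A cat sat on the mat.", "A kitten rested on the rug."
score = sum(log_prob(b, traj) for traj in sample_trajectories(a))
print(score)
```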
arXiv Detail & Related papers (2023-10-23T04:35:58Z)
- Substance or Style: What Does Your Image Embedding Know? [55.676463077772866]
Image foundation models have primarily been evaluated for semantic content.
We measure the visual content of embeddings along many axes, including image style, quality, and a range of natural and artificial transformations.
We find that image-text models (CLIP and ALIGN) are better at recognizing new examples of style transfer than masking-based models (CAN and MAE).
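As a generic illustration of this kind of measurement (not the paper's evaluation protocol), a linear probe trained on frozen embeddings can test how much of a non-semantic attribute, such as whether a transformation was applied, the embedding encodes.

```python
# Generic linear-probe sketch (not the paper's evaluation code): test whether
# frozen image embeddings encode a non-semantic attribute such as blur vs. no blur.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: precomputed embeddings (e.g., from an image model) and binary
# labels marking whether an artificial transformation was applied to the source image.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512))    # stand-in for real model outputs
labels = rng.integers(0, 2, size=1000)       # stand-in for transformation labels

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
# ~0.5 on this random data; accuracy above chance on real embeddings would
# indicate they carry information about that visual axis.
```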
arXiv Detail & Related papers (2023-07-10T22:40:10Z)
- Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules.
Inputs to the model are routed through a sequence of functions in a way that is learned end-to-end.
We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferable to a new task in a sample-efficient manner.
arXiv Detail & Related papers (2021-10-12T23:22:45Z)
- AAformer: Auto-Aligned Transformer for Person Re-Identification [82.45385078624301]
We introduce an alignment scheme into the transformer architecture for the first time.
We propose the auto-aligned transformer (AAformer) to automatically locate both human parts and non-human ones at the patch level.
AAformer integrates part alignment into the self-attention, and the output [PART] tokens can be directly used as part features for retrieval.
arXiv Detail & Related papers (2021-04-02T08:00:25Z)
- Improved Biomedical Word Embeddings in the Transformer Era [2.978663539080876]
We learn word and concept embeddings by first using the skip-gram method and further fine-tuning them with correlational information.
We conduct evaluations of these tuned static embeddings using multiple datasets for word relatedness developed by previous efforts.
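The correlational fine-tuning stage is specific to the paper; the first stage is standard skip-gram training, which, assuming gensim and a placeholder corpus, looks roughly like this:

```python
# Sketch of the first stage only (standard skip-gram via gensim); the paper's
# subsequent fine-tuning with correlational concept information is not shown.
from gensim.models import Word2Vec

# Placeholder corpus of tokenized biomedical sentences.
corpus = [
    ["aspirin", "reduces", "inflammation"],
    ["ibuprofen", "reduces", "pain", "and", "inflammation"],
    ["aspirin", "inhibits", "platelet", "aggregation"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # embedding dimensionality
    window=5,
    min_count=1,
    sg=1,              # sg=1 selects the skip-gram objective
    epochs=50,
)
print(model.wv.most_similar("aspirin", topn=2))
```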
arXiv Detail & Related papers (2020-12-22T03:03:50Z)
- Unsupervised Distillation of Syntactic Information from Contextualized Word Representations [62.230491683411536]
We tackle the task of unsupervised disentanglement between semantics and structure in neural language representations.
To this end, we automatically generate groups of sentences which are structurally similar but semantically different.
We demonstrate that our transformation clusters vectors in space by structural properties, rather than by lexical semantics.
arXiv Detail & Related papers (2020-10-11T15:13:18Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
When paired with a strong auto-regressive decoder, VAEs tend to ignore their latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
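The summary does not spell out the mechanism; one common way to realize a discrete latent bottleneck, shown here purely as an illustration and not as the paper's method, is a Gumbel-softmax categorical layer between encoder and decoder.

```python
# Illustration only: a Gumbel-softmax categorical bottleneck is one common way
# to build a discrete latent space; the paper's specific mechanism may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteBottleneck(nn.Module):
    def __init__(self, input_dim=256, num_codes=64, code_dim=32):
        super().__init__()
        self.to_logits = nn.Linear(input_dim, num_codes)   # encoder head
        self.codebook = nn.Embedding(num_codes, code_dim)  # learned code vectors

    def forward(self, h, tau=1.0):
        logits = self.to_logits(h)
        # Differentiable one-hot sample over the discrete codes.
        one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
        z = one_hot @ self.codebook.weight                 # select a code vector
        return z, logits

bottleneck = DiscreteBottleneck()
hidden = torch.randn(4, 256)       # stand-in for encoder states
z, logits = bottleneck(hidden)
print(z.shape)                     # torch.Size([4, 32])
```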
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
- Edge Guided GANs with Contrastive Learning for Semantic Image Synthesis [194.1452124186117]
We propose a novel ECGAN for the challenging semantic image synthesis task.
Our ECGAN achieves significantly better results than state-of-the-art methods.
arXiv Detail & Related papers (2020-03-31T01:23:21Z)
- Stacked DeBERT: All Attention in Incomplete Data for Text Classification [8.900866276512364]
We propose Stacked DeBERT, short for Stacked Denoising Bidirectional Representations from Transformers.
Our model shows improved F1-scores and better robustness on informal/incorrect text from tweets and on text with Speech-to-Text errors, in both sentiment and intent classification tasks.
arXiv Detail & Related papers (2020-01-01T04:49:23Z)