Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces
- URL: http://arxiv.org/abs/2507.09709v2
- Date: Thu, 21 Aug 2025 17:55:26 GMT
- Title: Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces
- Authors: Baturay Saglam, Paul Kassianik, Blaine Nelson, Sajana Weerawardhena, Yaron Singer, Amin Karbasi
- Abstract summary: Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. We conduct a large-scale empirical study of hidden representations in 11 autoregressive models across 6 scientific topics.
- Score: 31.401762286885656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. However, it remains unclear to what extent LLMs internally organize representations related to semantic understanding. To explore this, we conduct a large-scale empirical study of hidden representations in 11 autoregressive models across 6 scientific topics. We find that high-level semantic information consistently resides in low-dimensional subspaces that form linearly separable representations across domains. This separability becomes more pronounced in deeper layers and under prompts that elicit structured reasoning or alignment behavior, even when surface content remains unchanged. These findings support geometry-aware tools that operate directly in latent space to detect and mitigate harmful or adversarial content. As a proof of concept, we train an MLP probe on final-layer hidden states to act as a lightweight latent-space guardrail. This approach substantially improves refusal rates on malicious queries and prompt injections that bypass both the model's built-in safety alignment and external token-level filters.
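The latent-space guardrail described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: synthetic vectors stand in for final-layer hidden states, and the class separation is simulated along a single random direction (in the paper, hidden states would be extracted from an actual LLM).

```python
# Minimal sketch of an MLP probe acting as a latent-space guardrail.
# Synthetic data stands in for real final-layer hidden states.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
d_model = 64  # stand-in for the model's hidden size

# Simulate linearly separable low-dimensional structure: malicious
# states are shifted along a latent direction relative to benign ones.
direction = rng.normal(size=d_model)
benign = rng.normal(size=(200, d_model))
malicious = rng.normal(size=(200, d_model)) + 2.0 * direction

X = np.vstack([benign, malicious])
y = np.array([0] * 200 + [1] * 200)

probe = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
probe.fit(X, y)

# A guardrail would trigger a refusal when the probe flags a hidden state.
score = probe.predict_proba(rng.normal(size=(1, d_model)) + 2.0 * direction)[0, 1]
print(f"malicious probability: {score:.2f}")
```

Because the probe sees only hidden states, it can fire even on prompts whose surface tokens evade text-level filters, which is the behavior the abstract reports.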
Related papers
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs [100.02824137397464]
We investigate how Large Language Models adapt their internal representations when encountering inputs of increasing difficulty. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, the last hidden states of LLMs become substantially sparser. This sparsity–difficulty relation is observable across diverse models and domains.
arXiv Detail & Related papers (2026-03-03T18:48:15Z) - Emergent Structured Representations Support Flexible In-Context Inference in Large Language Models [77.98801218316505]
Large language models (LLMs) exhibit emergent behaviors suggestive of human-like reasoning. We investigate the internal processing of LLMs during in-context concept inference.
arXiv Detail & Related papers (2026-02-08T03:14:39Z) - One Swallow Does Not Make a Summer: Understanding Semantic Structures in Embedding Spaces [17.173074024116477]
Embedding spaces are fundamental to modern AI, translating raw data into high-dimensional vectors that encode rich semantic relationships. We introduce the Semantic Field Subspace (SFS), a geometry-preserving, context-aware representation that captures local semantic neighborhoods within the embedding space. We also propose SAFARI, an unsupervised, modality-agnostic algorithm that uncovers hierarchical semantic structures using a novel metric called Semantic Shift.
arXiv Detail & Related papers (2025-11-30T11:48:00Z) - SCOPE: Intrinsic Semantic Space Control for Mitigating Copyright Infringement in LLMs [39.14996705577274]
SCOPE is an inference-time method that requires no parameter updates or auxiliary filters. We identify a copyright-sensitive subspace and clamp its activations during decoding. Experiments on widely recognized benchmarks show that SCOPE mitigates copyright infringement without degrading general utility.
arXiv Detail & Related papers (2025-11-10T11:53:07Z) - Semantic Concentration for Self-Supervised Dense Representations Learning [103.10708947415092]
Image-level self-supervised learning (SSL) has made significant progress, yet learning dense representations for patches remains challenging. This work reveals that image-level SSL avoids over-dispersion by involving implicit semantic concentration.
arXiv Detail & Related papers (2025-09-11T13:12:10Z) - Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning [6.652200654829215]
We learn non-basis-aligned subspaces in an unsupervised manner. Results show that the information encoded in the obtained subspaces tends to share the same abstract concept across different inputs. We also provide evidence of scalability to 2B-parameter models by finding separate subspaces that mediate context and parametric knowledge routing.
arXiv Detail & Related papers (2025-08-03T20:59:29Z) - The Geometry of Harmfulness in LLMs through Subconcept Probing [3.6335172274433414]
We introduce a multidimensional framework for probing and steering harmful content in language models. For each of 55 distinct harmfulness subconcepts, we learn a linear probe, yielding 55 interpretable directions in activation space. We then test ablation of the entire subspace from model internals, as well as steering and ablation in the subspace's dominant direction.
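The ablation step this entry describes (removing a probe-learned direction from model activations) reduces to a projection onto an orthogonal complement. A minimal sketch, where a random unit vector stands in for the weight vector of a trained linear probe:

```python
# Sketch of directional ablation: remove the component of a hidden
# state along a probe-learned "harmfulness" direction.
import numpy as np

rng = np.random.default_rng(1)
d = 128
v = rng.normal(size=d)
v /= np.linalg.norm(v)  # unit direction; a real probe's weights would go here

def ablate(h: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project hidden state h onto the orthogonal complement of direction."""
    return h - (h @ direction) * direction

h = rng.normal(size=d)
h_ablated = ablate(h, v)
print(abs(h_ablated @ v))  # ~0: no remaining component along the direction
```

Steering is the converse operation: adding a scaled multiple of the direction rather than subtracting the existing component.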
arXiv Detail & Related papers (2025-07-23T07:56:05Z) - FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations [78.65988445433844]
FloorplanQA is a diagnostic benchmark for evaluating spatial reasoning in large language models. The benchmark covers core spatial tasks, including distance measurement, visibility, path finding, and object placement within constrained spaces.
arXiv Detail & Related papers (2025-07-10T11:16:48Z) - Vector Ontologies as an LLM world view extraction method [0.0]
Large Language Models (LLMs) possess intricate internal representations of the world, yet these structures are notoriously difficult to interpret or repurpose beyond the original prediction task. A vector ontology defines a domain-specific vector space spanned by ontologically meaningful dimensions, allowing geometric analysis of concepts and relationships within a domain. Using GPT-4o-mini, we extract genre representations through multiple natural language prompts and analyze the consistency of these projections across linguistic variations and their alignment with ground-truth data.
arXiv Detail & Related papers (2025-06-16T08:49:21Z) - Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling [56.26834106704781]
Factual incorrectness in generated content is one of the primary concerns in the ubiquitous deployment of large language models (LLMs). We provide evidence for an internal compass in LLMs that dictates the correctness of factual recall at the time of generation. Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers.
arXiv Detail & Related papers (2025-05-27T16:24:02Z) - The Origins of Representation Manifolds in Large Language Models [52.68554895844062]
We show that cosine similarity in representation space may encode the intrinsic geometry of a feature through shortest, on-manifold paths. The critical assumptions and predictions of the theory are validated on text embeddings and token activations of large language models.
arXiv Detail & Related papers (2025-05-23T13:31:22Z) - Bridging the Dimensional Chasm: Uncover Layer-wise Dimensional Reduction in Transformers through Token Correlation [2.5976894391099625]
We develop a framework that tracks token dynamics across Transformer layers. This work advances interpretability by reframing Transformer layers as projectors between high-dimensional and low-dimensional semantics.
arXiv Detail & Related papers (2025-03-28T15:47:30Z) - Unraveling the Localized Latents: Learning Stratified Manifold Structures in LLM Embedding Space with Sparse Mixture-of-Experts [3.9426000822656224]
We conjecture that in large language models, the embeddings live in a local manifold structure with different dimensions depending on the perplexities and domains of the input data. By incorporating an attention-based soft-gating network, we verify that our model learns specialized sub-manifolds for an ensemble of input data sources.
arXiv Detail & Related papers (2025-02-19T09:33:16Z) - Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts [68.48103545146127]
This paper proposes a novel framework for unsupervised exploration of diffusion latent spaces.
We directly leverage natural language prompts and image captions to map latent directions.
Our method provides a more scalable and interpretable understanding of the semantic knowledge encoded within diffusion models.
arXiv Detail & Related papers (2024-10-25T21:44:51Z) - SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models [70.01883340129204]
Spatial reasoning is a crucial component of both biological and artificial intelligence.
We present a comprehensive study of the capability of current state-of-the-art large language models (LLMs) on spatial reasoning.
arXiv Detail & Related papers (2024-06-07T01:06:34Z) - Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs).
We suggest investigating internal activations and quantifying LLM's truthfulness using the local intrinsic dimension (LID) of model activations.
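Local intrinsic dimension can be estimated from nearest-neighbor distances alone. Below is a minimal sketch using the Levina-Bickel maximum-likelihood estimator (one standard choice for LID; this entry does not specify which estimator the paper uses). The data are synthetic points on a 2-D plane embedded in 10-D, so the estimate should come out near 2.

```python
# Levina-Bickel MLE for local intrinsic dimension, on synthetic data:
# a 2-D Gaussian cloud isometrically embedded in 10 dimensions.
import numpy as np

rng = np.random.default_rng(2)
n, k = 500, 20

coords = rng.normal(size=(n, 2))
basis = np.linalg.qr(rng.normal(size=(10, 2)))[0]  # orthonormal embedding
X = coords @ basis.T

def lid_mle(X: np.ndarray, k: int) -> np.ndarray:
    """Per-point LID estimates from the k nearest neighbors of each point."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dists = np.sort(dists, axis=1)[:, 1:k + 1]  # drop the zero self-distance
    # MLE: inverse of the mean log-ratio of the k-th distance to the rest
    return 1.0 / np.mean(np.log(dists[:, -1:] / dists[:, :-1]), axis=1)

print(f"mean LID estimate: {lid_mle(X, k).mean():.2f}")
```

Applied to LLM activations, the same per-point estimate would be computed over neighborhoods of hidden states rather than synthetic coordinates.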
arXiv Detail & Related papers (2024-02-28T04:56:21Z) - The Low-Dimensional Linear Geometry of Contextualized Word Representations [27.50785941238007]

We study the linear geometry of contextualized word representations in ELMo and BERT.
We show that a variety of linguistic features are encoded in low-dimensional subspaces.
arXiv Detail & Related papers (2021-05-15T00:58:08Z) - Introducing Orthogonal Constraint in Structural Probes [0.2538209532048867]
We decompose a linear projection of language vector space into isomorphic space rotation and linear scaling directions.
We experimentally show that our approach can be performed in a multitask setting.
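The decomposition this entry describes (a linear projection split into an isometric rotation and per-direction scaling) can be illustrated with a singular value decomposition, which factors any matrix as rotation × scaling × rotation. A minimal sketch, with a random matrix standing in for a learned structural probe:

```python
# SVD as rotation * scaling * rotation: B = U @ diag(s) @ Vt, where
# U and Vt are orthogonal (isometries) and s holds the scalings.
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(16, 16))  # stand-in for a learned probe matrix

U, s, Vt = np.linalg.svd(B)
assert np.allclose(U @ U.T, np.eye(16))  # U is orthogonal
B_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(B_rebuilt, B))  # reconstruction matches B
```

Constraining the probe's rotation factors to stay orthogonal, as the entry suggests, separates the geometry-preserving part of the map from the scaling directions.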
arXiv Detail & Related papers (2020-12-30T17:14:25Z) - Closed-Form Factorization of Latent Semantics in GANs [65.42778970898534]
A rich set of interpretable dimensions has been shown to emerge in the latent space of Generative Adversarial Networks (GANs) trained for synthesizing images.
In this work, we examine the internal representation learned by GANs to reveal the underlying variation factors in an unsupervised manner.
We propose a closed-form factorization algorithm for latent semantic discovery by directly decomposing the pre-trained weights.
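A closed-form factorization of this kind can be sketched as an eigendecomposition of the first projection layer's weights: candidate semantic directions are the top eigenvectors of A^T A (this mirrors the SeFa formulation; a random matrix stands in for real pre-trained weights here).

```python
# Closed-form latent direction discovery: top eigenvectors of A^T A,
# where A is the pre-trained weight mapping the latent code upward.
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(256, 32))  # stand-in: layer maps a 32-D latent to 256-D

eigvals, eigvecs = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]   # largest eigenvalue first
directions = eigvecs[:, order]      # columns = candidate latent directions

# The top direction maximizes the output change ||A d|| over unit d.
print(np.linalg.norm(A @ directions[:, 0]))
```

No images or sampling are needed, which is what makes the factorization "closed-form": everything comes directly from the pre-trained weights.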
arXiv Detail & Related papers (2020-07-13T18:05:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.