InverseScope: Scalable Activation Inversion for Interpreting Large Language Models
- URL: http://arxiv.org/abs/2506.07406v1
- Date: Mon, 09 Jun 2025 03:59:28 GMT
- Title: InverseScope: Scalable Activation Inversion for Interpreting Large Language Models
- Authors: Yifan Luo, Zhennan Zhou, Bin Dong
- Abstract summary: InverseScope is an assumption-light and scalable framework for interpreting neural activations via input inversion. To address the inefficiency of sampling in high-dimensional spaces, we propose a novel conditional generation architecture. We also introduce a quantitative evaluation protocol that tests interpretability hypotheses using a feature consistency rate computed over the sampled inputs.
- Score: 6.841889611296894
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding the internal representations of large language models (LLMs) is a central challenge in interpretability research. Existing feature interpretability methods often rely on strong assumptions about the structure of representations that may not hold in practice. In this work, we introduce InverseScope, an assumption-light and scalable framework for interpreting neural activations via input inversion. Given a target activation, we define a distribution over inputs that generate similar activations and analyze this distribution to infer the encoded features. To address the inefficiency of sampling in high-dimensional spaces, we propose a novel conditional generation architecture that significantly improves sample efficiency compared to previous methods. We further introduce a quantitative evaluation protocol that tests interpretability hypotheses using a feature consistency rate computed over the sampled inputs. InverseScope scales inversion-based interpretability methods to larger models and practical tasks, enabling systematic and quantitative analysis of internal representations in real-world LLMs.
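The core loop of such an inversion-based analysis can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the paper's implementation: `sample_input`, `get_activation`, and `has_feature` are hypothetical callables standing in for the conditional generator, the model's activation at the target site, and the interpretability hypothesis under test; the L2 ball of radius `eps` is one possible notion of "similar activations".

```python
import numpy as np

def feature_consistency_rate(target_act, sample_input, get_activation,
                             has_feature, eps=1.0, n_samples=200):
    """Sample candidate inputs, keep those whose activation falls near the
    target, and report how often the hypothesized feature appears among them."""
    accepted = [x for x in (sample_input() for _ in range(n_samples))
                if np.linalg.norm(get_activation(x) - target_act) < eps]
    if not accepted:
        return float("nan")  # no near-matching inputs found at this radius
    # fraction of near-matching inputs that exhibit the hypothesized feature
    return sum(has_feature(x) for x in accepted) / len(accepted)
```

A rate close to 1 supports the hypothesis that the target activation encodes the feature; a rate near the feature's base rate among all sampled inputs does not.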
Related papers
- AICO: Feature Significance Tests for Supervised Learning [0.5142666700569699]
This paper develops model- and distribution-agnostic significance tests to assess the influence of input features in any regression or classification algorithm. Each test targets the median effect of a feature on the model's score; we construct a uniformly most powerful, randomized sign test for this median, yielding exact p-values and confidence intervals for assessing feature significance. Experiments on synthetic tasks validate its statistical and computational advantages, and applications to real-world data illustrate its practical utility.
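As a rough illustration of a sign test for a median feature effect, here is a plain exact sign test, which is simpler than the paper's uniformly most powerful randomized construction; the paired-score setup (model score with the feature vs. with it masked) is an assumption for the example:

```python
import numpy as np
from scipy.stats import binomtest

def sign_test_pvalue(scores_with, scores_without, alternative="greater"):
    """Exact sign test of H0: the median per-sample score difference is zero,
    i.e. the feature has no effect on the model's score."""
    diffs = np.asarray(scores_with) - np.asarray(scores_without)
    diffs = diffs[diffs != 0]            # ties carry no sign information
    n_pos = int((diffs > 0).sum())       # samples where the feature helped
    # under H0, n_pos ~ Binomial(n, 1/2)
    return binomtest(n_pos, n=len(diffs), p=0.5, alternative=alternative).pvalue
```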
arXiv Detail & Related papers (2025-06-29T21:15:40Z)
- Efficient Latent Semantic Clustering for Scaling Test-Time Computation of LLMs [14.34599799034748]
Scaling test-time computation has become a promising strategy for improving the reliability and quality of large language models. A key shared component is semantic clustering, which groups outputs that differ in form but convey the same meaning. We propose Latent Semantic Clustering (LSC), a lightweight and context-sensitive method that leverages the generator LLM's internal hidden states for clustering.
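A minimal sketch of the idea, assuming mean-pooled last-layer hidden states as the latent embedding and agglomerative clustering with a cosine-distance threshold (both choices are placeholders, not necessarily the paper's):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def latent_semantic_clusters(hidden_states, distance_threshold=0.3):
    """Group generations whose latent embeddings are close in cosine distance.
    `hidden_states`: list of (seq_len, d) arrays from the generator LLM."""
    # mean-pool each generation's hidden states into a single vector
    embs = np.stack([h.mean(axis=0) for h in hidden_states])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)  # unit-normalize
    clustering = AgglomerativeClustering(
        n_clusters=None, metric="cosine", linkage="average",
        distance_threshold=distance_threshold)
    return clustering.fit_predict(embs)  # cluster label per generation
```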
arXiv Detail & Related papers (2025-05-31T02:08:32Z)
- Probabilistic Lexical Manifold Construction in Large Language Models via Hierarchical Vector Field Interpolation [0.0]
The proposed methodology constructs a probabilistic function space where word representations adhere to topological consistency. Probability constraints enhance lexical coherence by refining contextual relationships, leading to improvements in semantic stability across multiple linguistic distributions. An assessment of computational efficiency reveals that while the probabilistic representation introduces minor processing overhead, the structured representation learning approach remains scalable for practical deployment.
arXiv Detail & Related papers (2025-02-14T08:47:10Z)
- Latent Lexical Projection in Large Language Models: A Novel Approach to Implicit Representation Refinement [0.0]
Latent Lexical Projection (LLP) is introduced to refine lexical representations through a structured transformation into a latent space. LLP integrates an optimized projection mechanism within an existing language model architecture. Evaluations indicate a reduction in perplexity and an increase in BLEU scores, suggesting improvements in predictive accuracy and fluency.
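In toy form, such a projection can be written as a small bottleneck module applied to token embeddings; the dimensions, nonlinearity, and placement below are illustrative assumptions rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class LatentLexicalProjection(nn.Module):
    """Toy projection of token embeddings into a latent space and back,
    inserted after a model's embedding layer."""
    def __init__(self, d_model=768, d_latent=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)  # structured transformation
        self.up = nn.Linear(d_latent, d_model)    # map back for the decoder

    def forward(self, emb):                        # emb: (batch, seq, d_model)
        z = torch.tanh(self.down(emb))             # latent lexical code
        return self.up(z)                          # refined representation
```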
arXiv Detail & Related papers (2025-02-03T23:18:53Z)
- Latent Thought Models with Variational Bayes Inference-Time Computation [52.63299874322121]
Latent Thought Models (LTMs) incorporate explicit latent thought vectors that follow an explicit prior model in latent space. LTMs demonstrate superior sample and parameter efficiency compared to autoregressive models and discrete diffusion models.
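In generic latent-variable terms, "inference-time computation" here suggests fitting a variational posterior over the latent thought vector per example at test time. The sketch below makes that concrete under assumed details (diagonal Gaussian posterior, standard normal prior, and a hypothetical `decoder_logprob` callable returning log p(x|z)):

```python
import torch

def infer_latent_thought(decoder_logprob, d_latent=32, steps=50, lr=0.1):
    """Inference-time variational step (sketch): fit a Gaussian posterior
    over the latent thought vector z for one observed sequence."""
    mu = torch.zeros(d_latent, requires_grad=True)
    log_sigma = torch.zeros(d_latent, requires_grad=True)
    opt = torch.optim.Adam([mu, log_sigma], lr=lr)
    for _ in range(steps):
        eps = torch.randn(d_latent)
        z = mu + log_sigma.exp() * eps          # reparameterized sample
        # KL(q || N(0, I)) in closed form for a diagonal Gaussian posterior
        kl = 0.5 * (mu**2 + (2 * log_sigma).exp() - 2 * log_sigma - 1).sum()
        loss = -decoder_logprob(z) + kl         # negative ELBO
        opt.zero_grad(); loss.backward(); opt.step()
    return mu.detach(), log_sigma.exp().detach()
```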
arXiv Detail & Related papers (2025-02-03T17:50:34Z)
- Efficient Model-Free Exploration in Low-Rank MDPs [76.87340323826945]
Low-Rank Markov Decision Processes offer a simple, yet expressive framework for RL with function approximation.
Existing algorithms are either (1) computationally intractable, or (2) reliant upon restrictive statistical assumptions.
We propose the first provably sample-efficient algorithm for exploration in Low-Rank MDPs.
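For reference, the low-rank structure these results exploit is the standard factorization of the transition kernel (notation follows the common convention rather than this specific paper; d is the rank):

```latex
P(s' \mid s, a) \;=\; \big\langle \phi(s, a),\, \mu(s') \big\rangle,
\qquad \phi(s, a),\ \mu(s') \in \mathbb{R}^{d},
```

where, unlike in linear MDPs, the feature map \phi is not known to the learner and must itself be learned, which is what makes computationally efficient exploration difficult.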
arXiv Detail & Related papers (2023-07-08T15:41:48Z)
- Latent Variable Representation for Reinforcement Learning [131.03944557979725]
It remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of model-based reinforcement learning.
We provide a representation view of the latent variable models for state-action value functions, which allows both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle.
In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models.
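The UCB-style optimism typically enters through an elliptical exploration bonus on the learned representation; a generic form (not necessarily this paper's exact bonus) is:

```latex
b(s, a) \;=\; \beta\, \big\| \phi(s, a) \big\|_{\Lambda^{-1}},
\qquad \Lambda \;=\; \lambda I + \sum_{i=1}^{n} \phi(s_i, a_i)\, \phi(s_i, a_i)^{\top},
```

so that state-action pairs whose (kernel-embedded) latent features are poorly covered by past data receive a larger bonus.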
arXiv Detail & Related papers (2022-12-17T00:26:31Z)
- Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization [76.68866368409216]
We propose learning to dynamically select discretization tightness conditioned on inputs.
We show that dynamically varying tightness in communication bottlenecks can improve model performance on visual reasoning and reinforcement learning tasks.
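A stripped-down version of input-conditioned quantization tightness: quantize each input against one of several codebooks of different sizes, with a per-input selection. The hard argmax gate below is an invented placeholder; the paper learns this selection end-to-end.

```python
import torch

def dynamic_vq(x, codebooks, select_logits):
    """Quantize x with a per-input choice of codebook.
    x: (batch, d); codebooks: list of (K_i, d) tensors of varying size K_i
    (larger codebook = looser bottleneck); select_logits: (batch, n_codebooks)."""
    choice = select_logits.argmax(dim=-1)          # per-input tightness choice
    out = torch.empty_like(x)
    for i, cb in enumerate(codebooks):
        mask = choice == i
        if mask.any():
            d = torch.cdist(x[mask], cb)           # distances to code vectors
            q = cb[d.argmin(dim=-1)]               # nearest-neighbor codes
            # straight-through estimator: gradients flow through x
            out[mask] = x[mask] + (q - x[mask]).detach()
    return out
```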
arXiv Detail & Related papers (2022-02-02T23:54:26Z)
- Locally Interpretable Model Agnostic Explanations using Gaussian Processes [2.9189409618561966]
Local Interpretable Model-Agnostic Explanations (LIME) is a popular technique for explaining the prediction of a single instance.
We propose a Gaussian Process (GP) based variation of locally interpretable models.
We demonstrate that the proposed technique is able to generate faithful explanations using far fewer samples than LIME.
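A minimal sketch of the LIME-style loop with a GP surrogate in place of the usual sparse linear model; the Gaussian perturbation scheme and RBF kernel are assumptions for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_local_explanation(predict_fn, x0, scale=0.1, n_samples=50, seed=0):
    """Fit a GP to the black-box model's outputs in a neighborhood of x0.
    The fitted GP (with uncertainty) acts as the local surrogate model."""
    rng = np.random.default_rng(seed)
    X = x0 + scale * rng.standard_normal((n_samples, x0.shape[0]))
    y = predict_fn(X)                          # black-box predictions
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=scale))
    gp.fit(X, y)
    return gp               # query gp.predict(..., return_std=True) locally
```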
arXiv Detail & Related papers (2021-08-16T05:49:01Z)
- Attentional Prototype Inference for Few-Shot Segmentation [128.45753577331422]
We propose attentional prototype inference (API), a probabilistic latent variable framework for few-shot segmentation.
We define a global latent variable to represent the prototype of each object category, which we model as a probabilistic distribution.
We conduct extensive experiments on four benchmarks, where our proposal obtains performance at least competitive with, and often better than, state-of-the-art prototype-based methods.
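The deterministic core that API makes probabilistic and attentional can be sketched as masked average pooling of support features into a prototype, followed by cosine matching on the query (this sketch omits both the attention mechanism and the latent distribution over prototypes):

```python
import torch
import torch.nn.functional as F

def prototype_segment(support_feat, support_mask, query_feat):
    """support_feat: (d, H, W); support_mask: (H, W) binary foreground mask;
    query_feat: (d, H, W). Returns an (H, W) similarity map over the query."""
    # masked average pooling: prototype = mean of foreground support features
    proto = (support_feat * support_mask).sum(dim=(1, 2)) / support_mask.sum().clamp(min=1)
    # cosine similarity between the prototype and every query location
    return F.cosine_similarity(query_feat, proto[:, None, None], dim=0)
```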
arXiv Detail & Related papers (2021-05-14T06:58:44Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
When paired with a strong autoregressive decoder, VAEs tend to ignore the latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
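The failure mode alluded to is posterior collapse. In the standard VAE objective,

```latex
\mathcal{L}(x) \;=\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
\;-\; \mathrm{KL}\!\left(q_\phi(z \mid x)\,\big\|\,p(z)\right),
```

a sufficiently strong autoregressive decoder can model x well while the KL term is driven to zero, so q(z|x) collapses to the prior and z carries no information; constraining z to a compact discrete space is one way to keep the latent code in use.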
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.