Related papers: Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation

Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation

URL: http://arxiv.org/abs/2503.02078v2
Date: Sun, 09 Mar 2025 10:27:43 GMT
Title: Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation
Authors: Jonathan Jacobi, Gal Niv,
Abstract summary: We introduce Superscopes, a technique that amplifies features in models into new contexts.<n>Superscopes enables the interpretation of internal representations that previous methods failed to explain-all without requiring additional training.<n>This approach provides new insights into how LLMs build context and represent complex concepts, further advancing mechanistic interpretability.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding and interpreting the internal representations of large language models (LLMs) remains an open challenge. Patchscopes introduced a method for probing internal activations by patching them into new prompts, prompting models to self-explain their hidden representations. We introduce Superscopes, a technique that systematically amplifies superposed features in MLP outputs (multilayer perceptron) and hidden states before patching them into new contexts. Inspired by the "features as directions" perspective and the Classifier-Free Guidance (CFG) approach from diffusion models, Superscopes amplifies weak but meaningful features, enabling the interpretation of internal representations that previous methods failed to explain-all without requiring additional training. This approach provides new insights into how LLMs build context and represent complex concepts, further advancing mechanistic interpretability.

Related papers

Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment [53.90425382758605]
We show how fine-tuning alters the internal structure of a model to specialize in new multimodal tasks. Our work sheds light on how multimodal representations evolve through fine-tuning and offers a new perspective for interpreting model adaptation in multimodal tasks.
arXiv Detail & Related papers (2025-01-06T13:37:13Z)
Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts [68.48103545146127]
This paper proposes a novel framework for unsupervised exploration of diffusion latent spaces. We directly leverage natural language prompts and image captions to map latent directions. Our method provides a more scalable and interpretable understanding of the semantic knowledge encoded within diffusion models.
arXiv Detail & Related papers (2024-10-25T21:44:51Z)
PromptExp: Multi-granularity Prompt Explanation of Large Language Models [16.259208045898415]
We introduce PromptExp, a framework for multi-granularity prompt explanations by aggregating token-level insights. PromptExp supports both white-box and black-box explanations and extends explanations to higher granularity levels. We evaluate PromptExp in case studies such as sentiment analysis, showing the perturbation-based approach performs best.
arXiv Detail & Related papers (2024-10-16T22:25:15Z)
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models [24.817659341654654]
We introduce a framework called Patchscopes and show how it can be used to answer a wide range of questions about an LLM's computation. We show that many prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM can be viewed as instances of this framework. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities such as using a more capable model to explain the representations of a smaller model, and multihop reasoning error correction.
arXiv Detail & Related papers (2024-01-11T18:33:48Z)
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism. We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z)
Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention [53.896974148579346]
Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains. The enigmatic black-box'' nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications. We propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs.
arXiv Detail & Related papers (2023-12-22T19:55:58Z)
Explainability for Large Language Models: A Survey [59.67574757137078]
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing. This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
arXiv Detail & Related papers (2023-09-02T22:14:26Z)
SINC: Self-Supervised In-Context Learning for Vision-Language Tasks [64.44336003123102]
We propose a framework to enable in-context learning in large language models. A meta-model can learn on self-supervised prompts consisting of tailored demonstrations. Experiments show that SINC outperforms gradient-based methods in various vision-language tasks.
arXiv Detail & Related papers (2023-07-15T08:33:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.