Related papers: I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Related papers

Step-Level Sparse Autoencoder for Reasoning Process Interpretation [48.99201531966593]
Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning.<n>We propose step-level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs' reasoning steps into sparse features.<n> Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features.
arXiv Detail & Related papers (2026-03-03T14:25:02Z)
LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval [74.72139580745511]
LaSER is a novel self-distillation framework that internalizes explicit reasoning into the latent space of retrievers.<n>Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.
arXiv Detail & Related papers (2026-03-02T04:11:18Z)
Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models [39.5490415037017]
Chain-of-Thought (CoT) prompting has improved the reasoning performance of large language models (LLMs)<n>It remains unclear why it works and whether it is the unique mechanism for triggering reasoning in large language models.
arXiv Detail & Related papers (2026-01-12T23:01:21Z)
Do Sparse Autoencoders Identify Reasoning Features in Language Models? [12.693974363520423]
We investigate whether sparse autoencoders (SAEs) identify genuine reasoning features in large language models (LLMs)<n>We first show through a simple theoretical analysis that $ell_$-regularized SAEs are intrinsically biased toward low-dimensional patterns.<n>Motivated by this bias, we introduce a falsification-oriented evaluation framework to test whether feature activation reflects reasoning processes or superficial linguistic correlates.
arXiv Detail & Related papers (2026-01-09T09:54:36Z)
Eliciting Reasoning in Language Models with Cognitive Tools [9.68459632251626]
We build on the long-standing literature in cognitive psychology and cognitive architectures.<n>We endow an LLM with a small set of "cognitive tools" encapsulating specific reasoning operations.<n>Surprisingly, this simple strategy results in considerable gains in performance on standard mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-06-13T13:56:52Z)
Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization [17.101290138120564]
Current methods rely on dictionary learning with sparse autoencoders (SAEs)<n>Here, we tackle these limitations by directly decomposing activations with semi-nonnegative matrix factorization (SNMF)<n>Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering.
arXiv Detail & Related papers (2025-06-12T17:33:29Z)
The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction [34.86855316803838]
We identify a set of linear features in the model's residual stream that govern the balance between genuine reasoning and memory recall.<n>We show that intervening in these reasoning features helps the model more accurately activate the most relevant problem-solving capabilities during answer generation.
arXiv Detail & Related papers (2025-03-29T14:00:44Z)
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement [91.88062410741833]
This study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs)<n>We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization.<n>OpenVLThinker, a LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrates the potential of our strategy for robust vision-language reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
Computation Mechanism Behind LLM Position Generalization [59.013857707250814]
Large language models (LLMs) exhibit flexibility in handling textual positions.<n>They can understand texts with position perturbations and generalize to longer texts.<n>This work connects the linguistic phenomenon with LLMs' computational mechanisms.
arXiv Detail & Related papers (2025-03-17T15:47:37Z)
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders [29.356200147371275]
Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses.<n>We propose using a fixed vocabulary set for feature interpretations and designing a mutual information-based objective.<n>We propose two runtime steering strategies that adjust the learned feature activations based on their corresponding explanations.
arXiv Detail & Related papers (2025-02-21T16:36:42Z)
AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO [0.0]
Large Language Models (LLMs) have demonstrated impressive capabilities in language processing, yet they often struggle with tasks requiring visual spatial reasoning. We introduce a novel two-stage training framework designed to equip standard LLMs with visual reasoning abilities for maze navigation.
arXiv Detail & Related papers (2025-02-20T16:05:18Z)
Unveiling the Magic of Code Reasoning through Hypothesis Decomposition and Amendment [54.62926010621013]
We introduce a novel task, code reasoning, to provide a new perspective for the reasoning abilities of large language models.<n>We summarize three meta-benchmarks based on established forms of logical reasoning, and instantiate these into eight specific benchmark tasks.<n>We present a new pathway exploration pipeline inspired by human intricate problem-solving methods.
arXiv Detail & Related papers (2025-02-17T10:39:58Z)
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains.<n>Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities.<n>We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z)
Retrieval-Augmented Semantic Parsing: Using Large Language Models to Improve Generalization [6.948555996661213]
We introduce Retrieval-Augmented Semantic Parsing (RASP), a simple yet effective approach that integrates external lexical knowledge into the parsing process.<n>Our experiments show that LLMs outperform previous encoder-decoder baselines for semantic parsing.
arXiv Detail & Related papers (2024-12-13T15:30:20Z)
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models [51.485491249693155]
We first apply a Sparse Autoencoder to disentangle the representations into human understandable features.<n>We then present an automatic interpretation framework to interpreted the open-semantic features learned in SAE by the LMMs themselves.
arXiv Detail & Related papers (2024-11-22T14:41:36Z)
Automatically Interpreting Millions of Features in Large Language Models [1.8035046415192353]
sparse autoencoders (SAEs) can be used to transform activations into a higher-dimensional latent space.<n>We build an open-source pipeline to generate and evaluate natural language explanations for SAE features.<n>Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons.
arXiv Detail & Related papers (2024-10-17T17:56:01Z)
Thought-Like-Pro: Enhancing Reasoning of Large Language Models through Self-Driven Prolog-based Chain-of-Thought [31.964412924094656]
Large language models (LLMs) have shown exceptional performance as general-purpose assistants. We introduce a novel learning framework, THOUGHT-LIKE-PRO, to facilitate learning and generalization across diverse reasoning tasks. Our empirical findings indicate that our proposed approach substantially enhances the reasoning abilities of LLMs.
arXiv Detail & Related papers (2024-07-18T18:52:10Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems. LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning. We study single-task training, multi-task training, and "chain-of-thought" knowledge distillation fine-tuning technique to assess the performance of model.
arXiv Detail & Related papers (2023-10-02T01:00:50Z)
Evaluating and Explaining Large Language Models for Code Using Syntactic Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code. At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes. We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z)
Large Language Models Are Not Strong Abstract Reasoners [12.354660792999269]
Large Language Models have shown tremendous performance on a variety of natural language processing tasks. It is unclear whether LLMs can achieve human-like cognitive capabilities or whether these models are still fundamentally circumscribed. We introduce a new benchmark for evaluating language models beyond memorization on abstract reasoning tasks.
arXiv Detail & Related papers (2023-05-31T04:50:29Z)
Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners [75.85554779782048]
Large Language Models (LLMs) have excited the natural language and machine learning community over recent years. Despite of numerous successful applications, the underlying mechanism of such in-context capabilities still remains unclear. In this work, we hypothesize that the learned textitsemantics of language tokens do the most heavy lifting during the reasoning process.
arXiv Detail & Related papers (2023-05-24T07:33:34Z)
Post Hoc Explanations of Language Models Can Improve Language Models [43.2109029463221]
We present a novel framework, Amplifying Model Performance by Leveraging In-Context Learning with Post Hoc Explanations (AMPLIFY) We leverage post hoc explanation methods which output attribution scores (explanations) capturing the influence of each of the input features on model predictions. Our framework, AMPLIFY, leads to prediction accuracy improvements of about 10-25% over a wide range of tasks.
arXiv Detail & Related papers (2023-05-19T04:46:04Z)
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions. Our findings mark a first step toward faithfully understanding the inner-workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.