Related papers: Brain-Grounded Axes for Reading and Steering LLM States

Brain-Grounded Axes for Reading and Steering LLM States

URL: http://arxiv.org/abs/2512.19399v1
Date: Mon, 22 Dec 2025 13:51:03 GMT
Title: Brain-Grounded Axes for Reading and Steering LLM States
Authors: Sandro Andric,
Abstract summary: Interpretability methods for large language models (LLMs) typically derive directions from textual supervision.<n>We propose using human brain activity not as a training signal but as a coordinate system for reading and steering LLM states.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Interpretability methods for large language models (LLMs) typically derive directions from textual supervision, which can lack external grounding. We propose using human brain activity not as a training signal but as a coordinate system for reading and steering LLM states. Using the SMN4Lang MEG dataset, we construct a word-level brain atlas of phase-locking value (PLV) patterns and extract latent axes via ICA. We validate axes with independent lexica and NER-based labels (POS/log-frequency used as sanity checks), then train lightweight adapters that map LLM hidden states to these brain axes without fine-tuning the LLM. Steering along the resulting brain-derived directions yields a robust lexical (frequency-linked) axis in a mid TinyLlama layer, surviving perplexity-matched controls, and a brain-vs-text probe comparison shows larger log-frequency shifts (relative to the text probe) with lower perplexity for the brain axis. A function/content axis (axis 13) shows consistent steering in TinyLlama, Qwen2-0.5B, and GPT-2, with PPL-matched text-level corroboration. Layer-4 effects in TinyLlama are large but inconsistent, so we treat them as secondary (Appendix). Axis structure is stable when the atlas is rebuilt without GPT embedding-change features or with word2vec embeddings (|r|=0.64-0.95 across matched axes), reducing circularity concerns. Exploratory fMRI anchoring suggests potential alignment for embedding change and log frequency, but effects are sensitive to hemodynamic modeling assumptions and are treated as population-level evidence only. These results support a new interface: neurophysiology-grounded axes provide interpretable and controllable handles for LLM behavior.

Related papers

NeuroMambaLLM: Dynamic Graph Learning of fMRI Functional Connectivity in Autistic Brains Using Mamba and Language Model Reasoning [0.0]
We propose NeuroMambaLLM, an end-to-end framework that integrates dynamic latent graph learning and selective state-space temporal modelling with Large Language Models (LLMs)<n>The proposed method learns the functional connectivity dynamically from raw Blood-Oxygen-Level-Dependent (BOLD) time series, replacing fixed correlation graphs with adaptive latent connectivity while suppressing motion-related artifacts and capturing long-range temporal dependencies.<n>This design enables the LLM to perform both diagnostic classification and language-based reasoning, allowing it to analyze dynamic fMRI patterns and generate clinically meaningful textual reports.
arXiv Detail & Related papers (2026-02-14T13:32:59Z)
MotionTeller: Multi-modal Integration of Wearable Time-Series with LLMs for Health and Behavioral Understanding [4.158479111055355]
MotionTeller is a generative framework that integrates minute-level wearable activity data with large language models (LLMs)<n>We construct a novel dataset of 54383 (actigraphy, text) pairs derived from real-world NHANES recordings, and train the model using cross-entropy loss with supervision only on the language tokens.<n>MotionTeller achieves high semantic fidelity (BERT-F1 = 0.924) and lexical accuracy (ROUGE-1 = 0.722), outperforming prompt-based baselines by 7 percent in ROUGE-1.
arXiv Detail & Related papers (2025-12-25T04:37:07Z)
Hard vs. Noise: Resolving Hard-Noisy Sample Confusion in Recommender Systems via Large Language Models [4.7341002297388295]
Implicit feedback, employed in training recommender systems, unavoidably confronts noise due to factors such as misclicks and position bias.<n>Previous studies have attempted to identify noisy samples through their diverged data patterns, such as higher loss values.<n>We observed that noisy samples and hard samples display similar patterns, leading to hard-noisy confusion issue.
arXiv Detail & Related papers (2025-11-10T16:51:03Z)
LLMs Can Get "Brain Rot"! [68.08198331505695]
Continual exposure to junk web text induces lasting cognitive decline in large language models (LLMs)<n>We run controlled experiments on real Twitter/X corpora, constructing junk and reversely controlled datasets.<n>Results provide significant, multi-perspective evidence that data quality is a causal driver of LLM capability decay.
arXiv Detail & Related papers (2025-10-15T13:28:49Z)
SciML Agents: Write the Solver, Not the Solution [69.5021018644143]
We introduce two new datasets: a diagnostic dataset of adversarial "misleading" problems; and a large-scale benchmark of 1,000 diverse ODE tasks.<n>We evaluate open- and closed-source LLM models along two axes: (i) unguided versus guided prompting with domain-specific knowledge; and (ii) off-the-shelf versus fine-tuned variants.<n>Preliminary results indicate that careful prompting and fine-tuning can yield a specialized LLM agent capable of reliably solving simple ODE problems.
arXiv Detail & Related papers (2025-09-12T02:53:57Z)
GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks.<n>During inference, GrAInS hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale.<n>It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z)
Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization [17.101290138120564]
Current methods rely on dictionary learning with sparse autoencoders (SAEs)<n>Here, we tackle these limitations by directly decomposing activations with semi-nonnegative matrix factorization (SNMF)<n>Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering.
arXiv Detail & Related papers (2025-06-12T17:33:29Z)
Probing Neural Topology of Large Language Models [12.298921317333452]
We introduce graph probing, a method for uncovering the functional connectivity of large language models.<n>By probing models across diverse LLM families and scales, we discover a universal predictability of next-token prediction performance.<n>Strikingly, probing on topology outperforms probing on activation by up to 130.4%.
arXiv Detail & Related papers (2025-06-01T14:57:03Z)
MID-L: Matrix-Interpolated Dropout Layer with Layer-wise Neuron Selection [0.0]
Matrix-Interpolated Dropout Layer (MID-L) dynamically selects and activates only the most informative neurons.<n>Experiments on six benchmarks, including MNIST, CIFAR-10, CIFAR-100, SVHN, UCI Adult, and IMDB, show that MID-L achieves up to average 55% reduction in active neurons.
arXiv Detail & Related papers (2025-05-16T16:29:19Z)
Steer LLM Latents for Hallucination Detection [29.967245405976072]
We propose a steering vector that reshapes the representation space during inference to separate truthful and hallucinated outputs.<n>Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters.<n>It then augments the exemplar set with unlabeled LLM generations, employing an optimal transport-based algorithm for pseudo-labeling combined with a confidence-based filtering process.
arXiv Detail & Related papers (2025-03-01T19:19:34Z)
What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length [61.71625297655583]
We show that MORCELA outperforms a commonly used linking theory for acceptability.<n>Larger models require a lower relative degree of adjustment for unigram frequency.<n>Our analysis shows that larger LMs' lower susceptibility to frequency effects can be explained by an ability to better predict rarer words in context.
arXiv Detail & Related papers (2024-11-04T19:05:49Z)
Learning a Fast Mixing Exogenous Block MDP using a Single Trajectory [87.62730694973696]
STEEL is the first provably sample-efficient algorithm for learning the controllable dynamics of an Exogenous Block Markov Decision Process from a single trajectory.<n>We prove that STEEL is correct and sample-efficient, and demonstrate STEEL on two toy problems.
arXiv Detail & Related papers (2024-10-03T21:57:21Z)
Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing [63.20133320524577]
We show that editing a small subset of parameters can effectively modulate specific behaviors of large language models (LLMs)<n>Our approach achieves reductions of up to 90.0% in toxicity on the RealToxicityPrompts dataset and 49.2% on ToxiGen.
arXiv Detail & Related papers (2024-07-11T17:52:03Z)
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.