Related papers: LLM Assertiveness can be Mechanistically Decomposed into Emotional and Logical Components

LLM Assertiveness can be Mechanistically Decomposed into Emotional and Logical Components

URL: http://arxiv.org/abs/2508.17182v2
Date: Sun, 31 Aug 2025 21:27:41 GMT
Title: LLM Assertiveness can be Mechanistically Decomposed into Emotional and Logical Components
Authors: Hikaru Tsujimura, Arush Tagade,
Abstract summary: Large Language Models (LLMs) often display overconfidence, presenting information with unwarranted certainty in high-stakes contexts.<n>We use open-sourced Llama 3.2 models fine-tuned on human annotated assertiveness datasets.<n>Our analysis identifies layers most sensitive to assertiveness contrasts and reveals that high-assertive representations decompose into two sub-components of emotional and logical clusters.
Score: 0.17188280334580197
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) often display overconfidence, presenting information with unwarranted certainty in high-stakes contexts. We investigate the internal basis of this behavior via mechanistic interpretability. Using open-sourced Llama 3.2 models fine-tuned on human annotated assertiveness datasets, we extract residual activations across all layers, and compute similarity metrics to localize assertive representations. Our analysis identifies layers most sensitive to assertiveness contrasts and reveals that high-assertive representations decompose into two orthogonal sub-components of emotional and logical clusters-paralleling the dual-route Elaboration Likelihood Model in Psychology. Steering vectors derived from these sub-components show distinct causal effects: emotional vectors broadly influence prediction accuracy, while logical vectors exert more localized effects. These findings provide mechanistic evidence for the multi-component structure of LLM assertiveness and highlight avenues for mitigating overconfident behavior.

Related papers

Emergent Structured Representations Support Flexible In-Context Inference in Large Language Models [77.98801218316505]
Large language models (LLMs) exhibit emergent behaviors suggestive of human-like reasoning.<n>We investigate the internal processing of LLMs during in-context concept inference.
arXiv Detail & Related papers (2026-02-08T03:14:39Z)
Interpreting and Controlling LLM Reasoning through Integrated Policy Gradient [27.26870804635122]
Large language models (LLMs) demonstrate strong reasoning abilities in solving complex real-world problems.<n> internal mechanisms driving these complex reasoning behaviors remain opaque.<n>We propose Integrated Policy Gradient (IPG), a novel framework that attributes reasoning behaviors to model's inner components.
arXiv Detail & Related papers (2026-02-02T16:43:09Z)
E^2-LLM: Bridging Neural Signals and Interpretable Affective Analysis [54.763420895859035]
We present ELLM2-EEG-to-Emotion Large Language Model, first MLLM framework for interpretable emotion analysis from EEG.<n>ELLM integrates a pretrained EEG encoder with Q-based LLMs through learnable projection layers, employing a multi-stage training pipeline.<n>Experiments on the dataset across seven emotion categories demonstrate that ELLM2-EEG-to-Emotion Large Language Model achieves excellent performance on emotion classification.
arXiv Detail & Related papers (2026-01-11T13:21:20Z)
Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering is a probabilistic framework that disentangles physical affordance from semantic execution.<n> RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z)
DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage.<n>We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps.<n>SLIM is the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z)
How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects [6.503236297532475]
Humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity.<n>We show that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity.<n>Using steering vectors, we demonstrate that plausibility vectors can causally bias validity judgements, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models.
arXiv Detail & Related papers (2025-10-08T06:48:08Z)
Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood.<n>We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z)
Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization [17.101290138120564]
Current methods rely on dictionary learning with sparse autoencoders (SAEs)<n>Here, we tackle these limitations by directly decomposing activations with semi-nonnegative matrix factorization (SNMF)<n>Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering.
arXiv Detail & Related papers (2025-06-12T17:33:29Z)
How do Transformers Learn Implicit Reasoning? [67.02072851088637]
We study how implicit multi-hop reasoning emerges by training transformers from scratch in a controlled symbolic environment.<n>We find that training with atomic triples is not necessary but accelerates learning, and that second-hop generalization relies on query-level exposure to specific compositional structures.
arXiv Detail & Related papers (2025-05-29T17:02:49Z)
Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling [56.26834106704781]
Factual incorrectness in generated content is one of the primary concerns in ubiquitous deployment of large language models (LLMs)<n>We provide evidence supporting the presence of LLMs' internal compass that dictate the correctness of factual recall at the time of generation.<n>Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers.
arXiv Detail & Related papers (2025-05-27T16:24:02Z)
Hallucination Detection in LLMs with Topological Divergence on Attention Graphs [64.74977204942199]
Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models.<n>We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting.
arXiv Detail & Related papers (2025-04-14T10:06:27Z)
Integration of Explainable AI Techniques with Large Language Models for Enhanced Interpretability for Sentiment Analysis [0.5120567378386615]
Interpretability remains a key difficulty in sentiment analysis with Large Language Models (LLMs)<n>This research introduces a technique that applies SHAP (Shapley Additive Explanations) by breaking down LLMs into components such as embedding layer,encoder,decoder and attention layer.<n>The method is evaluated using the Stanford Sentiment Treebank (SST-2) dataset, which shows how different sentences affect different layers.
arXiv Detail & Related papers (2025-03-15T01:37:54Z)
Calibrating Reasoning in Language Models with Internal Consistency [18.24350001344488]
Large language models (LLMs) have demonstrated impressive capabilities in various reasoning tasks.<n>LLMs often generate text with obvious mistakes and contradictions.<n>In this work, we investigate reasoning in LLMs through the lens of internal representations.
arXiv Detail & Related papers (2024-05-29T02:44:12Z)
Quantifying In-Context Reasoning Effects and Memorization Effects in LLMs [101.51435599249234]
We propose an axiomatic system to define and quantify the precise memorization and in-context reasoning effects used by the large language model (LLM) Specifically, the axiomatic system enables us to categorize the memorization effects into foundational memorization effects and chaotic memorization effects. Experiments show that the clear disentanglement of memorization effects and in-context reasoning effects enables a straightforward examination of detailed inference patterns encoded by LLMs.
arXiv Detail & Related papers (2024-05-20T08:51:03Z)
Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI have made it possible to mitigate limitations by leveraging improved explanations for Transformers. We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models. Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
arXiv Detail & Related papers (2024-05-10T17:11:31Z)
Estimating the Causal Effects of Natural Logic Features in Transformer-Based NLI Models [16.328341121232484]
We apply causal effect estimation strategies to measure the effect of context interventions. We investigate robustness to irrelevant changes and sensitivity to impactful changes of Transformers.
arXiv Detail & Related papers (2024-04-03T10:22:35Z)
Counterfactual Reasoning for Out-of-distribution Multimodal Sentiment Analysis [56.84237932819403]
This paper aims to estimate and mitigate the bad effect of textual modality for strong OOD generalization. Inspired by this, we devise a model-agnostic counterfactual framework for multimodal sentiment analysis.
arXiv Detail & Related papers (2022-07-24T03:57:40Z)
Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective [72.55093886515824]
We introduce a causal formalism of motion forecasting, which casts the problem as a dynamic process with three groups of latent variables. We devise a modular architecture that factorizes the representations of invariant mechanisms and style confounders to approximate a causal graph. Experiment results on synthetic and real datasets show that our three proposed components significantly improve the robustness and reusability of the learned motion representations.
arXiv Detail & Related papers (2021-11-29T18:59:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.