Neural Transparency: Mechanistic Interpretability Interfaces for Anticipating Model Behaviors for Personalized AI
- URL: http://arxiv.org/abs/2511.00230v1
- Date: Fri, 31 Oct 2025 20:03:52 GMT
- Title: Neural Transparency: Mechanistic Interpretability Interfaces for Anticipating Model Behaviors for Personalized AI
- Authors: Sheer Karny, Anthony Baez, Pat Pataranutaporn,
- Abstract summary: We introduce an interface that enables neural transparency by exposing language model internals during chatbot design. Our approach extracts behavioral trait vectors by computing differences in neural activations between contrastive system prompts that elicit opposing behaviors. This work offers a path for how interpretability can be operationalized for non-technical users, establishing a foundation for safer, more aligned human-AI interactions.
- Score: 9.383958408772694
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Millions of users now design personalized LLM-based chatbots that shape their daily interactions, yet they can only loosely anticipate how their design choices will manifest as behaviors in deployment. This opacity is consequential: seemingly innocuous prompts can trigger excessive sycophancy, toxicity, or inconsistency, degrading utility and raising safety concerns. To address this issue, we introduce an interface that enables neural transparency by exposing language model internals during chatbot design. Our approach extracts behavioral trait vectors (empathy, toxicity, sycophancy, etc.) by computing differences in neural activations between contrastive system prompts that elicit opposing behaviors. We predict chatbot behaviors by projecting the system prompt's final token activations onto these trait vectors, normalizing for cross-trait comparability, and visualizing results via an interactive sunburst diagram. To evaluate this approach, we conducted an online user study using Prolific to compare our neural transparency interface against a baseline chatbot interface without any form of transparency. Our analyses suggest that users systematically miscalibrated their expectations of AI behavior: participants misjudged trait activations for eleven of fifteen analyzable traits, motivating the need for transparency tools in everyday human-AI interaction. While our interface did not change design iteration patterns, it significantly increased user trust and was enthusiastically received. Qualitative analysis indicated that users had nuanced experiences with the visualization, which may enrich future work on designing neurally transparent interfaces. This work offers a path for how mechanistic interpretability can be operationalized for non-technical users, establishing a foundation for safer, more aligned human-AI interactions.
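The abstract describes a concrete pipeline: build a trait direction from the activation difference between two contrastive system prompts, then score a new system prompt by projecting its final-token activation onto each trait direction and normalizing across traits. Below is a minimal sketch of that idea, not the authors' released code: the model choice ("gpt2"), the layer read, the contrastive prompts, and the max-abs normalization are all illustrative assumptions.

```python
# Hedged sketch of the contrastive trait-vector approach from the abstract.
# Model, layer, prompts, and normalization scheme are assumptions, not the
# paper's actual implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder model; the paper does not name one here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

LAYER = -1  # which hidden layer to read activations from (an assumption)

@torch.no_grad()
def final_token_activation(prompt: str) -> torch.Tensor:
    """Return the hidden state of the prompt's final token at LAYER."""
    inputs = tokenizer(prompt, return_tensors="pt")
    hidden = model(**inputs).hidden_states[LAYER]  # (1, seq_len, d_model)
    return hidden[0, -1]  # final-token activation, shape (d_model,)

def trait_vector(positive_prompt: str, negative_prompt: str) -> torch.Tensor:
    """Trait direction = activation difference between contrastive prompts."""
    return final_token_activation(positive_prompt) - final_token_activation(negative_prompt)

# Contrastive system prompts that elicit opposing behaviors (illustrative).
traits = {
    "sycophancy": trait_vector(
        "You always agree with the user and flatter them.",
        "You give honest, critical feedback even when it is unwelcome.",
    ),
    "empathy": trait_vector(
        "You respond with warmth and deep concern for the user's feelings.",
        "You respond coldly, ignoring the user's emotional state.",
    ),
}

def predict_traits(system_prompt: str) -> dict[str, float]:
    """Project the prompt's final-token activation onto each trait vector,
    then rescale so scores are comparable across traits."""
    act = final_token_activation(system_prompt)
    raw = {name: torch.dot(act, vec) / vec.norm() for name, vec in traits.items()}
    # Max-abs normalization for cross-trait comparability (an assumption;
    # the paper's exact normalization scheme may differ).
    scale = max(abs(v) for v in raw.values()) or 1.0
    return {name: float(v / scale) for name, v in raw.items()}

print(predict_traits("You are a supportive tutor who always praises the student."))
```

The normalized scores would then feed a visualization such as the paper's interactive sunburst diagram, one wedge per trait.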
Related papers
- AI as Teammate or Tool? A Review of Human-AI Interaction in Decision Support [0.514825619161626]
Current AI systems remain largely passive due to an overreliance on explainability-centric designs. Transitioning AI to an active teammate requires adaptive, context-aware interactions.
arXiv Detail & Related papers (2026-01-26T19:18:50Z) - 8bit-GPT: Exploring Human-AI Interaction on Obsolete Macintosh Operating Systems [0.8122270502556375]
8bit-GPT is a language model simulated on a legacy Macintosh Operating System. This work aims to foreground the presence of chatbots as a tool by defamiliarizing the interface and prioritizing inefficient interaction.
arXiv Detail & Related papers (2025-11-07T06:56:04Z) - Evaluating Node-tree Interfaces for AI Explainability [0.5437050212139087]
This study evaluates user experiences with two distinct AI interfaces: node-tree interfaces and chatbots. Our node-tree interface visually structures AI-generated responses into hierarchically organized, interactive nodes. Our findings suggest that AI interfaces capable of switching between structured visualizations and conversational formats can significantly enhance transparency and user confidence in AI-powered systems.
arXiv Detail & Related papers (2025-10-07T20:48:08Z) - Dark Patterns Meet GUI Agents: LLM Agent Susceptibility to Manipulative Interfaces and the Role of Human Oversight [51.53020962098759]
This study examines how agents, human participants, and human-AI teams respond to 16 types of dark patterns across diverse scenarios. Phase 1 highlights that agents often fail to recognize dark patterns and, even when aware, prioritize task completion over protective action. Phase 2 reveals divergent failure modes: humans succumb due to cognitive shortcuts and habitual compliance, while agents falter from procedural blind spots.
arXiv Detail & Related papers (2025-09-12T22:26:31Z) - Interpretability as Alignment: Making Internal Understanding a Design Principle [3.6704226968275253]
Interpretability provides a route to internal transparency by revealing the computations that drive outputs. We argue that interpretability, especially mechanistic approaches, should be treated as a design principle for alignment, not an auxiliary diagnostic tool.
arXiv Detail & Related papers (2025-09-10T13:45:59Z) - Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection [51.52749744031413]
Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions. Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues. We propose a novel training-free HOI detection framework for Dynamic Scoring with enhanced semantics.
arXiv Detail & Related papers (2025-07-23T12:30:19Z) - Visual Agents as Fast and Slow Thinkers [88.1404921693082]
We introduce FaST, which incorporates the Fast and Slow Thinking mechanism into visual agents. FaST employs a switch adapter to dynamically select between System 1/2 modes. It tackles uncertain and unseen objects by adjusting model confidence and integrating new contextual data.
arXiv Detail & Related papers (2024-08-16T17:44:02Z) - Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI).
The experimental results demonstrate that MPI achieves remarkable improvements of 10% to 64% over the previous state-of-the-art on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z) - Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z) - Occlusion-Aware Crowd Navigation Using People as Sensors [8.635930195821263]
Occlusions are highly prevalent in crowd navigation settings due to a limited sensor field of view.
Previous work has shown that observed interactive behaviors of human agents can be used to estimate potential obstacles.
We propose integrating such social inference techniques into the planning pipeline.
arXiv Detail & Related papers (2022-10-02T15:18:32Z) - VIRT: Improving Representation-based Models for Text Matching through Virtual Interaction [50.986371459817256]
We propose a novel Virtual InteRacTion mechanism, termed VIRT, to enable full and deep interaction modeling in representation-based models.
VIRT asks representation-based encoders to conduct virtual interactions to mimic the behaviors as interaction-based models do.
arXiv Detail & Related papers (2021-12-08T09:49:28Z) - Affect-Aware Deep Belief Network Representations for Multimodal Unsupervised Deception Detection [3.04585143845864]
This paper presents a novel affect-aware unsupervised approach based on Deep Belief Networks (DBNs) for detecting real-world, high-stakes deception in videos without requiring labels.
In addition to using facial affect as a feature on which DBN models are trained, we also introduce a DBN training procedure that uses facial affect as an aligner of audio-visual representations.
arXiv Detail & Related papers (2021-08-17T22:07:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.