Because we have LLMs, we Can and Should Pursue Agentic Interpretability
- URL: http://arxiv.org/abs/2506.12152v1
- Date: Fri, 13 Jun 2025 18:13:58 GMT
- Title: Because we have LLMs, we Can and Should Pursue Agentic Interpretability
- Authors: Been Kim, John Hewitt, Neel Nanda, Noah Fiedel, Oyvind Tafjord
- Abstract summary: Large Language Models (LLMs) can proactively assist human understanding by developing and leveraging a mental model of the user. Agentic interpretability introduces challenges, particularly in evaluation, due to what we call its `human-entangled-in-the-loop' nature. Agentic interpretability's promise is to help humans learn the potentially superhuman concepts of the LLMs, rather than see us fall increasingly far from understanding them.
- Score: 22.10895793309226
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The era of Large Language Models (LLMs) presents a new opportunity for interpretability--agentic interpretability: a multi-turn conversation with an LLM wherein the LLM proactively assists human understanding by developing and leveraging a mental model of the user, which in turn enables humans to develop better mental models of the LLM. Such conversation is a new capability that traditional `inspective' interpretability methods (opening the black-box) do not use. Having a language model that aims to teach and explain--beyond just knowing how to talk--is similar to a teacher whose goal is to teach well, understanding that their success will be measured by the student's comprehension. While agentic interpretability may trade off completeness for interactivity, making it less suitable for high-stakes safety situations with potentially deceptive models, it leverages a cooperative model to discover potentially superhuman concepts that can improve humans' mental model of machines. Agentic interpretability introduces challenges, particularly in evaluation, due to its `human-entangled-in-the-loop' nature (human responses are an integral part of the algorithm), which makes design and evaluation difficult. We discuss possible solutions and proxy goals. As LLMs approach human parity in many tasks, agentic interpretability's promise is to help humans learn the potentially superhuman concepts of the LLMs, rather than see us fall increasingly far from understanding them.
Related papers
- Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation [0.0]
We investigate the theory of mind in Large Language Models (LLMs) through the lens of cooperative multi-agent reinforcement learning (MARL). Our approach aims to enhance artificial agents' ability to adapt and cooperate with both artificial and human partners.
arXiv Detail & Related papers (2025-06-11T02:12:34Z) - How Deep is Love in LLMs' Hearts? Exploring Semantic Size in Human-like Cognition [75.11808682808065]
This study investigates whether large language models (LLMs) exhibit similar tendencies in understanding semantic size. Our findings reveal that multi-modal training is crucial for LLMs to achieve more human-like understanding. Lastly, we examine whether LLMs are influenced by attention-grabbing headlines with larger semantic sizes in a real-world web shopping scenario.
arXiv Detail & Related papers (2025-03-01T03:35:56Z) - Non-literal Understanding of Number Words by Language Models [33.24263583093367]
Humans naturally interpret numbers non-literally, combining context, world knowledge, and speaker intent. We investigate whether large language models (LLMs) interpret numbers similarly, focusing on hyperbole and pragmatic halo effects.
arXiv Detail & Related papers (2025-02-10T07:03:00Z) - Tuning-Free Accountable Intervention for LLM Deployment -- A Metacognitive Approach [55.613461060997004]
Large Language Models (LLMs) have catalyzed transformative advances across a spectrum of natural language processing tasks.
We propose an innovative metacognitive approach, dubbed CLEAR, to equip LLMs with capabilities for self-aware error identification and correction.
arXiv Detail & Related papers (2024-03-08T19:18:53Z) - Should We Fear Large Language Models? A Structural Analysis of the Human Reasoning System for Elucidating LLM Capabilities and Risks Through the Lens of Heidegger's Philosophy [0.0]
This study investigates the capabilities and risks of Large Language Models (LLMs). It draws innovative parallels between the statistical patterns of word relationships within LLMs and Martin Heidegger's concepts of "ready-to-hand" and "present-at-hand".
Our findings reveal that while LLMs possess the capability for Direct Explicative Reasoning and Pseudo Rational Reasoning, they fall short in authentic rational reasoning and have no creative reasoning capabilities.
arXiv Detail & Related papers (2024-03-05T19:40:53Z) - Towards Concept-Aware Large Language Models [56.48016300758356]
Concepts play a pivotal role in various human cognitive functions, including learning, reasoning and communication.
There is very little work on endowing machines with the ability to form and reason with concepts.
In this work, we analyze how well contemporary large language models (LLMs) capture human concepts and their structure.
arXiv Detail & Related papers (2023-11-03T12:19:22Z) - Large Language Models: The Need for Nuance in Current Debates and a Pragmatic Perspective on Understanding [1.3654846342364308]
Large Language Models (LLMs) are unparalleled in their ability to generate grammatically correct, fluent text.
This position paper critically assesses three points recurring in critiques of LLM capacities.
We outline a pragmatic perspective on the issue of `real' understanding and intentionality in LLMs.
arXiv Detail & Related papers (2023-10-30T15:51:04Z) - Democratizing Reasoning Ability: Tailored Learning from Large Language Model [97.4921006089966]
We propose a tailored learning approach to distill such reasoning ability to smaller LMs.
We exploit the potential of LLM as a reasoning teacher by building an interactive multi-round learning paradigm.
To exploit the reasoning potential of the smaller LM, we propose self-reflection learning to motivate the student to learn from self-made mistakes.
arXiv Detail & Related papers (2023-10-20T07:50:10Z) - Violation of Expectation via Metacognitive Prompting Reduces Theory of Mind Prediction Error in Large Language Models [0.0]
Large Language Models (LLMs) exhibit a compelling level of proficiency in Theory of Mind (ToM) tasks.
This ability to impute unobservable mental states to others is vital to human social cognition and may prove equally important in principal-agent relations between humans and Artificial Intelligences (AIs).
arXiv Detail & Related papers (2023-10-10T20:05:13Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations reveal the comprehensiveness of a language model's grasp of language through its proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - Evaluating Language Models for Mathematics through Interactions [116.67206980096513]
We introduce CheckMate, a prototype platform for humans to interact with and evaluate large language models (LLMs).
We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics.
We derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness.
arXiv Detail & Related papers (2023-06-02T17:12:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.