Explainability for Large Language Models: A Survey
- URL: http://arxiv.org/abs/2309.01029v3
- Date: Tue, 28 Nov 2023 19:04:45 GMT
- Title: Explainability for Large Language Models: A Survey
- Authors: Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi
Cai, Shuaiqiang Wang, Dawei Yin, Mengnan Du
- Abstract summary: Large language models (LLMs) have demonstrated impressive capabilities in natural language processing.
This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
- Score: 59.67574757137078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated impressive capabilities in
natural language processing. However, their internal mechanisms are still
unclear and this lack of transparency poses unwanted risks for downstream
applications. Therefore, understanding and explaining these models is crucial
for elucidating their behaviors, limitations, and social impacts. In this
paper, we introduce a taxonomy of explainability techniques and provide a
structured overview of methods for explaining Transformer-based language
models. We categorize techniques based on the training paradigms of LLMs:
traditional fine-tuning-based paradigm and prompting-based paradigm. For each
paradigm, we summarize the goals and dominant approaches for generating local
explanations of individual predictions and global explanations of overall model
knowledge. We also discuss metrics for evaluating generated explanations, and
discuss how explanations can be leveraged to debug models and improve
performance. Lastly, we examine key challenges and emerging opportunities for
explanation techniques in the era of LLMs in comparison to conventional machine
learning models.
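As a concrete illustration of the "local explanation" category described above, the sketch below computes a simple gradient x input attribution for one prediction of a Transformer classifier. It is a minimal example under stated assumptions, not a method taken from the survey itself: the HuggingFace model name, the example sentence, and the choice of attribution rule are illustrative.

```python
# Minimal sketch: gradient x input saliency for a single prediction of a
# Transformer classifier. Assumes `torch` and HuggingFace `transformers`;
# the model name and input sentence are illustrative choices only.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text = "The movie was surprisingly good."
enc = tokenizer(text, return_tensors="pt")

# Embed the tokens ourselves (as a detached leaf tensor) so gradients can be
# taken with respect to the input embeddings rather than discrete token ids.
embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)

logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
pred = logits.argmax(dim=-1).item()
logits[0, pred].backward()  # gradient of the predicted class score

# Gradient x input, summed over the embedding dimension, yields one
# attribution score per token for this individual prediction.
scores = (embeds.grad * embeds).sum(dim=-1).squeeze(0)
for tok, s in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), scores.tolist()):
    print(f"{tok:>12s} {s:+.4f}")
```

Richer local methods covered by the survey (perturbation-based, attention-based, and other gradient-based attributions) follow the same pattern of assigning per-token scores to a single prediction.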
Related papers
- DEAL: Disentangle and Localize Concept-level Explanations for VLMs [10.397502254316645]
Large pre-trained Vision-Language Models might not be able to identify fine-grained concepts.
We propose to DisEntangle And Localize (DEAL) concept-level explanations for VLMs without relying on human annotations.
Our empirical results demonstrate that the proposed method significantly improves the concept-level explanations of the model in terms of disentanglability and localizability.
arXiv Detail & Related papers (2024-07-19T15:39:19Z) - Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI make it possible to mitigate the limited transparency of Transformer models by leveraging improved explanation methods.
We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models.
Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
arXiv Detail & Related papers (2024-05-10T17:11:31Z) - From Understanding to Utilization: A Survey on Explainability for Large
Language Models [27.295767173801426]
This survey underscores the imperative for increased explainability in Large Language Models (LLMs).
Our focus is primarily on pre-trained Transformer-based LLMs, which pose distinctive interpretability challenges due to their scale and complexity.
When considering the utilization of explainability, we explore several compelling methods that concentrate on model editing, controllable generation, and model enhancement.
arXiv Detail & Related papers (2024-01-23T16:09:53Z) - Explanation-aware Soft Ensemble Empowers Large Language Model In-context
Learning [50.00090601424348]
Large language models (LLMs) have shown remarkable capabilities in various natural language understanding tasks.
We propose EASE, an Explanation-Aware Soft Ensemble framework that empowers in-context learning with LLMs; a schematic sketch of the underlying soft-voting idea appears after this list.
arXiv Detail & Related papers (2023-11-13T06:13:38Z) - Explaining Large Language Model-Based Neural Semantic Parsers (Student
Abstract) [0.0]
Large language models (LLMs) have demonstrated strong capability in structured prediction tasks such as semantic parsing.
Our work studies different methods for explaining the behavior of an LLM-based semantic parser.
We hope to inspire future research toward a better understanding of such models.
arXiv Detail & Related papers (2023-01-25T16:12:43Z) - ExSum: From Local Explanations to Model Understanding [6.23934576145261]
Interpretability methods are developed to understand the working mechanisms of black-box models.
Fulfilling this goal requires both that the explanations generated by these methods are correct and that people can easily and reliably understand them.
We introduce explanation summary (ExSum), a mathematical framework for quantifying model understanding.
arXiv Detail & Related papers (2022-04-30T02:07:20Z) - Explainability in Process Outcome Prediction: Guidelines to Obtain
Interpretable and Faithful Models [77.34726150561087]
In the field of process outcome prediction, we define explainability through the interpretability of the explanations and the faithfulness of the explainability model.
This paper contributes a set of guidelines named X-MOP which allows selecting the appropriate model based on the event log specifications.
arXiv Detail & Related papers (2022-03-30T05:59:50Z) - Beyond Trivial Counterfactual Explanations with Diverse Valuable
Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z) - Towards Interpretable Natural Language Understanding with Explanations
as Latent Variables [146.83882632854485]
We develop a framework for interpretable natural language understanding that requires only a small set of human annotated explanations for training.
Our framework treats natural language explanations as latent variables that model the underlying reasoning process of a neural model.
arXiv Detail & Related papers (2020-10-24T02:05:56Z)
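The following is a hypothetical sketch of the general idea behind explanation-aware soft ensembling referenced in the EASE entry above: sample several (rationale, answer) pairs, weight each answer by a softmax over rationale scores, and return the weighted majority. It is not the authors' EASE algorithm; `sample_rationale_and_answer` and the scoring rule are stand-in assumptions for real LLM calls and explanation-quality estimates.

```python
# Hypothetical sketch of explanation-weighted soft voting, in the spirit of
# (but not identical to) EASE. The sampling function and scores are stubs.
import math
import random
from collections import defaultdict

def sample_rationale_and_answer(question: str) -> tuple[str, str, float]:
    """Stub for one LLM sample: returns (rationale, answer, rationale score)."""
    answer = random.choice(["yes", "no"])
    return f"because ... {answer}", answer, -random.uniform(1.0, 5.0)

def soft_ensemble(question: str, n_samples: int = 8, temperature: float = 1.0) -> str:
    """Weight each sampled answer by softmax(rationale score) rather than a hard majority vote."""
    samples = [sample_rationale_and_answer(question) for _ in range(n_samples)]
    # Softmax over rationale scores -> one weight per sampled explanation.
    max_score = max(s for _, _, s in samples)
    weights = [math.exp((s - max_score) / temperature) for _, _, s in samples]
    total = sum(weights)
    votes: dict[str, float] = defaultdict(float)
    for (_, answer, _), w in zip(samples, weights):
        votes[answer] += w / total
    return max(votes, key=votes.get)

print(soft_ensemble("Is the review positive?"))
```

Compared to a hard majority vote over sampled chains of thought, the soft weights let low-quality rationales contribute less to the final answer.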