Explainability for Large Language Models: A Survey
- URL: http://arxiv.org/abs/2309.01029v3
- Date: Tue, 28 Nov 2023 19:04:45 GMT
- Title: Explainability for Large Language Models: A Survey
- Authors: Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi
Cai, Shuaiqiang Wang, Dawei Yin, Mengnan Du
- Abstract summary: Large language models (LLMs) have demonstrated impressive capabilities in natural language processing.
This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
- Score: 59.67574757137078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated impressive capabilities in
natural language processing. However, their internal mechanisms are still
unclear and this lack of transparency poses unwanted risks for downstream
applications. Therefore, understanding and explaining these models is crucial
for elucidating their behaviors, limitations, and social impacts. In this
paper, we introduce a taxonomy of explainability techniques and provide a
structured overview of methods for explaining Transformer-based language
models. We categorize techniques based on the training paradigms of LLMs:
traditional fine-tuning-based paradigm and prompting-based paradigm. For each
paradigm, we summarize the goals and dominant approaches for generating local
explanations of individual predictions and global explanations of overall model
knowledge. We also discuss metrics for evaluating generated explanations, and
discuss how explanations can be leveraged to debug models and improve
performance. Lastly, we examine key challenges and emerging opportunities for
explanation techniques in the era of LLMs in comparison to conventional machine
learning models.
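As a concrete illustration of the "local explanation" category described above, the sketch below computes a simple gradient x input attribution for one prediction of a Transformer classifier. It is a minimal example under stated assumptions, not a method taken from the survey itself: the HuggingFace model name, the example sentence, and the choice of attribution rule are illustrative.

```python
# Minimal sketch: gradient x input saliency for a single prediction of a
# Transformer classifier. Assumes `torch` and HuggingFace `transformers`;
# the model name and input sentence are illustrative choices only.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text = "The movie was surprisingly good."
enc = tokenizer(text, return_tensors="pt")

# Embed the tokens ourselves (as a detached leaf tensor) so gradients can be
# taken with respect to the input embeddings rather than discrete token ids.
embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)

logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
pred = logits.argmax(dim=-1).item()
logits[0, pred].backward()  # gradient of the predicted class score

# Gradient x input, summed over the embedding dimension, yields one
# attribution score per token for this individual prediction.
scores = (embeds.grad * embeds).sum(dim=-1).squeeze(0)
for tok, s in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), scores.tolist()):
    print(f"{tok:>12s} {s:+.4f}")
```

Richer local methods covered by the survey (perturbation-based, attention-based, and other gradient-based attributions) follow the same pattern of assigning per-token scores to a single prediction.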
Related papers
- DEAL: Disentangle and Localize Concept-level Explanations for VLMs [10.397502254316645]
Large pre-trained Vision-Language Models might not be able to identify fine-grained concepts.
We propose to DisEntangle And Localize (DEAL) concept-level explanations for VLMs without relying on human annotations.
Our empirical results demonstrate that the proposed method significantly improves the concept-level explanations of the model in terms of disentanglability and localizability.
arXiv Detail & Related papers (2024-07-19T15:39:19Z) - Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI make it possible to mitigate the limited transparency of Transformer models by leveraging improved explanation methods.
We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models.
Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
arXiv Detail & Related papers (2024-05-10T17:11:31Z) - From Understanding to Utilization: A Survey on Explainability for Large
Language Models [27.295767173801426]
This survey underscores the imperative for increased explainability in Large Language Models (LLMs).
Our focus is primarily on pre-trained Transformer-based LLMs, which pose distinctive interpretability challenges due to their scale and complexity.
When considering the utilization of explainability, we explore several compelling methods that concentrate on model editing, controllable generation, and model enhancement.
arXiv Detail & Related papers (2024-01-23T16:09:53Z) - Explanation-aware Soft Ensemble Empowers Large Language Model In-context
Learning [50.00090601424348]
Large language models (LLMs) have shown remarkable capabilities in various natural language understanding tasks.
We propose EASE, an Explanation-Aware Soft Ensemble framework that empowers in-context learning with LLMs; a schematic sketch of the underlying soft-voting idea appears after this list.
arXiv Detail & Related papers (2023-11-13T06:13:38Z) - Explaining Large Language Model-Based Neural Semantic Parsers (Student
Abstract) [0.0]
Large language models (LLMs) have demonstrated strong capability in structured prediction tasks such as semantic parsing.
Our work studies different methods for explaining the behavior of an LLM-based semantic parser.
We hope to inspire future research toward a better understanding of such models.
arXiv Detail & Related papers (2023-01-25T16:12:43Z) - ExSum: From Local Explanations to Model Understanding [6.23934576145261]
Interpretability methods are developed to understand the working mechanisms of black-box models.
Fulfilling this goal requires both that the explanations generated by these methods are correct and that people can easily and reliably understand them.
We introduce explanation summary (ExSum), a mathematical framework for quantifying model understanding.
arXiv Detail & Related papers (2022-04-30T02:07:20Z) - Explainability in Process Outcome Prediction: Guidelines to Obtain
Interpretable and Faithful Models [77.34726150561087]
In the field of process outcome prediction, we define explainability through the interpretability of the explanations and the faithfulness of the explainability model.
This paper contributes a set of guidelines named X-MOP which allows selecting the appropriate model based on the event log specifications.
arXiv Detail & Related papers (2022-03-30T05:59:50Z) - Beyond Trivial Counterfactual Explanations with Diverse Valuable
Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z) - Towards Interpretable Natural Language Understanding with Explanations
as Latent Variables [146.83882632854485]
We develop a framework for interpretable natural language understanding that requires only a small set of human annotated explanations for training.
Our framework treats natural language explanations as latent variables that model the underlying reasoning process of a neural model.
arXiv Detail & Related papers (2020-10-24T02:05:56Z)
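The following is a hypothetical sketch of the general idea behind explanation-aware soft ensembling referenced in the EASE entry above: sample several (rationale, answer) pairs, weight each answer by a softmax over rationale scores, and return the weighted majority. It is not the authors' EASE algorithm; `sample_rationale_and_answer` and the scoring rule are stand-in assumptions for real LLM calls and explanation-quality estimates.

```python
# Hypothetical sketch of explanation-weighted soft voting, in the spirit of
# (but not identical to) EASE. The sampling function and scores are stubs.
import math
import random
from collections import defaultdict

def sample_rationale_and_answer(question: str) -> tuple[str, str, float]:
    """Stub for one LLM sample: returns (rationale, answer, rationale score)."""
    answer = random.choice(["yes", "no"])
    return f"because ... {answer}", answer, -random.uniform(1.0, 5.0)

def soft_ensemble(question: str, n_samples: int = 8, temperature: float = 1.0) -> str:
    """Weight each sampled answer by softmax(rationale score) rather than a hard majority vote."""
    samples = [sample_rationale_and_answer(question) for _ in range(n_samples)]
    # Softmax over rationale scores -> one weight per sampled explanation.
    max_score = max(s for _, _, s in samples)
    weights = [math.exp((s - max_score) / temperature) for _, _, s in samples]
    total = sum(weights)
    votes: dict[str, float] = defaultdict(float)
    for (_, answer, _), w in zip(samples, weights):
        votes[answer] += w / total
    return max(votes, key=votes.get)

print(soft_ensemble("Is the review positive?"))
```

Compared to a hard majority vote over sampled chains of thought, the soft weights let low-quality rationales contribute less to the final answer.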