Interpreting Pretrained Language Models via Concept Bottlenecks
- URL: http://arxiv.org/abs/2311.05014v1
- Date: Wed, 8 Nov 2023 20:41:18 GMT
- Title: Interpreting Pretrained Language Models via Concept Bottlenecks
- Authors: Zhen Tan, Lu Cheng, Song Wang, Yuan Bo, Jundong Li and Huan Liu
- Abstract summary: Pretrained language models (PLMs) have made significant strides in various natural language processing tasks.
The lack of interpretability due to their ``black-box'' nature poses challenges for responsible implementation.
We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
- Score: 55.47515772358389
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained language models (PLMs) have made significant strides in various
natural language processing tasks. However, the lack of interpretability due to
their ``black-box'' nature poses challenges for responsible implementation.
Although previous studies have attempted to improve interpretability by using,
e.g., attention weights in self-attention layers, these weights often lack
clarity, readability, and intuitiveness. In this research, we propose a novel
approach to interpreting PLMs by employing high-level, meaningful concepts that
are easily understandable for humans. For example, we learn the concept of
``Food'' and investigate how it influences the prediction of a model's
sentiment towards a restaurant review. We introduce C$^3$M, which combines
human-annotated and machine-generated concepts to extract hidden neurons
designed to encapsulate semantically meaningful and task-specific concepts.
Through empirical evaluations on real-world datasets, we demonstrate that our
approach offers valuable insights to interpret PLM behavior, helps diagnose
model failures, and enhances model robustness amidst noisy concept labels.
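The concept-bottleneck idea described in the abstract is straightforward to prototype: insert a small layer of concept-level neurons between the PLM encoder and the task classifier, supervise those neurons with concept labels, and inspect them at inference time. The sketch below is a hedged PyTorch illustration of that general recipe; the class name, backbone choice, and concept count are placeholders and this is not the authors' C$^3$M code.

```python
# Minimal concept-bottleneck head on top of a pretrained LM
# (hypothetical illustration, not the authors' C^3M implementation).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ConceptBottleneckClassifier(nn.Module):
    def __init__(self, backbone="bert-base-uncased", n_concepts=8, n_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        # Concept layer: each unit is trained against a human- or
        # machine-generated concept label (e.g. "Food", "Service").
        self.concept_head = nn.Linear(hidden, n_concepts)
        # The task head sees only the predicted concepts (the bottleneck).
        self.label_head = nn.Linear(n_concepts, n_labels)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = h.last_hidden_state[:, 0]                     # [CLS] representation
        concepts = torch.sigmoid(self.concept_head(cls))    # interpretable scores in [0, 1]
        logits = self.label_head(concepts)
        return concepts, logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = ConceptBottleneckClassifier()
batch = tokenizer(["The food was great but the service was slow."],
                  return_tensors="pt", padding=True, truncation=True)
concepts, logits = model(batch["input_ids"], batch["attention_mask"])
# `concepts` can be read off directly to see which concepts drove the prediction.
```

During training, one loss term would supervise `concepts` against (possibly noisy) human- or machine-generated concept labels and another would supervise `logits` against the task label, which is the standard concept-bottleneck recipe the paper builds on.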
Related papers
- Self-supervised Interpretable Concept-based Models for Text Classification [9.340843984411137]
This paper proposes self-supervised Interpretable Concept Embedding Models (ICEMs).
We leverage the generalization abilities of Large Language Models to predict the concept labels in a self-supervised way.
ICEMs can be trained in a self-supervised way, achieving similar performance to fully supervised concept-based models and end-to-end black-box ones.
arXiv Detail & Related papers (2024-06-20T14:04:53Z)
- ConcEPT: Concept-Enhanced Pre-Training for Language Models [57.778895980999124]
ConcEPT aims to infuse conceptual knowledge into pre-trained language models.
It exploits external entity concept prediction to predict the concepts of entities mentioned in the pre-training contexts.
Results of experiments show that ConcEPT gains improved conceptual knowledge with concept-enhanced pre-training.
arXiv Detail & Related papers (2024-01-11T05:05:01Z)
- Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention [53.896974148579346]
Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains.
The enigmatic ``black-box'' nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications.
We propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs.
arXiv Detail & Related papers (2023-12-22T19:55:58Z)
- Explainability for Large Language Models: A Survey [59.67574757137078]
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing.
This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
arXiv Detail & Related papers (2023-09-02T22:14:26Z)
- Uncovering Unique Concept Vectors through Latent Space Decomposition [0.0]
Concept-based explanations have emerged as a more interpretable alternative to feature attribution estimates.
We propose a novel post-hoc unsupervised method that automatically uncovers the concepts learned by deep models during training.
Our experiments reveal that the majority of our concepts are readily understandable to humans, exhibit coherency, and bear relevance to the task at hand.
arXiv Detail & Related papers (2023-07-13T17:21:54Z)
- Can LLMs facilitate interpretation of pre-trained language models? [18.77022630961142]
We propose using a large language model, ChatGPT, as an annotator to enable fine-grained interpretation analysis of pre-trained language models.
We discover latent concepts within pre-trained language models by applying agglomerative hierarchical clustering over contextualized representations (a minimal sketch of this clustering step appears after this list).
Our findings demonstrate that ChatGPT produces accurate and semantically richer annotations compared to human-annotated concepts.
arXiv Detail & Related papers (2023-05-22T18:03:13Z)
- GlanceNets: Interpretabile, Leak-proof Concept-based Models [23.7625973884849]
Concept-based models (CBMs) combine high-performance and interpretability by acquiring and reasoning with a vocabulary of high-level concepts.
We provide a clear definition of interpretability in terms of alignment between the model's representation and an underlying data generation process.
We introduce GlanceNets, a new CBM that exploits techniques from disentangled representation learning and open-set recognition to achieve alignment.
arXiv Detail & Related papers (2022-05-31T08:53:53Z)
- A Survey of Knowledge Enhanced Pre-trained Models [28.160826399552462]
We refer to pre-trained language models with knowledge injection as knowledge-enhanced pre-trained language models (KEPLMs).
These models demonstrate deep understanding and logical reasoning and introduce interpretability.
arXiv Detail & Related papers (2021-10-01T08:51:58Z)
- A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics [131.93113552146195]
We present a new dataset, Handwritten arithmetic with INTegers (HINT), to examine machines' capability of learning generalizable concepts.
In HINT, machines are tasked with learning how concepts are perceived from raw signals such as images.
We undertake extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3.
arXiv Detail & Related papers (2021-03-02T01:32:54Z)
- Concept Learners for Few-Shot Learning [76.08585517480807]
We propose COMET, a meta-learning method that improves generalization ability by learning to learn along human-interpretable concept dimensions.
We evaluate our model on few-shot tasks from diverse domains, including fine-grained image classification, document categorization and cell type annotation.
arXiv Detail & Related papers (2020-07-14T22:04:17Z)
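As referenced in the "Can LLMs facilitate interpretation of pre-trained language models?" entry above, latent-concept discovery via agglomerative hierarchical clustering of contextualized representations can be prototyped in a few lines. The sketch below is only an illustration of that clustering step under assumed choices (BERT backbone, last hidden layer, four clusters); it is not the cited paper's exact pipeline.

```python
# Hedged sketch: cluster contextualized token representations to surface
# candidate latent "concepts" (illustrative; not the cited paper's exact setup).
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.cluster import AgglomerativeClustering

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The pasta was delicious.", "Service was slow but friendly."]
tokens, vectors = [], []
with torch.no_grad():
    for s in sentences:
        enc = tokenizer(s, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, hidden_size)
        for tok, vec in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), hidden):
            if tok not in ("[CLS]", "[SEP]"):
                tokens.append(tok)
                vectors.append(vec.numpy())

# Group tokens whose contextual embeddings are close; each cluster is a
# candidate latent concept that a human or LLM annotator could then label.
clustering = AgglomerativeClustering(n_clusters=4).fit(vectors)
for cluster_id in range(4):
    members = [t for t, c in zip(tokens, clustering.labels_) if c == cluster_id]
    print(f"concept cluster {cluster_id}: {members}")
```

In the cited work, the resulting clusters are handed to an annotator (ChatGPT in their experiments) to receive human-readable concept names; here the clustering output is simply printed for inspection.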