An Interpretability Evaluation Benchmark for Pre-trained Language Models
- URL: http://arxiv.org/abs/2207.13948v1
- Date: Thu, 28 Jul 2022 08:28:09 GMT
- Title: An Interpretability Evaluation Benchmark for Pre-trained Language Models
- Authors: Yaozong Shen, Lijie Wang, Ying Chen, Xinyan Xiao, Jing Liu, Hua Wu
- Abstract summary: We propose a novel evaluation benchmark providing both English and Chinese annotated data.
It tests LMs' abilities in multiple dimensions, i.e., grammar, semantics, knowledge, reasoning and computation.
It contains perturbed instances for each original instance, so as to use rationale consistency under perturbations as the metric for faithfulness.
- Score: 37.16893581395874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While pre-trained language models (LMs) have brought great improvements in
many NLP tasks, there is increasing attention to exploring the capabilities of LMs
and interpreting their predictions. However, existing works usually focus only on
a certain capability with some downstream tasks. There is a lack of datasets
for directly evaluating the masked word prediction performance and the
interpretability of pre-trained LMs. To fill this gap, we propose a novel
evaluation benchmark providing both English and Chinese annotated data. It
tests LMs' abilities in multiple dimensions, i.e., grammar, semantics,
knowledge, reasoning and computation. In addition, it provides carefully
annotated token-level rationales that satisfy sufficiency and compactness. It
contains perturbed instances for each original instance, so as to use the
rationale consistency under perturbations as the metric for faithfulness, a
perspective of interpretability. We conduct experiments on several widely-used
pre-trained LMs. The results show that they perform very poorly on the
dimensions of knowledge and computation. Moreover, their plausibility in all
dimensions is far from satisfactory, especially when the rationale is short. In
addition, the pre-trained LMs we evaluated are not robust on syntax-aware data.
We will release this evaluation benchmark at http://xyz, and hope it can
facilitate the research progress of pre-trained LMs.
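As a concrete illustration of the faithfulness metric described in the abstract, the snippet below computes rationale consistency between an original instance and its perturbed counterpart. This is a minimal sketch under the assumption that rationales are sets of tokens and that consistency is approximated by token-overlap F1; the paper's exact formulation may differ.

```python
# Minimal sketch (assumed formulation, not necessarily the paper's exact metric):
# faithfulness as the consistency of token-level rationales between an original
# instance and its perturbed counterpart, measured here by token-overlap F1.

def rationale_f1(rationale_a, rationale_b):
    """Token-overlap F1 between two rationales given as token collections."""
    a, b = set(rationale_a), set(rationale_b)
    if not a or not b:
        return 0.0
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)

def rationale_consistency(pairs):
    """Average consistency over (original, perturbed) rationale pairs."""
    scores = [rationale_f1(orig, pert) for orig, pert in pairs]
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    # Hypothetical rationales extracted by a saliency method for one
    # original/perturbed pair; the token strings are purely illustrative.
    pairs = [(["capital", "France", "Paris"], ["capital", "French", "Paris"])]
    print(f"rationale consistency: {rationale_consistency(pairs):.3f}")
```

Under this framing, higher consistency under perturbation indicates a more faithful rationale.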
Related papers
- FELM: Benchmarking Factuality Evaluation of Large Language Models [40.78878196872095]
We introduce a benchmark for Factuality Evaluation of large Language Models, referred to as felm.
We collect responses generated from large language models and annotate factuality labels in a fine-grained manner.
Our findings reveal that while retrieval aids factuality evaluation, current LLMs still fall short of faithfully detecting factual errors.
arXiv Detail & Related papers (2023-10-01T17:37:31Z)
- Making Pre-trained Language Models both Task-solvers and Self-calibrators [52.98858650625623]
Pre-trained language models (PLMs) serve as backbones for various real-world systems.
Previous work shows that introducing an extra calibration task can mitigate the overconfidence of PLMs in their own predictions.
We propose a training algorithm, LM-TOAST, to tackle these challenges.
arXiv Detail & Related papers (2023-07-21T02:51:41Z)
- Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval [51.437420003471615]
We propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch.
RPT improves retrieval quality and subsequently perplexity across the board compared to strong baselines.
arXiv Detail & Related papers (2023-06-23T10:18:02Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Evidence > Intuition: Transferability Estimation for Encoder Selection [16.490047604583882]
We generate quantitative evidence to predict which LM will perform best on a target task without having to fine-tune all candidates.
We adopt the state-of-the-art Logarithm of Maximum Evidence (LogME) measure from Computer Vision (CV) and find that it positively correlates with final LM performance in 94% of setups.
arXiv Detail & Related papers (2022-10-20T13:25:21Z)
- On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
- Evaluating Document Coherence Modelling [37.287725949616934]
We examine the performance of a broad range of pretrained LMs on a sentence intrusion detection task for English.
Our experiments show that pretrained LMs perform impressively in in-domain evaluation, but experience a substantial drop in the cross-domain setting.
arXiv Detail & Related papers (2021-03-18T10:05:06Z)
- oLMpics -- On what Language Model Pre-training Captures [84.60594612120173]
We propose eight reasoning tasks, which require operations such as comparison, conjunction, and composition.
A fundamental challenge is to understand whether the performance of an LM on a task should be attributed to the pre-trained representations or to the process of fine-tuning on the task data.
arXiv Detail & Related papers (2019-12-31T12:11:35Z)
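The zero-shot probing style used by oLMpics, and the masked word prediction setting evaluated by the main benchmark above, can be illustrated with a fill-mask query. The model choice and prompt below are illustrative assumptions, not taken from either paper; the sketch only requires the Hugging Face `transformers` library.

```python
# Minimal sketch (assumed setup): zero-shot masked-word prediction with a
# pre-trained LM, in the spirit of oLMpics-style comparison probes.
from transformers import pipeline

# Illustrative model choice; any masked LM checkpoint could be substituted.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

prompt = "An elephant is [MASK] than a mouse."
for prediction in fill_mask(prompt, top_k=5):
    # Each prediction carries the filled-in token and its probability.
    print(f"{prediction['token_str']:>10}  {prediction['score']:.3f}")
```

Inspecting whether the model prefers "bigger" over "smaller" here is the kind of comparison probe that separates what pre-training captures from what fine-tuning adds.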