On the Challenges of Evaluating Compositional Explanations in Multi-Hop
Inference: Relevance, Completeness, and Expert Ratings
- URL: http://arxiv.org/abs/2109.03334v1
- Date: Tue, 7 Sep 2021 21:00:05 GMT
- Title: On the Challenges of Evaluating Compositional Explanations in Multi-Hop
Inference: Relevance, Completeness, and Expert Ratings
- Authors: Peter Jansen, Kelly Smith, Dan Moreno and Huitzilin Ortiz
- Abstract summary: Building compositional explanations requires models to combine two or more facts that, together, describe why the answer to a question is correct.
In this work, we show that evaluations against gold explanations substantially underestimate model performance, both in terms of the relevance of included facts and the completeness of model-generated explanations.
We build three strong models based on different methodologies (generation, ranking, and schemas), and empirically show that while expert-augmented ratings provide better estimates of explanation quality, both original (gold) and expert-augmented automatic evaluations still substantially underestimate performance by up to 36% when compared with full manual expert judgements.
- Score: 1.7243339961137647
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Building compositional explanations requires models to combine two or more
facts that, together, describe why the answer to a question is correct.
Typically, these "multi-hop" explanations are evaluated relative to one (or a
small number of) gold explanations. In this work, we show these evaluations
substantially underestimate model performance, both in terms of the relevance
of included facts and the completeness of model-generated explanations,
because models regularly discover and produce valid explanations that are
different from the gold explanations. To address this, we construct a large corpus
of 126k domain-expert (science teacher) relevance ratings that augment a corpus
of explanations to standardized science exam questions, discovering 80k
additional relevant facts not rated as gold. We build three strong models based
on different methodologies (generation, ranking, and schemas), and empirically
show that while expert-augmented ratings provide better estimates of
explanation quality, both original (gold) and expert-augmented automatic
evaluations still substantially underestimate performance by up to 36% when
compared with full manual expert judgements, with different models being
disproportionately affected. This poses a significant methodological challenge
to accurately evaluating explanations produced by compositional reasoning
models.
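To make the evaluation gap concrete, here is a minimal sketch (with hypothetical fact IDs, not the paper's data or code) of how scoring the same model-generated explanation against a single gold explanation versus an expert-augmented rating set changes the estimated fact relevance:

```python
# Minimal sketch (hypothetical fact IDs, not the paper's data or code):
# score one model-generated explanation against (a) a single gold
# explanation and (b) an expert-augmented set of rated-relevant facts.

def relevance_precision(predicted_facts, relevant_facts):
    """Fraction of predicted facts that are rated relevant."""
    if not predicted_facts:
        return 0.0
    hits = sum(1 for fact in predicted_facts if fact in relevant_facts)
    return hits / len(predicted_facts)

# One gold explanation for a question.
gold = {"f1", "f2", "f3"}

# Expert-augmented ratings: the gold facts plus additional facts that
# domain experts also judged relevant to the same question.
expert_augmented = gold | {"f7", "f9"}

# A model explanation that found a valid alternative fact ("f7").
predicted = ["f1", "f2", "f7"]

print(round(relevance_precision(predicted, gold), 2))             # 0.67
print(round(relevance_precision(predicted, expert_augmented), 2)) # 1.0
```

The same logic applies to completeness (recall over the rated-relevant facts): an explanation can look incomplete against a single gold set yet complete against the augmented ratings, and the paper finds that full manual expert judgements shift the estimates further still, by up to 36%.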
Related papers
- Evaluating Consistency and Reasoning Capabilities of Large Language Models [0.0]
Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance.
Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate.
This paper aims to evaluate and compare the consistency and reasoning capabilities of both public and proprietary LLMs.
arXiv Detail & Related papers (2024-04-25T10:03:14Z)
- CNN-based explanation ensembling for dataset, representation and explanations evaluation [1.1060425537315088]
We explore the potential of ensembling explanations generated by deep classification models using a convolutional model.
Through experimentation and analysis, we investigate how combining explanations can uncover more coherent and reliable patterns in the model's behavior.
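A rough sketch of the general idea follows, using a simple mean over normalized maps; the paper itself learns a convolutional combiner, and the saliency maps below are random stand-ins:

```python
# Rough sketch of explanation ensembling with a simple mean over
# normalized maps; the paper learns a convolutional combiner instead,
# and the saliency maps below are random stand-ins.
import numpy as np

def normalize(saliency):
    """Rescale a saliency map to [0, 1] so explainers are comparable."""
    lo, hi = saliency.min(), saliency.max()
    return (saliency - lo) / (hi - lo + 1e-12)

rng = np.random.default_rng(0)
# Stand-ins for maps from, e.g., Grad-CAM, Integrated Gradients, LIME.
maps = [rng.random((7, 7)) for _ in range(3)]

ensembled = np.mean([normalize(m) for m in maps], axis=0)
print(ensembled.shape)  # (7, 7)
```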
arXiv Detail & Related papers (2024-04-16T08:39:29Z)
- Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development.
To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps.
These findings suggest caution about the usefulness of saliency-based explanations and their potential to be misunderstood.
arXiv Detail & Related papers (2023-12-10T23:13:23Z)
- OPT-R: Exploring the Role of Explanations in Finetuning and Prompting for Reasoning Skills of Large Language Models [48.412284346337344]
We conduct a thorough investigation into the reasoning capabilities of Large Language Models (LLMs).
Our study entails finetuning three different sizes of Open Pretrained Transformers (OPT).
We then evaluate all models on 57 out-of-domain tasks drawn from the SUPER-NATURALINSTRUCTIONS benchmark.
arXiv Detail & Related papers (2023-05-19T20:58:22Z)
- MetaLogic: Logical Reasoning Explanations with Fine-Grained Structure [129.8481568648651]
We propose a benchmark to investigate models' logical reasoning capabilities in complex real-life scenarios.
The explanation format, built on multi-hop reasoning chains, includes three main components.
We evaluate the current best models' performance on this new explanation form.
arXiv Detail & Related papers (2022-10-22T16:01:13Z)
- The Unreliability of Explanations in Few-Shot In-Context Learning [50.77996380021221]
We focus on two NLP tasks that involve reasoning over text, namely question answering and natural language inference.
We show that explanations judged as good by humans (those that are logically consistent with the input) usually indicate more accurate predictions.
We present a framework for calibrating model predictions based on the reliability of the explanations.
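As a loose illustration only (the paper's actual framework is not reproduced here), calibration of this kind might shrink a prediction's confidence when its explanation fails some reliability check; `explanation_is_consistent` below is a hypothetical stand-in for such a signal:

```python
# Loose illustration only, not the paper's framework: shrink a
# prediction's confidence toward chance when its explanation fails a
# reliability check. `explanation_is_consistent` is a hypothetical
# stand-in for a learned or rule-based reliability signal.

def calibrate(confidence, explanation_is_consistent, shrink=0.5):
    """Keep confidence if the explanation is reliable, else damp it."""
    if explanation_is_consistent:
        return confidence
    return 0.5 + shrink * (confidence - 0.5)  # pull toward 0.5 (chance)

print(round(calibrate(0.9, True), 2))   # 0.9 -> kept
print(round(calibrate(0.9, False), 2))  # 0.7 -> damped
```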
arXiv Detail & Related papers (2022-05-06T17:57:58Z)
- ExSum: From Local Explanations to Model Understanding [6.23934576145261]
Interpretability methods are developed to understand the working mechanisms of black-box models.
Fulfilling this goal requires both that the explanations generated by these methods are correct and that people can easily and reliably understand them.
We introduce explanation summary (ExSum), a mathematical framework for quantifying model understanding.
arXiv Detail & Related papers (2022-04-30T02:07:20Z)
- Detection Accuracy for Evaluating Compositional Explanations of Units [5.220940151628734]
Two examples of methods that use this approach are Network Dissection and Compositional Explanations.
While logical forms are intuitively more informative than atomic concepts, it is not clear how to quantify this improvement.
We propose Detection Accuracy as an evaluation metric, which measures how consistently units detect their assigned explanations.
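A toy illustration of a detection-accuracy-style metric follows; the activations, threshold, and explanation truth values are made up for the example:

```python
# Toy illustration of a detection-accuracy-style metric; activations,
# threshold, and explanation truth values are made up for the example.

def detection_accuracy(activations, explanation_holds, threshold=0.5):
    """Fraction of samples where the unit fires iff its explanation holds."""
    agree = sum(
        (act > threshold) == holds
        for act, holds in zip(activations, explanation_holds)
    )
    return agree / len(activations)

# A unit's activations on five samples, and whether its assigned
# compositional explanation (e.g., "striped AND NOT dotted") holds there.
acts = [0.9, 0.1, 0.7, 0.2, 0.8]
holds = [True, False, True, True, False]

print(detection_accuracy(acts, holds))  # 0.6
```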
arXiv Detail & Related papers (2021-09-16T08:47:34Z)
- Evaluating Explanations: How much do explanations from the teacher aid students? [103.05037537415811]
We formalize the value of explanations using a student-teacher paradigm that measures the extent to which explanations improve student models in learning.
Unlike many prior proposals to evaluate explanations, our approach cannot be easily gamed, enabling principled, scalable, and automatic evaluation of attributions.
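A toy sketch of the student-teacher idea, with hand-written stand-ins in place of trained student models: the value of an explanation is the gain in the student's agreement with the teacher.

```python
# Toy sketch of the student-teacher idea with hand-written stand-ins:
# the value of an explanation is the gain in a student's agreement with
# the teacher over learning from teacher labels alone.

def agreement(student, teacher, inputs):
    return sum(student(x) == teacher(x) for x in inputs) / len(inputs)

# Hypothetical teacher: the label is the sign of the first feature.
teacher = lambda x: x[0] > 0

# Student trained without explanations: a weak constant guesser.
student_plain = lambda x: True
# Student trained with the explanation "feature 0 drives the label".
student_with_expl = lambda x: x[0] > 0

inputs = [(-2.0,), (-1.0,), (0.5,), (3.0,)]
value = agreement(student_with_expl, teacher, inputs) - \
        agreement(student_plain, teacher, inputs)
print(value)  # 0.5: the explanation helped the student this much
```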
arXiv Detail & Related papers (2020-12-01T23:40:21Z)
- The Struggles of Feature-Based Explanations: Shapley Values vs. Minimal Sufficient Subsets [61.66584140190247]
We show that feature-based explanations pose problems even for explaining trivial models.
We show that two popular classes of explainers, Shapley explainers and minimal sufficient subsets explainers, target fundamentally different types of ground-truth explanations.
arXiv Detail & Related papers (2020-09-23T09:45:23Z)
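To see why these two explanation types can disagree, consider a small worked example (constructed here, not taken from the paper): exact Shapley values and a minimal sufficient subset for f(x1, x2, x3) = x1 OR x2 at input (1, 1, 0), with absent features set to a 0 baseline:

```python
# Constructed worked example (not from the paper): exact Shapley values
# vs. a minimal sufficient subset for f(x1, x2, x3) = x1 OR x2 at input
# (1, 1, 0), with absent features set to a 0 baseline.
from itertools import combinations
from math import factorial

x = (1, 1, 0)
n = len(x)
f = lambda a, b, c: int(bool(a) or bool(b))  # x3 is irrelevant

def v(subset):
    """Model output with features outside `subset` zeroed out."""
    masked = [x[i] if i in subset else 0 for i in range(n)]
    return f(*masked)

def shapley(i):
    """Exact Shapley value of feature i under the baseline convention."""
    others = [j for j in range(n) if j != i]
    total = 0.0
    for size in range(n):
        for s in combinations(others, size):
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            total += weight * (v(set(s) | {i}) - v(set(s)))
    return total

print([round(shapley(i), 2) for i in range(n)])  # [0.5, 0.5, 0.0]

# {x1} alone already fixes the output to 1, so it is a minimal
# sufficient subset, even though Shapley gives x2 equal credit.
print(v({0}) == v({0, 1, 2}))  # True
```

Shapley splits credit evenly between x1 and x2, while either one alone is a minimal sufficient subset; the two explanation types answer different questions about the same prediction.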
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.