REV: Information-Theoretic Evaluation of Free-Text Rationales
- URL: http://arxiv.org/abs/2210.04982v5
- Date: Fri, 2 Jun 2023 15:27:46 GMT
- Title: REV: Information-Theoretic Evaluation of Free-Text Rationales
- Authors: Hanjie Chen, Faeze Brahman, Xiang Ren, Yangfeng Ji, Yejin Choi, Swabha Swayamdipta
- Abstract summary: We argue that an ideal metric should focus on the new information uniquely provided in the rationale that is otherwise not provided in the input or the label.
We propose a metric called REV (Rationale Evaluation with conditional V-information) to quantify the amount of new, label-relevant information in a rationale.
- Score: 83.24985872655738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating free-text rationales is a promising step towards explainable NLP,
yet evaluating such rationales remains a challenge. Existing metrics have
mostly focused on measuring the association between the rationale and a given
label. We argue that an ideal metric should focus on the new information
uniquely provided in the rationale that is otherwise not provided in the input
or the label. We investigate this research problem from an
information-theoretic perspective using conditional V-information (Hewitt et
al., 2021). More concretely, we propose a metric called REV (Rationale
Evaluation with conditional V-information), to quantify the amount of new,
label-relevant information in a rationale beyond the information already
available in the input or the label. Experiments across four benchmarks with
reasoning tasks, including chain-of-thought, demonstrate the effectiveness of
REV in evaluating rationale-label pairs, compared to existing metrics. We
further demonstrate REV is consistent with human judgments on rationale
evaluations and provides more sensitive measurements of new information in
free-text rationales. When used alongside traditional performance metrics, REV
provides deeper insights into models' reasoning and prediction processes.
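
To make the quantity concrete, the sketch below shows one way a REV-style score could be approximated as pointwise conditional V-information: the log-probability an evaluator assigns to the label given the input plus the candidate rationale, minus the log-probability assigned to the label given the input plus a vacuous baseline rationale that adds no new information. This is a minimal illustration, not the authors' released implementation: the model choice (`t5-small` standing in for both evaluators), the prompt formats, and the baseline construction are assumptions; the paper fine-tunes dedicated evaluator models rather than using off-the-shelf checkpoints.

```python
# Minimal sketch of a REV-style score as pointwise conditional V-information:
#   REV(x, r, y) ~= log p_g'(y | x, r) - log p_g(y | x, b)
# where r is the candidate rationale and b is a vacuous baseline rationale.
# NOTE: a single off-the-shelf t5-small stands in for both evaluators g and g';
# prompts and the baseline template below are illustrative assumptions only.

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.eval()


def label_log_likelihood(context: str, label: str) -> float:
    """Sum of token log-probabilities of `label` given `context`."""
    enc = tokenizer(context, return_tensors="pt", truncation=True)
    lab = tokenizer(label, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        out = model(**enc, labels=lab)
    # out.loss is the mean token-level cross-entropy; rescale to a sum.
    return -out.loss.item() * lab.shape[1]


def rev_score(question: str, rationale: str, label: str) -> float:
    """Pointwise CVI estimate: new, label-relevant information in the rationale."""
    with_rationale = f"question: {question} rationale: {rationale}"
    # Assumed vacuous baseline: restates the label without adding information.
    vacuous = f"question: {question} rationale: The answer is {label}."
    return (label_log_likelihood(with_rationale, label)
            - label_log_likelihood(vacuous, label))


if __name__ == "__main__":
    score = rev_score(
        question="Where would you find a jellyfish?",
        rationale="Jellyfish are marine animals that live in the ocean.",
        label="ocean",
    )
    print(f"REV-style score: {score:.3f}")
```

Under this framing, a higher score suggests the rationale supplies information about the label beyond what the input and a bare restatement of the label already provide, while scores near zero or below suggest a vacuous or unhelpful rationale.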
Related papers
- EVA-Score: Evaluating Abstractive Long-form Summarization on Informativeness through Extraction and Validation [24.259369307335774]
EVA-Score is an evaluation metric for abstractive long-form summarization.
We show that EVA-Score achieves the highest correlation with human judgments.
We also re-evaluate the performance of LLMs on long-form summarization from the information perspective.
arXiv Detail & Related papers (2024-07-06T06:02:38Z)
- RORA: Robust Free-Text Rationale Evaluation [52.98000150242775]
We propose RORA, a robust free-text rationale evaluation method that counteracts label leakage.
RORA consistently outperforms existing approaches in evaluating human-written, synthetic, or model-generated rationales.
We also show that RORA aligns well with human judgment, providing a more reliable and accurate measurement across diverse free-text rationales.
arXiv Detail & Related papers (2024-02-28T19:46:21Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study whether reference-free metrics have any deficiencies.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results show that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Measuring Information in Text Explanations [23.929076318334047]
We argue that placing explanations within an information-theoretic framework could unify the evaluation of two popular text explanation methods.
We quantify the information flow through these channels, thereby facilitating the assessment of explanation characteristics.
Our work contributes to the ongoing efforts in establishing rigorous and standardized evaluation criteria in the rapidly evolving field of explainable AI.
arXiv Detail & Related papers (2023-10-06T19:46:51Z)
- Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
arXiv Detail & Related papers (2023-08-28T03:03:03Z)
- ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve upon and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z)
- Towards Explainable Evaluation Metrics for Natural Language Generation [36.594817754285984]
We identify key properties of, and propose goals for, explainable machine translation evaluation metrics.
We conduct our own novel experiments, which find that current adversarial NLP techniques are unsuitable for automatically identifying limitations of high-quality black-box evaluation metrics.
arXiv Detail & Related papers (2022-03-21T17:05:54Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)