FRAME: Evaluating Simulatability Metrics for Free-Text Rationales
- URL: http://arxiv.org/abs/2207.00779v1
- Date: Sat, 2 Jul 2022 09:25:29 GMT
- Title: FRAME: Evaluating Simulatability Metrics for Free-Text Rationales
- Authors: Aaron Chan, Shaoliang Nie, Liang Tan, Xiaochang Peng, Hamed Firooz,
Maziar Sanjabi, Xiang Ren
- Abstract summary: Free-text rationales aim to explain neural language model (LM) behavior more flexibly and intuitively via natural language.
To ensure rationale quality, it is important to have metrics for measuring rationales' faithfulness and plausibility.
We propose FRAME, a framework for evaluating free-text rationale simulatability metrics.
- Score: 26.58948555913936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Free-text rationales aim to explain neural language model (LM) behavior more
flexibly and intuitively via natural language. To ensure rationale quality, it
is important to have metrics for measuring rationales' faithfulness (reflects
LM's actual behavior) and plausibility (convincing to humans). All existing
free-text rationale metrics are based on simulatability (association between
rationale and LM's predicted label), but there is no protocol for assessing
such metrics' reliability. To investigate this, we propose FRAME, a framework
for evaluating free-text rationale simulatability metrics. FRAME is based on
three axioms: (1) good metrics should yield highest scores for reference
rationales, which maximize rationale-label association by construction; (2)
good metrics should be appropriately sensitive to semantic perturbation of
rationales; and (3) good metrics should be robust to variation in the LM's task
performance. Across three text classification datasets, we show that existing
simulatability metrics cannot satisfy all three FRAME axioms, since they are
implemented via model pretraining which muddles the metric's signal. We
introduce a non-pretraining simulatability variant that improves performance on
(1) and (3) by an average of 41.7% and 42.9%, respectively, while performing
competitively on (2).
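The three axioms above can be read as concrete checks applied to any candidate metric. The sketch below illustrates this reading with a toy overlap-based metric; the function names, the `toy_metric` itself, and the thresholds are illustrative assumptions for this summary, not FRAME's actual implementation.

```python
# Hypothetical sketch of the three FRAME axiom checks. A "metric" here is
# any callable mapping (rationale, label) to a score. The toy metric below
# (label-token overlap) is an assumption for illustration only.

def toy_metric(rationale: str, label: str) -> float:
    # Toy stand-in: fraction of label tokens mentioned in the rationale.
    label_tokens = label.lower().split()
    return sum(t in rationale.lower() for t in label_tokens) / len(label_tokens)

def axiom1_reference_highest(metric, reference, others, label):
    # (1) The reference rationale, which maximizes rationale-label
    # association by construction, should receive the highest score.
    ref_score = metric(reference, label)
    return all(ref_score >= metric(r, label) for r in others)

def axiom2_perturbation_sensitive(metric, rationale, perturbed, label, min_drop=0.1):
    # (2) Semantically perturbing the rationale should lower the score.
    return metric(rationale, label) - metric(perturbed, label) >= min_drop

def axiom3_task_robust(scores_strong_lm, scores_weak_lm, tol=0.1):
    # (3) Scores should vary little with the underlying LM's task
    # performance (here: mean absolute difference within a tolerance).
    diffs = [abs(a - b) for a, b in zip(scores_strong_lm, scores_weak_lm)]
    return sum(diffs) / len(diffs) <= tol

label = "positive"
reference = "the review praises the film, so the sentiment is positive"
others = ["the film has actors", "it was released in 2020"]
print(axiom1_reference_highest(toy_metric, reference, others, label))  # True
```

A real evaluation would replace `toy_metric` with a learned simulatability metric and run these checks across datasets, but the axiom structure stays the same.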
Related papers
- Tailoring Self-Rationalizers with Multi-Reward Distillation [88.95781098418993]
Large language models (LMs) are capable of generating free-text rationales to aid question answering.
In this work, we enable small-scale LMs to generate rationales that improve downstream task performance.
Our method, MaRio, is a multi-reward conditioned self-rationalization algorithm.
arXiv Detail & Related papers (2023-11-06T00:20:11Z)
- Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
arXiv Detail & Related papers (2023-08-28T03:03:03Z)
- On the Limitations of Reference-Free Evaluations of Generated Text [64.81682222169113]
We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
arXiv Detail & Related papers (2022-10-22T22:12:06Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Measuring Association Between Labels and Free-Text Rationales [60.58672852655487]
In interpretable NLP, we require faithful rationales that reflect the model's decision-making process for an explained instance.
We demonstrate that pipelines, the existing models for faithful extractive rationalization on information-extraction-style tasks, do not extend reliably to "reasoning" tasks requiring free-text rationales.
We turn to models that jointly predict and rationalize, a class of widely used high-performance models for free-text rationalization whose faithfulness is not yet established.
arXiv Detail & Related papers (2020-10-24T03:40:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.