FRAME: Evaluating Simulatability Metrics for Free-Text Rationales
- URL: http://arxiv.org/abs/2207.00779v1
- Date: Sat, 2 Jul 2022 09:25:29 GMT
- Title: FRAME: Evaluating Simulatability Metrics for Free-Text Rationales
- Authors: Aaron Chan, Shaoliang Nie, Liang Tan, Xiaochang Peng, Hamed Firooz,
Maziar Sanjabi, Xiang Ren
- Abstract summary: Free-text rationales aim to explain neural language model (LM) behavior more flexibly and intuitively via natural language.
To ensure rationale quality, it is important to have metrics for measuring rationales' faithfulness and plausibility.
We propose FRAME, a framework for evaluating free-text rationale simulatability metrics.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Free-text rationales aim to explain neural language model (LM) behavior more
flexibly and intuitively via natural language. To ensure rationale quality, it
is important to have metrics for measuring rationales' faithfulness (how well they
reflect the LM's actual behavior) and plausibility (how convincing they are to
humans). All existing
free-text rationale metrics are based on simulatability (association between
rationale and LM's predicted label), but there is no protocol for assessing
such metrics' reliability. To investigate this, we propose FRAME, a framework
for evaluating free-text rationale simulatability metrics. FRAME is based on
three axioms: (1) good metrics should yield the highest scores for reference
rationales, which maximize rationale-label association by construction; (2)
good metrics should be appropriately sensitive to semantic perturbation of
rationales; and (3) good metrics should be robust to variation in the LM's task
performance. Across three text classification datasets, we show that existing
simulatability metrics cannot satisfy all three FRAME axioms, since they are
implemented via model pretraining which muddles the metric's signal. We
introduce a non-pretraining simulatability variant that improves performance on
(1) and (3) by an average of 41.7% and 42.9%, respectively, while performing
competitively on (2).
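Concretely, the simulatability metrics under study compare how well a simulator can recover the task LM's predicted label with versus without the rationale. Below is a minimal sketch of that general idea; the `simulator` callable, the input formatting, and the example data are hypothetical stand-ins, not FRAME's actual protocol, models, or training setup.

```python
from typing import Callable, List, Tuple

def simulatability(
    simulator: Callable[[str], str],
    examples: List[Tuple[str, str, str]],  # (task_input, rationale, lm_predicted_label)
) -> float:
    """Accuracy gain of a simulator at recovering the LM's predicted labels
    when the rationale is appended to the input (illustrative only)."""
    with_r = without_r = 0
    for task_input, rationale, lm_label in examples:
        without_r += simulator(task_input) == lm_label
        with_r += simulator(f"{task_input} rationale: {rationale}") == lm_label
    n = len(examples)
    # Higher values mean the rationale is more predictive of the LM's label.
    return with_r / n - without_r / n
```

FRAME's axioms then probe a score of this kind: (1) it should be maximized by reference rationales, (2) it should drop under semantic perturbation of the rationale, and (3) it should be stable across LMs with different task performance.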
Related papers
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - Log Probabilities Are a Reliable Estimate of Semantic Plausibility in Base and Instruction-Tuned Language Models [50.15455336684986]
We evaluate the effectiveness of LogProbs and basic prompting to measure semantic plausibility.
We find that LogProbs offers a more reliable measure of semantic plausibility than direct zero-shot prompting.
We conclude that, even in the era of prompt-based evaluations, LogProbs constitute a useful metric of semantic plausibility.
arXiv Detail & Related papers (2024-03-21T22:08:44Z) - Tailoring Self-Rationalizers with Multi-Reward Distillation [88.95781098418993]
- Tailoring Self-Rationalizers with Multi-Reward Distillation [88.95781098418993]
Large language models (LMs) are capable of generating free-text rationales to aid question answering.
In this work, we enable small-scale LMs to generate rationales that improve downstream task performance.
Our method, MaRio, is a multi-reward conditioned self-rationalization algorithm.
arXiv Detail & Related papers (2023-11-06T00:20:11Z) - On the Limitations of Reference-Free Evaluations of Generated Text [64.81682222169113]
We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
arXiv Detail & Related papers (2022-10-22T22:12:06Z) - SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that, with a model-based matching function, our proposed metric achieves higher system-level correlations than all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z) - GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z) - Measuring Association Between Labels and Free-Text Rationales [60.58672852655487]
In interpretable NLP, we require faithful rationales that reflect the model's decision-making process for an explained instance.
We demonstrate that pipelines, existing models for faithful extractive rationalization on information-extraction style tasks, do not extend as reliably to "reasoning" tasks requiring free-text rationales.
We turn to models that jointly predict and rationalize, a class of widely used high-performance models for free-text rationalization whose faithfulness is not yet established.
arXiv Detail & Related papers (2020-10-24T03:40:56Z)