Evaluating and Characterizing Human Rationales
- URL: http://arxiv.org/abs/2010.04736v1
- Date: Fri, 9 Oct 2020 18:00:04 GMT
- Title: Evaluating and Characterizing Human Rationales
- Authors: Samuel Carton, Anirudh Rathore, Chenhao Tan
- Abstract summary: We find that human rationales do not necessarily perform well on automated metrics.
We propose improved metrics to account for model-dependent baseline performance.
Our work leads to actionable suggestions for evaluating and characterizing rationales.
- Score: 12.678505281794715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Two main approaches for evaluating the quality of machine-generated
rationales are: 1) using human rationales as a gold standard; and 2) automated
metrics based on how rationales affect model behavior. An open question,
however, is how human rationales fare with these automatic metrics. Analyzing a
variety of datasets and models, we find that human rationales do not
necessarily perform well on these metrics. To unpack this finding, we propose
improved metrics to account for model-dependent baseline performance. We then
propose two methods to further characterize rationale quality, one based on
model retraining and one on using "fidelity curves" to reveal properties such
as irrelevance and redundancy. Our work leads to actionable suggestions for
evaluating and characterizing rationales.
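To make the abstract's metrics concrete, here is a minimal sketch (not the paper's implementation) of sufficiency, comprehensiveness, a baseline-normalized variant that accounts for model-dependent baseline performance, and a simple fidelity curve. The `predict_proba` interface, the `[MASK]` token convention, and the normalization constant are assumptions made for illustration only.

```python
import numpy as np

# Assumed interface (not from the paper): model.predict_proba(tokens) returns
# a probability vector over classes for one tokenized example.

def sufficiency(model, tokens, rationale_mask, label):
    """Model confidence in `label` when only rationale tokens are visible."""
    kept = [t if keep else "[MASK]" for t, keep in zip(tokens, rationale_mask)]
    return model.predict_proba(kept)[label]

def comprehensiveness(model, tokens, rationale_mask, label):
    """Drop in confidence when rationale tokens are masked out."""
    removed = [t if not keep else "[MASK]" for t, keep in zip(tokens, rationale_mask)]
    full = model.predict_proba(tokens)[label]
    return full - model.predict_proba(removed)[label]

def normalized_sufficiency(model, tokens, rationale_mask, label):
    """Rescale sufficiency against a model-dependent baseline: the confidence
    the model reaches when no tokens are revealed at all.  This keeps a
    rationale from looking sufficient merely because the model's prior
    already favors the label.  (Illustrative normalization, not the paper's.)"""
    baseline = sufficiency(model, tokens, [False] * len(tokens), label)
    raw = sufficiency(model, tokens, rationale_mask, label)
    return (raw - baseline) / max(1.0 - baseline, 1e-8)

def fidelity_curve(model, tokens, rationale_mask, label, steps=10):
    """Reveal growing prefixes of the rationale and record sufficiency at each
    step.  A curve that saturates early suggests redundant rationale tokens;
    one that never rises suggests irrelevant ones."""
    idx = [i for i, keep in enumerate(rationale_mask) if keep]
    curve = []
    for k in np.linspace(0, len(idx), steps).round().astype(int):
        mask = [False] * len(tokens)
        for i in idx[:k]:
            mask[i] = True
        curve.append(sufficiency(model, tokens, mask, label))
    return curve
```

In practice these quantities are aggregated over a dataset and computed with the paper's exact definitions; the sketch only conveys the shape of the computation.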
Related papers
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate, for example, that leveraging its insights improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- What is the Best Automated Metric for Text to Motion Generation? [19.71712698183703]
There is growing interest in generating skeleton-based human motions from natural language descriptions.
Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments.
This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better.
arXiv Detail & Related papers (2023-09-19T01:59:54Z)
- Gender Biases in Automatic Evaluation Metrics for Image Captioning [87.15170977240643]
We conduct a systematic study of gender biases in model-based evaluation metrics for image captioning tasks.
We demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations.
We present a simple and effective way to mitigate the metric bias without hurting the correlations with human judgments.
arXiv Detail & Related papers (2023-05-24T04:27:40Z)
- Are Machine Rationales (Not) Useful to Humans? Measuring and Improving Human Utility of Free-Text Rationales [62.02328001381361]
We show that the human utility of existing rationales is far from satisfactory and is expensive to estimate with human studies.
We translate this finding into an automated score, GEN-U, that can help improve LMs' ability to generate rationales with better human utility.
arXiv Detail & Related papers (2023-05-11T19:01:13Z)
- Are Neural Topic Models Broken? [81.15470302729638]
We study the relationship between automated and human evaluation of topic models.
We find that neural topic models fare worse in both respects compared to an established classical method.
arXiv Detail & Related papers (2022-10-28T14:38:50Z)
- Does Self-Rationalization Improve Robustness to Spurious Correlations? [19.553357015260687]
We ask whether training models to self-rationalize can aid in their learning to solve tasks for the right reasons.
We evaluate robustness to spurious correlations in fine-tuned encoder-decoder and decoder-only models of six different sizes.
We find that while self-rationalization can improve robustness to spurious correlations in low-resource settings, it tends to hurt robustness in higher-resource settings.
arXiv Detail & Related papers (2022-10-24T19:54:57Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the existing body of automatic metrics against human evaluation.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- Using Shape Metrics to Describe 2D Data Points [0.0]
We propose to use shape metrics to describe 2D data to help make analyses more explainable and interpretable.
This is particularly important in applications in the medical community, where the 'right to explainability' is crucial.
arXiv Detail & Related papers (2022-01-27T23:28:42Z)
- What to Learn, and How: Toward Effective Learning from Rationales [10.287185780246247]
Learning from rationales seeks to augment model training with human-provided rationales that justify the training labels.
Our work highlights the importance of understanding properties of human explanations and exploiting them accordingly in model training.
arXiv Detail & Related papers (2021-11-30T20:09:53Z)
- Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence [62.826466543958624]
We look at the standardization gap and the validation gap in topic model evaluation.
Recent models relying on neural components surpass classical topic models according to automated coherence metrics.
We use automatic coherence along with the two most widely accepted human judgment tasks, namely, topic rating and word intrusion.
arXiv Detail & Related papers (2021-07-05T17:58:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.