ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation
- URL: http://arxiv.org/abs/2405.04818v1
- Date: Wed, 8 May 2024 05:36:52 GMT
- Title: ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation
- Authors: Ana Brassard, Benjamin Heinzerling, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui
- Abstract summary: We present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings.
We observed that replacing one of the human ratings with an LLM rating sometimes maintained, but more often lowered, inter-annotator agreement.
We also measured how well majority-voted labels from a limited human pool, with an LLM as an additional rater, correlated with the original gold labels.
- Score: 29.718851249656172
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to gain insights into how LLMs evaluate explanations. We observed that replacing one of the human ratings with an LLM rating sometimes maintained, but more often lowered, inter-annotator agreement across different settings and quality aspects, suggesting that LLM judgments are not always consistent with those of human raters. We further quantified this difference by measuring the correlation between LLM-generated ratings and majority-voted human ratings across different quality aspects. With the best system, Spearman's rank correlation ranged from 0.53 to 0.95, averaging 0.72 across aspects, indicating moderately high but imperfect alignment. Finally, we considered the alternative of using an LLM as an additional rater when human raters are scarce, and measured how majority-voted labels from a limited human pool with an LLM as an additional rater correlate with the original gold labels. While GPT-4 improved the outcome when there were only two human raters, in all other observed cases LLMs were neutral to detrimental when there were three or more human raters. We publicly release the dataset to support future improvements in LLM-in-the-loop evaluation here: https://github.com/a-brassard/ACORN.
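As a rough illustration of the two analyses described in the abstract, the sketch below (with made-up ratings) computes Spearman's rank correlation between LLM ratings and majority-voted human ratings, and then re-votes with the LLM added to a reduced pool of two human raters. The ratings, the majority_vote helper, and its tie-breaking rule are hypothetical; this is not the authors' released code.

```python
# Minimal sketch with made-up ratings; not the authors' released code.
from collections import Counter
from scipy.stats import spearmanr

def majority_vote(ratings):
    """Return the most frequent rating; break ties toward the lower value."""
    counts = Counter(ratings)
    top = max(counts.values())
    return min(r for r, c in counts.items() if c == top)

# Hypothetical 1-5 ratings from three human raters for six explanations
# on a single quality aspect, plus one LLM rating per explanation.
human_ratings = [[4, 5, 4], [2, 2, 3], [5, 5, 5], [3, 4, 3], [1, 2, 1], [4, 4, 5]]
llm_ratings = [4, 3, 5, 4, 1, 5]

# (1) Correlation of LLM ratings with majority-voted human ("gold") ratings.
gold = [majority_vote(r) for r in human_ratings]
rho, _ = spearmanr(llm_ratings, gold)
print(f"LLM vs. majority-voted human ratings: rho = {rho:.2f}")

# (2) LLM as an additional rater over a reduced pool of two humans:
# add the LLM rating, re-vote, and compare against the original gold labels.
mixed = [majority_vote(r[:2] + [llm]) for r, llm in zip(human_ratings, llm_ratings)]
rho_mixed, _ = spearmanr(mixed, gold)
print(f"Reduced pool + LLM vs. gold labels: rho = {rho_mixed:.2f}")
```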
Related papers
- Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol can significantly affect evaluation reliability and induce systematic biases.
In particular, we show that pairwise evaluation protocols are more vulnerable to distracted evaluation.
arXiv Detail & Related papers (2025-04-20T19:05:59Z)
- No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding [3.1674468740431396]
We evaluate how well LLM Judges can grade whether a given response to a conversational question is correct.
We source questions from a combination of existing datasets and a novel, challenging benchmark (BFF-Bench) created for this analysis.
We show that a weaker judge provided with higher-quality references achieves better agreement with human annotators than a stronger judge does.
arXiv Detail & Related papers (2025-03-07T00:42:08Z)
- Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance [73.19687314438133]
We study how reliance is affected by contextual features of an interaction.
We find that contextual characteristics significantly affect human reliance behavior.
Our results show that calibration and language quality alone are insufficient in evaluating the risks of human-LM interactions.
arXiv Detail & Related papers (2024-07-10T18:00:05Z)
- Scaling Data Diversity for Fine-Tuning Language Models in Human Alignment [84.32768080422349]
Alignment with human preference prevents large language models from generating misleading or toxic content.
We propose a new formulation of prompt diversity that implies a linear correlation with the final performance of LLMs after fine-tuning.
arXiv Detail & Related papers (2024-03-17T07:08:55Z)
- Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits.
Advanced LLMs like GPT-4-Turbo, by contrast, place more emphasis on correctness, clarity, and harmlessness.
We show that preference-based evaluation can be intentionally manipulated.
arXiv Detail & Related papers (2024-02-17T14:34:31Z)
- CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study on different datasets shows CoAnnotating to be an effective means of allocating work, with up to a 21% performance improvement over a random baseline.
arXiv Detail & Related papers (2023-10-24T08:56:49Z)
- Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models [32.843361525236965]
We analyze the effect of sparse feedback on the alignment and evaluation of large language models.
We find that preferences elicited from ratings and from rankings disagree significantly, around 60% of the time for both human and AI annotators.
Our findings shed light on critical gaps in methods for evaluating the real-world utility of language models.
arXiv Detail & Related papers (2023-08-30T07:35:32Z)
- Fine-Grained Human Feedback Gives Better Rewards for Language Model Training [108.25635150124539]
Language models (LMs) often exhibit undesirable text generation behaviors, including generating false, toxic, or irrelevant outputs.
We introduce Fine-Grained RLHF, a framework that enables training and learning from reward functions that are fine-grained in two respects.
arXiv Detail & Related papers (2023-06-02T17:11:37Z)
- Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric (a brief sketch of this idea appears after this list).
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
- DecipherPref: Analyzing Influential Factors in Human Preference Judgments via GPT-4 [28.661237196238996]
We conduct an in-depth examination of a collection of pairwise human judgments released by OpenAI.
We find that the most favored factors vary across tasks and genres, whereas the least favored factors tend to be consistent.
Our findings have implications for the construction of balanced datasets in human preference evaluations.
arXiv Detail & Related papers (2023-05-24T04:13:15Z)
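For the rescaling idea referenced above ("Using Natural Language Explanations to Rescale Human Judgments"), the following is a minimal sketch of how a Likert rating and its explanation might be turned into a rubric-anchored prompt. The rubric text, the prompt wording, and the call_llm() placeholder are illustrative assumptions, not that paper's actual implementation.

```python
# Hypothetical sketch of rubric-anchored rescaling; the rubric, prompt wording,
# and call_llm() placeholder are illustrative assumptions.
def build_rescaling_prompt(likert_rating: int, explanation: str, rubric: str) -> str:
    """Combine a 1-5 Likert rating and its free-text explanation into a prompt
    that asks the model for a 0-100 score anchored in the given rubric."""
    return (
        f"Scoring rubric:\n{rubric}\n\n"
        f"An annotator gave a rating of {likert_rating} on a 1-5 Likert scale "
        f'and explained: "{explanation}"\n\n'
        "Following the rubric, output a single numeric score from 0 to 100."
    )

rubric = "90-100: fully supported; 50-89: partially supported; 0-49: unsupported."
prompt = build_rescaling_prompt(4, "Mostly correct, but one supporting detail is missing.", rubric)
# score = call_llm(prompt)  # call_llm() stands in for any LLM API of choice
print(prompt)
```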
This list is automatically generated from the titles and abstracts of the papers on this site.