A Multi-Aspect Framework for Counter Narrative Evaluation using Large Language Models
- URL: http://arxiv.org/abs/2402.11676v2
- Date: Fri, 29 Mar 2024 15:01:38 GMT
- Title: A Multi-Aspect Framework for Counter Narrative Evaluation using Large Language Models
- Authors: Jaylen Jones, Lingbo Mo, Eric Fosler-Lussier, Huan Sun
- Abstract summary: Counter narratives are informed responses to hate speech contexts designed to refute hateful claims and de-escalate encounters.
Previous automatic metrics for counter narrative evaluation lack alignment with human judgment.
We propose a novel evaluation framework prompting LLMs to provide scores and feedback for generated counter narrative candidates.
- Score: 16.878541623617473
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Counter narratives - informed responses to hate speech contexts designed to refute hateful claims and de-escalate encounters - have emerged as an effective hate speech intervention strategy. While previous work has proposed automatic counter narrative generation methods to aid manual interventions, the evaluation of these approaches remains underdeveloped. Previous automatic metrics for counter narrative evaluation lack alignment with human judgment as they rely on superficial reference comparisons instead of incorporating key aspects of counter narrative quality as evaluation criteria. To address prior evaluation limitations, we propose a novel evaluation framework prompting LLMs to provide scores and feedback for generated counter narrative candidates using 5 defined aspects derived from guidelines from counter narrative specialized NGOs. We found that LLM evaluators achieve strong alignment to human-annotated scores and feedback and outperform alternative metrics, indicating their potential as multi-aspect, reference-free and interpretable evaluators for counter narrative evaluation.
Related papers
- Measuring the Robustness of Reference-Free Dialogue Evaluation Systems [12.332146893333952]
We present a benchmark for evaluating the robustness of reference-free dialogue metrics against four categories of adversarial attacks.
We analyze metrics such as DialogRPT, UniEval, and PromptEval across grounded and ungrounded datasets.
arXiv Detail & Related papers (2025-01-12T06:41:52Z) - Towards Understanding the Robustness of LLM-based Evaluations under Perturbations [9.944512689015998]
Large Language Models (LLMs) can serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks.
We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments.
arXiv Detail & Related papers (2024-12-12T13:31:58Z) - Contextualized Counterspeech: Strategies for Adaptation, Personalization, and Evaluation [2.1944577276732726]
We propose and evaluate strategies for generating tailored counterspeech that is adapted to the moderation context and personalized for the moderated user.
Results show that contextualized counterspeech can significantly outperform state-of-the-art generic counterspeech in adequacy and persuasiveness.
The effectiveness of contextualized AI-generated counterspeech and the divergence between human and algorithmic evaluations underscore the importance of increased human-AI collaboration in content moderation.
arXiv Detail & Related papers (2024-12-10T09:29:52Z) - RevisEval: Improving LLM-as-a-Judge via Response-Adapted References [95.29800580588592]
RevisEval is a novel text generation evaluation paradigm based on response-adapted references.
RevisEval is driven by the key observation that an ideal reference should maintain the necessary relevance to the response to be evaluated.
arXiv Detail & Related papers (2024-10-07T16:50:47Z) - A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation [14.064465097974836]
This paper proposes a novel approach to evaluate Counter Narrative (CN) generation using a Large Language Model (LLM) as an evaluator.
We show that traditional automatic metrics correlate poorly with human judgements and fail to capture the nuanced relationship between generated CNs and human perception.
arXiv Detail & Related papers (2024-06-21T15:11:33Z) - Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessments of conversational turns has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z) - Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z) - Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response [56.25966921370483]
There are challenges in using reference-free evaluators based on large language models.
Reference-free evaluators are more suitable for open-ended examples with semantically diverse responses.
There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
arXiv Detail & Related papers (2023-05-24T02:52:48Z) - Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
It has been claimed that large language models (LLMs) can assist with relevance judgments.
It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z) - SNaC: Coherence Error Detection for Narrative Summarization [73.48220043216087]
We introduce SNaC, a narrative coherence evaluation framework rooted in fine-grained annotations for long summaries.
We develop a taxonomy of coherence errors in generated narrative summaries and collect span-level annotations for 6.6k sentences across 150 book and movie screenplay summaries.
Our work provides the first characterization of coherence errors generated by state-of-the-art summarization models and a protocol for eliciting coherence judgments from crowd annotators.
arXiv Detail & Related papers (2022-05-19T16:01:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.