CSEval: Towards Automated, Multi-Dimensional, and Reference-Free Counterspeech Evaluation using Auto-Calibrated LLMs
- URL: http://arxiv.org/abs/2501.17581v2
- Date: Sun, 09 Feb 2025 17:49:08 GMT
- Title: CSEval: Towards Automated, Multi-Dimensional, and Reference-Free Counterspeech Evaluation using Auto-Calibrated LLMs
- Authors: Amey Hengle, Aswini Kumar, Anil Bandhakavi, Tanmoy Chakraborty
- Abstract summary: We introduce CSEval, a novel dataset and framework for evaluating counterspeech quality across four dimensions.
We propose Auto-Calibrated COT for Counterspeech Evaluation (Auto-CSEval), a prompt-based method with auto-calibrated chain-of-thoughts.
Our experiments show that Auto-CSEval outperforms traditional metrics like ROUGE, METEOR, and BERTScore in correlating with human judgement.
- Score: 18.827745815939213
- Abstract: Counterspeech has emerged as a popular and effective strategy for combating online hate speech, sparking growing research interest in automating its generation using language models. However, the field still lacks standardised evaluation protocols and reliable automated evaluation metrics that align with human judgement. Current automatic evaluation methods, primarily based on similarity metrics, do not effectively capture the complex and independent attributes of counterspeech quality, such as contextual relevance, aggressiveness, or argumentative coherence. This has led to an increased dependency on labor-intensive human evaluations to assess automated counterspeech generation methods. To address these challenges, we introduce CSEval, a novel dataset and framework for evaluating counterspeech quality across four dimensions: contextual-relevance, aggressiveness, argument-coherence, and suitableness. Furthermore, we propose Auto-Calibrated COT for Counterspeech Evaluation (Auto-CSEval), a prompt-based method with auto-calibrated chain-of-thoughts (CoT) for scoring counterspeech using large language models. Our experiments show that Auto-CSEval outperforms traditional metrics like ROUGE, METEOR, and BERTScore in correlating with human judgement, indicating a significant improvement in automated counterspeech evaluation.
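The abstract describes Auto-CSEval only at a high level: an LLM is prompted with a chain-of-thought rubric and returns a separate score for each of the four quality dimensions, with no reference counterspeech required. The sketch below is not the paper's implementation; it is a minimal, hypothetical illustration of such a reference-free, per-dimension prompting loop. The rubric wording, the 1-5 scale, the `SCORE:` output convention, and the `call_llm` backend are all assumptions introduced for illustration.

```python
"""Illustrative sketch of a reference-free, multi-dimensional LLM scorer.

NOTE: This is NOT the Auto-CSEval implementation; it only mirrors the idea in
the abstract (prompt an LLM per dimension with a chain-of-thought rubric, then
parse a score). Rubric text, 1-5 scale, and the LLM backend are placeholders.
"""
import json
import re
from typing import Callable, Dict

# The four quality dimensions named in the CSEval abstract.
DIMENSIONS: Dict[str, str] = {
    "contextual-relevance": "How directly the counterspeech addresses the hateful post.",
    "aggressiveness": "How hostile or abusive the counterspeech itself is.",
    "argument-coherence": "Whether the counterargument is logically consistent and well formed.",
    "suitableness": "Whether the response is overall appropriate as counterspeech.",
}

PROMPT_TEMPLATE = """You are evaluating a counterspeech reply to a hateful post.
Dimension: {dimension} - {rubric}

Hateful post: {hate}
Counterspeech: {counter}

Think step by step about the dimension above, then answer on the last line as:
SCORE: <integer 1-5>"""


def score_counterspeech(
    hate: str,
    counter: str,
    call_llm: Callable[[str], str],
) -> Dict[str, int]:
    """Return one 1-5 score per dimension; `call_llm` is any prompt -> text backend."""
    scores: Dict[str, int] = {}
    for dim, rubric in DIMENSIONS.items():
        prompt = PROMPT_TEMPLATE.format(
            dimension=dim, rubric=rubric, hate=hate, counter=counter
        )
        reply = call_llm(prompt)
        match = re.search(r"SCORE:\s*([1-5])", reply)
        scores[dim] = int(match.group(1)) if match else 0  # 0 marks an unparsable reply
    return scores


if __name__ == "__main__":
    # Dummy backend so the sketch runs without an API key: always answers 3.
    dummy_llm = lambda prompt: "Step-by-step reasoning...\nSCORE: 3"
    print(json.dumps(score_counterspeech("<hateful post>", "<reply>", dummy_llm), indent=2))
```

Scoring each dimension with its own prompt keeps the judgements independent, which matches the abstract's point that the quality attributes vary independently; the auto-calibration of the chain-of-thought demonstrations described in the paper is not reproduced in this sketch.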
Related papers
- How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs? [3.1706553206969925]
We perform a meta-evaluation of such methods and assess their reliability across a broad range of tasks.
We observe that while automatic evaluation methods can approximate human ratings under specific conditions, their validity is highly context-dependent.
Our findings enhance the understanding of how automatic methods should be applied and interpreted when developing and evaluating instruction-tuned LLMs.
arXiv Detail & Related papers (2024-02-16T15:48:33Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs)
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Correction of Errors in Preference Ratings from Automated Metrics for Text Generation [4.661309379738428]
We propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics.
We show that our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics.
arXiv Detail & Related papers (2023-06-06T17:09:29Z)
- From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework [91.94389491920309]
Textual adversarial attacks can discover models' weaknesses by adding semantics-preserving but misleading perturbations to the inputs.
The existing practice of robustness evaluation may exhibit issues of incomprehensive evaluation, impractical evaluation protocol, and invalid adversarial samples.
We set up a unified automatic robustness evaluation framework, shifting towards model-centric evaluation to exploit the advantages of adversarial attacks.
arXiv Detail & Related papers (2023-05-29T14:55:20Z)
- Contextualized Topic Coherence Metrics [6.630482733703617]
We propose methods inspired by standard human topic evaluations, in a family of metrics called Contextualized Topic Coherence (CTC).
We evaluate CTC relative to five other metrics on six topic models and find that it outperforms automated topic coherence methods.
arXiv Detail & Related papers (2023-05-23T23:53:29Z)
- SpeechLMScore: Evaluating speech generation using speech language model [43.20067175503602]
We propose SpeechLMScore, an unsupervised metric to evaluate generated speech using a speech-language model.
It does not require human annotation and is a highly scalable framework.
Evaluation results demonstrate that the proposed metric shows a promising correlation with human evaluation scores on different speech generation tasks.
arXiv Detail & Related papers (2022-12-08T21:00:15Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
- Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation [69.03658685761538]
Open-domain dialog system evaluation is one of the most important challenges in dialog research.
We propose an automatic evaluation model CMADE that automatically cleans self-reported user ratings as it trains on them.
Our experiments show that CMADE achieves 89.2% accuracy in the dialog comparison task.
arXiv Detail & Related papers (2020-05-21T15:14:49Z)
- Learning an Unreferenced Metric for Online Dialogue Evaluation [53.38078951628143]
We propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances.
We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
arXiv Detail & Related papers (2020-05-01T20:01:39Z)
- Designing Precise and Robust Dialogue Response Evaluators [35.137244385158034]
We propose to build a reference-free evaluator and exploit the power of semi-supervised training and pretrained language models.
Experimental results demonstrate that the proposed evaluator achieves a strong correlation (> 0.6) with human judgement.
arXiv Detail & Related papers (2020-04-10T04:59:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.