Benchmark on Peer Review Toxic Detection: A Challenging Task with a New Dataset
- URL: http://arxiv.org/abs/2502.01676v1
- Date: Sat, 01 Feb 2025 23:01:39 GMT
- Title: Benchmark on Peer Review Toxic Detection: A Challenging Task with a New Dataset
- Authors: Man Luo, Bradley Peterson, Rafael Gan, Hari Ramalingame, Navya Gangrade, Ariadne Dimarogona, Imon Banerjee, Phillip Howard,
- Abstract summary: This work explores an important but underexplored area: detecting toxicity in peer reviews.
We first define toxicity in peer reviews across four distinct categories and curate a dataset of peer reviews from the OpenReview platform.
We benchmark a variety of models, including a dedicated toxicity detection model and a sentiment analysis model.
- Score: 6.106100820330045
- License:
- Abstract: Peer review is crucial for advancing and improving science through constructive criticism. However, toxic feedback can discourage authors and hinder scientific progress. This work explores an important but underexplored area: detecting toxicity in peer reviews. We first define toxicity in peer reviews across four distinct categories and curate a dataset of peer reviews from the OpenReview platform, annotated by human experts according to these definitions. Leveraging this dataset, we benchmark a variety of models, including a dedicated toxicity detection model, a sentiment analysis model, several open-source large language models (LLMs), and two closed-source LLMs. Our experiments explore the impact of different prompt granularities, from coarse to fine-grained instructions, on model performance. Notably, state-of-the-art LLMs like GPT-4 exhibit low alignment with human judgments under simple prompts but achieve improved alignment with detailed instructions. Moreover, the model's confidence score is a good indicator of better alignment with human judgments. For example, GPT-4 achieves a Cohen's Kappa score of 0.56 with human judgments, which increases to 0.63 when using only predictions with a confidence score higher than 95%. Overall, our dataset and benchmarks underscore the need for continued research to enhance toxicity detection capabilities of LLMs. By addressing this issue, our work aims to contribute to a healthy and responsible environment for constructive academic discourse and scientific collaboration.
Related papers
- Self-Rationalization in the Wild: A Large Scale Out-of-Distribution Evaluation on NLI-related tasks [59.47851630504264]
Free-text explanations are expressive and easy to understand, but many datasets lack annotated explanation data.
We fine-tune T5-Large and OLMo-7B models and assess the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods.
The models are evaluated on 19 diverse OOD datasets across three tasks: natural language inference (NLI), fact-checking, and hallucination detection in abstractive summarization.
arXiv Detail & Related papers (2025-02-07T10:01:32Z) - Poor-Supervised Evaluation for SuperLLM via Mutual Consistency [20.138831477848615]
We propose the PoEM framework to conduct evaluation without accurate labels.
We first prove that the capability of a model can be equivalently assessed by the consistency between it and certain reference model.
To alleviate the insufficiencies of the conditions in reality, we introduce an algorithm that treats humans (when available) and the models under evaluation as reference models.
arXiv Detail & Related papers (2024-08-25T06:49:03Z) - Tell Me Why: Explainable Public Health Fact-Checking with Large Language Models [21.280725490520798]
This paper focuses on the ability of large language models to verify public health claims.
We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models.
arXiv Detail & Related papers (2024-05-15T15:49:06Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - ROBBIE: Robust Bias Evaluation of Large Generative Language Models [27.864027322486375]
Different prompt-based datasets can be used to measure social bias across multiple text domains and demographic axes.
We compare 6 different prompt-based bias and toxicity metrics across 12 demographic axes and 5 families of generative LLMs.
We conduct a comprehensive study of how well 3 bias/toxicity mitigation techniques perform across our suite of measurements.
arXiv Detail & Related papers (2023-11-29T23:03:04Z) - Bring Your Own Data! Self-Supervised Evaluation for Large Language
Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs)
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z) - G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human on summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with
Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - UTNLP at SemEval-2021 Task 5: A Comparative Analysis of Toxic Span
Detection using Attention-based, Named Entity Recognition, and Ensemble
Models [6.562256987706127]
This paper presents our team's, UTNLP, methodology and results in the SemEval-2021 shared task 5 on toxic spans detection.
The experiments start with keyword-based models and are followed by attention-based, named entity-based, transformers-based, and ensemble models.
Our best approach, an ensemble model, achieves an F1 of 0.684 in the competition's evaluation phase.
arXiv Detail & Related papers (2021-04-10T13:56:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.