Supporting Human Raters with the Detection of Harmful Content using Large Language Models
- URL: http://arxiv.org/abs/2406.12800v1
- Date: Tue, 18 Jun 2024 17:12:50 GMT
- Title: Supporting Human Raters with the Detection of Harmful Content using Large Language Models
- Authors: Kurt Thomas, Patrick Gage Kelley, David Tao, Sarah Meiklejohn, Owen Vallis, Shunwen Tan, Blaž Bratanič, Felipe Tiengo Ferreira, Vijay Kumar Eranti, Elie Bursztein
- Abstract summary: We demonstrate that large language models (LLMs) can achieve 90% accuracy when compared to human verdicts.
We propose five design patterns that integrate LLMs with human rating.
We share how piloting our proposed techniques in a real-world review queue improved the use of available human rater capacity by 41.5%.
- Score: 8.580258386804282
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we explore the feasibility of leveraging large language models (LLMs) to automate or otherwise assist human raters with identifying harmful content, including hate speech, harassment, violent extremism, and election misinformation. Using a dataset of 50,000 comments, we demonstrate that LLMs can achieve 90% accuracy when compared to human verdicts. We explore how best to leverage these capabilities, proposing five design patterns that integrate LLMs with human rating, such as pre-filtering non-violative content, detecting potential errors in human rating, or surfacing critical context to support human rating. We outline how to support all of these design patterns using a single, optimized prompt. Beyond these synthetic experiments, we share how piloting our proposed techniques in a real-world review queue improved the use of available human rater capacity by 41.5%, and yielded a 9–11% absolute increase in precision and recall for detecting violative content.
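Of the five design patterns, pre-filtering is the simplest to make concrete. The sketch below assumes an LLM that returns a structured verdict with a confidence score; the stub classifier, threshold, and routing labels are illustrative assumptions, not the paper's actual prompt or operating points.

```python
# Minimal sketch of the "pre-filtering" design pattern: confidently
# non-violative content is closed automatically and everything else
# reaches human raters. All names and values here are illustrative.
from dataclasses import dataclass

@dataclass
class Verdict:
    violative: bool
    confidence: float  # classifier confidence in [0, 1]

def llm_classify(comment: str) -> Verdict:
    # Stand-in for a single optimized LLM prompt returning a structured
    # verdict; a trivial keyword heuristic keeps the sketch runnable.
    flagged = any(w in comment.lower() for w in ("hate", "attack"))
    return Verdict(violative=flagged, confidence=0.97)

def route(comment: str, threshold: float = 0.95) -> str:
    v = llm_classify(comment)
    if not v.violative and v.confidence >= threshold:
        return "auto_close"             # pre-filter non-violative content
    if v.violative and v.confidence >= threshold:
        return "priority_human_review"  # likely violation, review first
    return "human_review"               # ambiguous: human raters decide

print(route("have a nice day"))  # -> auto_close
```

The same structured verdict could also drive the other patterns, for example by flagging cases where the LLM and a human rater disagree as candidates for re-review.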
Related papers
- Automated Filtering of Human Feedback Data for Aligning Text-to-Image Diffusion Models [36.84880190385986]
Fine-tuning text-to-image diffusion models with human feedback is an effective method for aligning model behavior with human intentions.
However, this alignment process often suffers from slow convergence due to the large size and noise present in human feedback datasets.
We propose FiFA, a novel automated data filtering algorithm designed to enhance the fine-tuning of diffusion models using human feedback datasets.
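The summary does not spell out FiFA's selection criteria, so the sketch below only shows the generic shape of such a filter, assuming each preference example can be scored independently: rank by score, keep the top fraction, and fine-tune on the kept subset.

```python
# Generic shape of automated feedback filtering; the scoring callable
# is a placeholder, not FiFA's actual criteria.
from typing import Callable, Dict, List

Example = Dict[str, str]  # {"prompt": ..., "chosen": ..., "rejected": ...}

def filter_feedback(data: List[Example],
                    score: Callable[[Example], float],
                    keep_fraction: float = 0.5) -> List[Example]:
    ranked = sorted(data, key=score, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]  # fine-tune on the cleaner, smaller subset
```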
arXiv Detail & Related papers (2024-10-14T05:18:07Z)
- VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment [55.7956150385255]
We investigate the efficacy of AI feedback to scale supervision for aligning vision-language models.
We introduce VLFeedback, the first large-scale vision-language feedback dataset.
We train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback.
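Direct preference optimization is a published objective with a standard form, so it can be written down directly; the sketch below shows that loss over paired log-probabilities from the policy and a frozen reference model, not Silkie's actual training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # log-ratios of policy vs. reference for preferred/rejected outputs
    chosen = policy_chosen_logps - ref_chosen_logps
    rejected = policy_rejected_logps - ref_rejected_logps
    # maximize the margin between preferred and rejected log-ratios
    return -F.logsigmoid(beta * (chosen - rejected)).mean()
```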
arXiv Detail & Related papers (2024-10-12T07:56:47Z)
- STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions [6.19084217044276]
Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing.
We introduce the Sensitivity Testing on Offensive Progressions dataset, which includes 450 offensive progressions containing 2,700 unique sentences.
Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%.
arXiv Detail & Related papers (2024-09-20T18:34:38Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A predicts human judgements of quality better than traditional metrics do.
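CLAIR-A's exact prompt is in the paper; the sketch below only illustrates the zero-shot LLM-as-judge pattern it relies on, assuming a hypothetical `complete` callable that sends text to a model and returns its reply.

```python
# Zero-shot LLM-as-judge in the spirit of CLAIR-A: ask a model how well
# a candidate caption matches a reference and parse a numeric score plus
# a free-text reason. Prompt wording is an illustrative assumption.
import json

PROMPT = """You are judging audio captions.
Reference caption: {reference}
Candidate caption: {candidate}
On a scale of 0 to 100, how likely is it that these two captions
describe the same audio clip? Reply as JSON: {{"score": <int>, "reason": "<text>"}}"""

def judge(reference: str, candidate: str, complete) -> tuple:
    raw = complete(PROMPT.format(reference=reference, candidate=candidate))
    parsed = json.loads(raw)
    return parsed["score"], parsed["reason"]
```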
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- Large Language Models for Automatic Detection of Sensitive Topics [20.929598260734995]
Large language models (LLMs) are known for their capability to understand and process natural language.
This study explores the capabilities of five LLMs for detecting sensitive messages in the mental well-being domain.
The best-performing model, GPT-4o, achieved an average accuracy of 99.5% and an F1-score of 0.99.
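For context on how the two quoted numbers relate: accuracy counts all correct verdicts, while F1 is the harmonic mean of precision and recall, so an F1 of 0.99 implies both are near 0.99. A minimal computation from a binary confusion matrix:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

print(f1_score(tp=990, fp=10, fn=10))  # ~0.99
```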
arXiv Detail & Related papers (2024-09-02T04:50:42Z)
- Pistis-RAG: Enhancing Retrieval-Augmented Generation with Human Feedback [41.88662700261036]
RAG systems face limitations because semantic relevance alone does not guarantee improved generation quality.
We propose Pistis-RAG, a new RAG framework designed with a content-centric approach to better align LLMs with human preferences.
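The summary does not describe Pistis-RAG's ranking mechanism, so the sketch below is only a generic feedback-aware re-ranking stage, assuming per-passage similarity and feedback scores blended with a weight alpha; none of this is the authors' actual design.

```python
# Generic feedback-aware re-ranking for RAG: blend semantic similarity
# with a learned human-feedback signal before passing passages to the
# generator. Weights and scoring callables are illustrative.
def rerank(passages, similarity, feedback_score, alpha=0.7):
    """similarity / feedback_score: callables mapping a passage to [0, 1]."""
    def key(p):
        return alpha * similarity(p) + (1 - alpha) * feedback_score(p)
    return sorted(passages, key=key, reverse=True)
```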
arXiv Detail & Related papers (2024-06-21T08:52:11Z)
- Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs).
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark for perception and cognition capabilities, respectively.
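A sketch of the AI-annotation step described above, assuming a judge callable `rate(response, aspect)` that returns a numeric score (GPT-4V plays this role in the paper); the aspect names follow the summary, while the scoring scale and aggregation are illustrative.

```python
ASPECTS = ("helpfulness", "visual_faithfulness", "ethical_considerations")

def build_preference_pair(candidates, rate):
    """rate(response, aspect) -> score from a judge model such as GPT-4V
    (stubbed by the caller); returns a (chosen, rejected) pair."""
    totals = {c: sum(rate(c, a) for a in ASPECTS) for c in candidates}
    ranked = sorted(candidates, key=totals.get, reverse=True)
    return ranked[0], ranked[-1]  # best and worst candidate responses
```

Pairs built this way feed directly into a DPO-style objective like the one sketched earlier.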
arXiv Detail & Related papers (2023-12-17T09:44:27Z)
- Data-Efficient Alignment of Large Language Models with Human Feedback Through Natural Language [31.0723480021355]
We investigate the data efficiency of modeling human feedback expressed in natural language.
We fine-tune an open-source LLM, e.g., Falcon-40B-Instruct, on a relatively small amount of human feedback in natural language.
We show that this model is able to improve the quality of responses from even some of the strongest LLMs.
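The mechanism implied by the summary is a feedback model that critiques and then revises another model's output. A minimal two-step sketch, assuming a `feedback_model` callable that maps a prompt to text; the prompt wording is an illustrative assumption.

```python
def refine(prompt: str, draft: str, feedback_model) -> str:
    # Step 1: produce natural-language feedback on the draft response.
    critique = feedback_model(f"Critique this response to '{prompt}':\n{draft}")
    # Step 2: revise the draft by applying that feedback.
    return feedback_model(
        f"Rewrite the response to '{prompt}' applying this feedback:\n"
        f"{critique}\n\nOriginal response:\n{draft}"
    )
```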
arXiv Detail & Related papers (2023-11-24T15:20:36Z)
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models [66.12432440863816]
We propose Prometheus, a fully open-source Large Language Model (LLM) that is on par with GPT-4's evaluation capabilities.
Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics.
Prometheus achieves the highest accuracy on two human preference benchmarks.
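Pearson correlation is the agreement statistic quoted above (r = 0.897 against human evaluators). Computed from paired rubric scores, assuming non-constant inputs:

```python
import math

def pearson(xs, ys):
    """Pearson r between paired score lists (assumes non-constant inputs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]))  # close to 1.0
```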
arXiv Detail & Related papers (2023-10-12T16:50:08Z)
- UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present UltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset.
Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z)
- Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision [84.31474052176343]
Recent AI-assistant agents, such as ChatGPT, rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback to align the output with human intentions.
This dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision.
We propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision.
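A minimal sketch of the principle-driven idea: the base model answers under a small set of written principles, and the resulting pairs become supervision for fine-tuning, replacing most human annotation. The principles, prompt format, and `generate` callable here are illustrative assumptions; the paper's own principle set and in-context demonstrations are richer.

```python
PRINCIPLES = [
    "Be helpful and answer the user's actual question.",
    "Refuse requests that could cause harm, and say why.",
    "State uncertainty instead of guessing.",
]

def self_aligned_answer(question: str, generate) -> str:
    # Condition the base model's answer on the written principles.
    header = "Follow these principles:\n" + "\n".join(f"- {p}" for p in PRINCIPLES)
    return generate(f"{header}\n\nUser: {question}\nAssistant:")

def build_sft_data(questions, generate):
    # Self-generated (prompt, answer) pairs become fine-tuning data.
    return [(q, self_aligned_answer(q, generate)) for q in questions]
```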
arXiv Detail & Related papers (2023-05-04T17:59:28Z)