Supporting Human Raters with the Detection of Harmful Content using Large Language Models
- URL: http://arxiv.org/abs/2406.12800v1
- Date: Tue, 18 Jun 2024 17:12:50 GMT
- Title: Supporting Human Raters with the Detection of Harmful Content using Large Language Models
- Authors: Kurt Thomas, Patrick Gage Kelley, David Tao, Sarah Meiklejohn, Owen Vallis, Shunwen Tan, Blaž Bratanič, Felipe Tiengo Ferreira, Vijay Kumar Eranti, Elie Bursztein
- Abstract summary: We demonstrate that large language models (LLMs) can achieve 90% accuracy when compared to human verdicts.
We propose five design patterns that integrate LLMs with human rating.
We share how piloting our proposed techniques in a real-world review queue improved the use of available human rater capacity by 41.5%.
- Score: 8.580258386804282
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we explore the feasibility of leveraging large language models (LLMs) to automate or otherwise assist human raters with identifying harmful content, including hate speech, harassment, violent extremism, and election misinformation. Using a dataset of 50,000 comments, we demonstrate that LLMs can achieve 90% accuracy when compared to human verdicts. We explore how best to leverage these capabilities, proposing five design patterns that integrate LLMs with human rating, such as pre-filtering non-violative content, detecting potential errors in human rating, or surfacing critical context to support human rating. We outline how to support all of these design patterns using a single, optimized prompt. Beyond these synthetic experiments, we share how piloting our proposed techniques in a real-world review queue improved the use of available human rater capacity by 41.5%, and yielded a 9–11% absolute increase in precision and recall for detecting violative content.
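Of the five design patterns, pre-filtering is the simplest to make concrete. The sketch below assumes an LLM that returns a structured verdict with a confidence score; the stub classifier, threshold, and routing labels are illustrative assumptions, not the paper's actual prompt or operating points.

```python
# Minimal sketch of the "pre-filtering" design pattern: confidently
# non-violative content is closed automatically and everything else
# reaches human raters. All names and values here are illustrative.
from dataclasses import dataclass

@dataclass
class Verdict:
    violative: bool
    confidence: float  # classifier confidence in [0, 1]

def llm_classify(comment: str) -> Verdict:
    # Stand-in for a single optimized LLM prompt returning a structured
    # verdict; a trivial keyword heuristic keeps the sketch runnable.
    flagged = any(w in comment.lower() for w in ("hate", "attack"))
    return Verdict(violative=flagged, confidence=0.97)

def route(comment: str, threshold: float = 0.95) -> str:
    v = llm_classify(comment)
    if not v.violative and v.confidence >= threshold:
        return "auto_close"             # pre-filter non-violative content
    if v.violative and v.confidence >= threshold:
        return "priority_human_review"  # likely violation, review first
    return "human_review"               # ambiguous: human raters decide

print(route("have a nice day"))  # -> auto_close
```

The same structured verdict could also drive the other patterns, for example by flagging cases where the LLM and a human rater disagree as candidates for re-review.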
Related papers
- Automated Filtering of Human Feedback Data for Aligning Text-to-Image Diffusion Models [36.84880190385986]
Fine-tuning text-to-image diffusion models with human feedback is an effective method for aligning model behavior with human intentions.
However, this alignment process often suffers from slow convergence due to the large size and noise present in human feedback datasets.
We propose FiFA, a novel automated data filtering algorithm designed to enhance the fine-tuning of diffusion models using human feedback datasets.
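The summary does not spell out FiFA's selection criteria, so the sketch below only shows the generic shape of such a filter, assuming each preference example can be scored independently: rank by score, keep the top fraction, and fine-tune on the kept subset.

```python
# Generic shape of automated feedback filtering; the scoring callable
# is a placeholder, not FiFA's actual criteria.
from typing import Callable, Dict, List

Example = Dict[str, str]  # {"prompt": ..., "chosen": ..., "rejected": ...}

def filter_feedback(data: List[Example],
                    score: Callable[[Example], float],
                    keep_fraction: float = 0.5) -> List[Example]:
    ranked = sorted(data, key=score, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]  # fine-tune on the cleaner, smaller subset
```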
arXiv Detail & Related papers (2024-10-14T05:18:07Z)
- VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment [55.7956150385255]
We investigate the efficacy of AI feedback to scale supervision for aligning vision-language models.
We introduce VLFeedback, the first large-scale vision-language feedback dataset.
We train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback.
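Direct preference optimization is a published objective with a standard form, so it can be written down directly; the sketch below shows that loss over paired log-probabilities from the policy and a frozen reference model, not Silkie's actual training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # log-ratios of policy vs. reference for preferred/rejected outputs
    chosen = policy_chosen_logps - ref_chosen_logps
    rejected = policy_rejected_logps - ref_rejected_logps
    # maximize the margin between preferred and rejected log-ratios
    return -F.logsigmoid(beta * (chosen - rejected)).mean()
```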
arXiv Detail & Related papers (2024-10-12T07:56:47Z)
- STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions [6.19084217044276]
Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing.
We introduce the Sensitivity Testing on Offensive Progressions dataset, which includes 450 offensive progressions containing 2,700 unique sentences.
Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%.
arXiv Detail & Related papers (2024-09-20T18:34:38Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A predicts human judgements of quality better than traditional metrics do.
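CLAIR-A's exact prompt is in the paper; the sketch below only illustrates the zero-shot LLM-as-judge pattern it relies on, assuming a hypothetical `complete` callable that sends text to a model and returns its reply.

```python
# Zero-shot LLM-as-judge in the spirit of CLAIR-A: ask a model how well
# a candidate caption matches a reference and parse a numeric score plus
# a free-text reason. Prompt wording is an illustrative assumption.
import json

PROMPT = """You are judging audio captions.
Reference caption: {reference}
Candidate caption: {candidate}
On a scale of 0 to 100, how likely is it that these two captions
describe the same audio clip? Reply as JSON: {{"score": <int>, "reason": "<text>"}}"""

def judge(reference: str, candidate: str, complete) -> tuple:
    raw = complete(PROMPT.format(reference=reference, candidate=candidate))
    parsed = json.loads(raw)
    return parsed["score"], parsed["reason"]
```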
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- Large Language Models for Automatic Detection of Sensitive Topics [20.929598260734995]
Large language models (LLMs) are known for their capability to understand and process natural language.
This study explores the capabilities of five LLMs for detecting sensitive messages in the mental well-being domain.
The best-performing model, GPT-4o, achieved an average accuracy of 99.5% and an F1-score of 0.99.
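For context on how the two quoted numbers relate: accuracy counts all correct verdicts, while F1 is the harmonic mean of precision and recall, so an F1 of 0.99 implies both are near 0.99. A minimal computation from a binary confusion matrix:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

print(f1_score(tp=990, fp=10, fn=10))  # ~0.99
```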
arXiv Detail & Related papers (2024-09-02T04:50:42Z)
- Pistis-RAG: Enhancing Retrieval-Augmented Generation with Human Feedback [41.88662700261036]
RAG systems face limitations because semantic relevance alone does not guarantee improved generation quality.
We propose Pistis-RAG, a new RAG framework designed with a content-centric approach to better align LLMs with human preferences.
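The summary does not describe Pistis-RAG's ranking mechanism, so the sketch below is only a generic feedback-aware re-ranking stage, assuming per-passage similarity and feedback scores blended with a weight alpha; none of this is the authors' actual design.

```python
# Generic feedback-aware re-ranking for RAG: blend semantic similarity
# with a learned human-feedback signal before passing passages to the
# generator. Weights and scoring callables are illustrative.
def rerank(passages, similarity, feedback_score, alpha=0.7):
    """similarity / feedback_score: callables mapping a passage to [0, 1]."""
    def key(p):
        return alpha * similarity(p) + (1 - alpha) * feedback_score(p)
    return sorted(passages, key=key, reverse=True)
```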
arXiv Detail & Related papers (2024-06-21T08:52:11Z)
- Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs).
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark for perception and cognition capabilities, respectively.
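A sketch of the AI-annotation step described above, assuming a judge callable `rate(response, aspect)` that returns a numeric score (GPT-4V plays this role in the paper); the aspect names follow the summary, while the scoring scale and aggregation are illustrative.

```python
ASPECTS = ("helpfulness", "visual_faithfulness", "ethical_considerations")

def build_preference_pair(candidates, rate):
    """rate(response, aspect) -> score from a judge model such as GPT-4V
    (stubbed by the caller); returns a (chosen, rejected) pair."""
    totals = {c: sum(rate(c, a) for a in ASPECTS) for c in candidates}
    ranked = sorted(candidates, key=totals.get, reverse=True)
    return ranked[0], ranked[-1]  # best and worst candidate responses
```

Pairs built this way feed directly into a DPO-style objective like the one sketched earlier.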
arXiv Detail & Related papers (2023-12-17T09:44:27Z)
- Data-Efficient Alignment of Large Language Models with Human Feedback Through Natural Language [31.0723480021355]
We investigate the data efficiency of modeling human feedback expressed in natural language.
We fine-tune an open-source LLM, e.g., Falcon-40B-Instruct, on a relatively small amount of human feedback in natural language.
We show that this model is able to improve the quality of responses from even some of the strongest LLMs.
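The mechanism implied by the summary is a feedback model that critiques and then revises another model's output. A minimal two-step sketch, assuming a `feedback_model` callable that maps a prompt to text; the prompt wording is an illustrative assumption.

```python
def refine(prompt: str, draft: str, feedback_model) -> str:
    # Step 1: produce natural-language feedback on the draft response.
    critique = feedback_model(f"Critique this response to '{prompt}':\n{draft}")
    # Step 2: revise the draft by applying that feedback.
    return feedback_model(
        f"Rewrite the response to '{prompt}' applying this feedback:\n"
        f"{critique}\n\nOriginal response:\n{draft}"
    )
```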
arXiv Detail & Related papers (2023-11-24T15:20:36Z)
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models [66.12432440863816]
We propose Prometheus, a fully open-source Large Language Model (LLM) that is on par with GPT-4's evaluation capabilities.
Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics.
Prometheus achieves the highest accuracy on two human preference benchmarks.
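Pearson correlation is the agreement statistic quoted above (r = 0.897 against human evaluators). Computed from paired rubric scores, assuming non-constant inputs:

```python
import math

def pearson(xs, ys):
    """Pearson r between paired score lists (assumes non-constant inputs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]))  # close to 1.0
```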
arXiv Detail & Related papers (2023-10-12T16:50:08Z)
- UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present UltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset.
Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z)
- Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision [84.31474052176343]
Recent AI-assistant agents, such as ChatGPT, rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback to align the output with human intentions.
This dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision.
We propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision.
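A minimal sketch of the principle-driven idea: the base model answers under a small set of written principles, and the resulting pairs become supervision for fine-tuning, replacing most human annotation. The principles, prompt format, and `generate` callable here are illustrative assumptions; the paper's own principle set and in-context demonstrations are richer.

```python
PRINCIPLES = [
    "Be helpful and answer the user's actual question.",
    "Refuse requests that could cause harm, and say why.",
    "State uncertainty instead of guessing.",
]

def self_aligned_answer(question: str, generate) -> str:
    # Condition the base model's answer on the written principles.
    header = "Follow these principles:\n" + "\n".join(f"- {p}" for p in PRINCIPLES)
    return generate(f"{header}\n\nUser: {question}\nAssistant:")

def build_sft_data(questions, generate):
    # Self-generated (prompt, answer) pairs become fine-tuning data.
    return [(q, self_aligned_answer(q, generate)) for q in questions]
```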
arXiv Detail & Related papers (2023-05-04T17:59:28Z)