Annotation alignment: Comparing LLM and human annotations of conversational safety
- URL: http://arxiv.org/abs/2406.06369v3
- Date: Mon, 24 Jun 2024 23:23:22 GMT
- Title: Annotation alignment: Comparing LLM and human annotations of conversational safety
- Authors: Rajiv Movva, Pang Wei Koh, Emma Pierson
- Abstract summary: We show that GPT-4 achieves a Pearson correlation of $r = 0.59$ with the average annotator rating, higher than the median annotator's correlation with the average.
There is substantial idiosyncratic variation in correlation *within* groups, suggesting that race & gender do not fully capture differences in alignment.
- Score: 10.143093546513857
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: To what extent do LLMs align with human perceptions of safety? We study this question via *annotation alignment*, the extent to which LLMs and humans agree when annotating the safety of user-chatbot conversations. We leverage the recent DICES dataset (Aroyo et al., 2023), in which 350 conversations are each rated for safety by 112 annotators spanning 10 race-gender groups. GPT-4 achieves a Pearson correlation of $r = 0.59$ with the average annotator rating, higher than the median annotator's correlation with the average ($r=0.51$). We show that larger datasets are needed to resolve whether GPT-4 exhibits disparities in how well it correlates with demographic groups. Also, there is substantial idiosyncratic variation in correlation *within* groups, suggesting that race & gender do not fully capture differences in alignment. Finally, we find that GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another.
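As a rough illustration of the alignment measure described above, the sketch below computes the Pearson correlation between an LLM's safety ratings and the average human rating, and contrasts it with each annotator's correlation with the average of the remaining annotators. The ratings matrix and rating scale are synthetic placeholders rather than the actual DICES data, and the leave-one-out averaging for annotators is an assumption about the protocol, not necessarily the paper's exact method.

```python
# Minimal sketch of annotation alignment, assuming a DICES-style ratings
# matrix (conversations x annotators) and a vector of LLM safety ratings.
# The data below are synthetic placeholders, not the real DICES annotations.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_conversations, n_annotators = 350, 112

# Hypothetical safety ratings on an ordinal scale, e.g. -1 (unsafe) to 1 (safe).
human_ratings = rng.integers(-1, 2, size=(n_conversations, n_annotators)).astype(float)
llm_ratings = rng.integers(-1, 2, size=n_conversations).astype(float)

# LLM alignment: correlation with the average annotator rating per conversation.
mean_rating = human_ratings.mean(axis=1)
r_llm, _ = pearsonr(llm_ratings, mean_rating)

# Annotator alignment: each rater's correlation with the mean of the others
# (leave-one-out, so a rater is not correlated with an average that includes them).
per_annotator_r = []
for j in range(n_annotators):
    mean_of_others = np.delete(human_ratings, j, axis=1).mean(axis=1)
    r_j, _ = pearsonr(human_ratings[:, j], mean_of_others)
    per_annotator_r.append(r_j)

print(f"LLM vs. average annotator rating: r = {r_llm:.2f}")
print(f"Median annotator vs. average of others: r = {np.median(per_annotator_r):.2f}")
```

On real annotations, the first quantity corresponds to the paper's reported r = 0.59 for GPT-4; the second is analogous to the median annotator's r = 0.51, though the exact averaging protocol used in the paper may differ from this sketch.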
Related papers
- Analyzing Political Bias in LLMs via Target-Oriented Sentiment Classification [4.352835414206441]
Political biases encoded by LLMs might have detrimental effects on downstream applications.
We propose a new approach leveraging the observation that LLM sentiment predictions vary with the target entity in the same sentence.
We insert 1319 demographically and politically diverse politician names in 450 political sentences and predict target-oriented sentiment using seven models in six widely spoken languages.
arXiv Detail & Related papers (2025-05-26T10:01:24Z)
- How Inclusively do LMs Perceive Social and Moral Norms? [5.302888878095751]
Language models (LMs) are used in decision-making systems and as interactive assistants.
We investigate how inclusively LMs perceive norms across demographic groups.
We find notable disparities in LM responses, with younger, higher-income groups showing closer alignment.
arXiv Detail & Related papers (2025-02-04T20:24:17Z)
- Assessing Gender Bias in LLMs: Comparing LLM Outputs with Human Perceptions and Official Statistics [0.0]
This study investigates gender bias in large language models (LLMs).
We compare their gender perception to that of human respondents, U.S. Bureau of Labor Statistics data, and a 50% no-bias benchmark.
arXiv Detail & Related papers (2024-11-20T22:43:18Z)
- With a Grain of SALT: Are LLMs Fair Across Social Dimensions? [3.979019316355144]
This paper presents an analysis of biases in open-source Large Language Models (LLMs) across various genders, religions, and races.
We introduce a methodology for generating a bias detection dataset using seven bias triggers: General Debate, Positioned Debate, Career Advice, Story Generation, Problem-Solving, Cover-Letter Writing, and CV Generation.
We anonymise the LLM-generated text associated with each group using GPT-4o-mini and do a pairwise comparison using GPT-4o-as-a-Judge.
arXiv Detail & Related papers (2024-10-16T12:22:47Z)
- Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking [56.275521022148794]
Post-training methods claim superior alignment by virtue of better correspondence with human pairwise preferences.
We attempt to answer the question -- do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not?
We find that (1) LLM-judge preferences do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning stage of post-training, not the preference optimization (PO) stage, has the greatest impact on alignment.
arXiv Detail & Related papers (2024-09-23T17:58:07Z)
- How Aligned are Different Alignment Metrics? [6.172390472790253]
We analyze visual data from Brain-Score, together with human similarity and alignment metrics.
We find that pairwise correlations between neural scores and behavioral scores are quite low and sometimes even negative.
Our results underline the importance of integrative benchmarking, but also raise questions about how to correctly combine and aggregate individual metrics.
arXiv Detail & Related papers (2024-07-10T10:36:11Z)
- PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference [14.530969790956242]
The PKU-SafeRLHF dataset is designed to promote research on safety alignment in large language models (LLMs).
As a sibling project to SafeRLHF and BeaverTails, we separate annotations of helpfulness and harmlessness for question-answering pairs.
Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels for 19 harm categories and three severity levels ranging from minor to severe.
arXiv Detail & Related papers (2024-06-20T18:37:36Z)
- WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs).
In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia.
We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages.
arXiv Detail & Related papers (2024-06-19T20:13:42Z)
- Scaling Data Diversity for Fine-Tuning Language Models in Human Alignment [84.32768080422349]
Alignment with human preference prevents large language models from generating misleading or toxic content.
We propose a new formulation of prompt diversity that implies a linear correlation with the final performance of LLMs after fine-tuning.
arXiv Detail & Related papers (2024-03-17T07:08:55Z)
- Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits.
In contrast, advanced LLMs like GPT-4-Turbo emphasize correctness, clarity, and harmlessness more.
We show that preference-based evaluation can be intentionally manipulated.
arXiv Detail & Related papers (2024-02-17T14:34:31Z)
- Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z)
- A Closer Look into Automatic Evaluation Using Large Language Models [75.49360351036773]
We discuss how details in the evaluation process change how well the ratings given by LLMs correlate with human ratings.
We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more aligned with human ratings.
We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal.
arXiv Detail & Related papers (2023-10-09T12:12:55Z)
- Statistical Knowledge Assessment for Large Language Models [79.07989821512128]
Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers?
We propose KaRR, a statistical approach to assess factual knowledge for LLMs.
Our results reveal that the knowledge in LLMs with the same backbone architecture adheres to the scaling law, while tuning on instruction-following data sometimes compromises the model's capability to generate factually correct text reliably.
arXiv Detail & Related papers (2023-05-17T18:54:37Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.