Despite "super-human" performance, current LLMs are unsuited for
decisions about ethics and safety
- URL: http://arxiv.org/abs/2212.06295v1
- Date: Tue, 13 Dec 2022 00:29:45 GMT
- Title: Despite "super-human" performance, current LLMs are unsuited for
decisions about ethics and safety
- Authors: Joshua Albrecht, Ellie Kitanidis, Abraham J. Fetterman
- Abstract summary: We provide a simple new prompting strategy that leads to yet another supposedly "super-human" result.
We find that relying on average performance to judge capabilities can be highly misleading.
We also observe signs of inverse scaling with model size on some examples, and show that prompting models to "explain their reasoning" often leads to alarming justifications of unethical actions.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have exploded in popularity in the past few
years and have achieved undeniably impressive results on benchmarks as varied
as question answering and text summarization. We provide a simple new prompting
strategy that leads to yet another supposedly "super-human" result, this time
outperforming humans at common sense ethical reasoning (as measured by accuracy
on a subset of the ETHICS dataset). Unfortunately, we find that relying on
average performance to judge capabilities can be highly misleading. LLM errors
differ systematically from human errors in ways that make it easy to craft
adversarial examples, or even perturb existing examples to flip the output
label. We also observe signs of inverse scaling with model size on some
examples, and show that prompting models to "explain their reasoning" often
leads to alarming justifications of unethical actions. Our results highlight
how human-like performance does not necessarily imply human-like understanding
or reasoning.
Related papers
- Smaller Large Language Models Can Do Moral Self-Correction [7.899707459486236]
Self-correction is one of the most amazing emerging capabilities of Large Language Models (LLMs)
Moral self-correction is a post-hoc approach correcting unethical generations without requiring a gradient update.
Previous works have shown that LLMs can self-debias, and it has been reported that small models, i.e., those with less than 22B parameters, are not capable of moral self-correction.
arXiv Detail & Related papers (2024-10-30T22:58:57Z) - How Aligned are Generative Models to Humans in High-Stakes Decision-Making? [10.225573060836478]
Large generative models (LMs) are increasingly being considered for high-stakes decision-making.
This work considers how such models compare to humans and predictive AI models on a specific case of recidivism prediction.
arXiv Detail & Related papers (2024-10-20T19:00:59Z) - Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges [6.609843448260634]
The LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models.
This paper focuses on a clean scenario in which inter-human agreement is high.
We identify vulnerabilities in judge models, such as their sensitivity to prompt complexity and length, and a tendency toward leniency.
arXiv Detail & Related papers (2024-06-18T13:49:54Z) - Scaling Data Diversity for Fine-Tuning Language Models in Human Alignment [84.32768080422349]
Alignment with human preference prevents large language models from generating misleading or toxic content.
We propose a new formulation of prompt diversity, implying a linear correlation with the final performance of LLMs after fine-tuning.
arXiv Detail & Related papers (2024-03-17T07:08:55Z) - Using Counterfactual Tasks to Evaluate the Generality of Analogical
Reasoning in Large Language Models [7.779982757267302]
We investigate the generality of analogy-making abilities previously claimed for large language models (LLMs)
We show that while the performance of humans remains high for all the problems, the GPT models' performance declines sharply on the counterfactual set.
arXiv Detail & Related papers (2024-02-14T05:52:23Z) - The ART of LLM Refinement: Ask, Refine, and Trust [85.75059530612882]
We propose a reasoning with refinement objective called ART: Ask, Refine, and Trust.
It asks necessary questions to decide when an LLM should refine its output.
It achieves a performance gain of +5 points over self-refinement baselines.
arXiv Detail & Related papers (2023-11-14T07:26:32Z) - Do LLMs exhibit human-like response biases? A case study in survey
design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z) - MoCa: Measuring Human-Language Model Alignment on Causal and Moral
Judgment Tasks [49.60689355674541]
A rich literature in cognitive science has studied people's causal and moral intuitions.
This work has revealed a number of factors that systematically influence people's judgments.
We test whether large language models (LLMs) make causal and moral judgments about text-based scenarios that align with human participants.
arXiv Detail & Related papers (2023-10-30T15:57:32Z) - Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake
Analysis [127.85293480405082]
The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges.
Existing alignment methods usually direct LLMs toward the favorable outcomes by utilizing human-annotated, flawless instruction-response pairs.
This study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content to learn the reasons for mistakes and how to avoid them.
arXiv Detail & Related papers (2023-10-16T14:59:10Z) - Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood.
We find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z) - Humanly Certifying Superhuman Classifiers [8.736864280782592]
Estimating the performance of a machine learning system is a longstanding challenge in artificial intelligence research.
We develop a theory for estimating the accuracy compared to the oracle, using only imperfect human annotations for reference.
Our analysis provides a simple recipe for detecting and certifying superhuman performance in this setting.
arXiv Detail & Related papers (2021-09-16T11:00:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.