Despite "super-human" performance, current LLMs are unsuited for decisions about ethics and safety
- URL: http://arxiv.org/abs/2212.06295v1
- Date: Tue, 13 Dec 2022 00:29:45 GMT
- Title: Despite "super-human" performance, current LLMs are unsuited for decisions about ethics and safety
- Authors: Joshua Albrecht, Ellie Kitanidis, Abraham J. Fetterman
- Abstract summary: We provide a simple new prompting strategy that leads to yet another supposedly "super-human" result.
We find that relying on average performance to judge capabilities can be highly misleading.
We also observe signs of inverse scaling with model size on some examples, and show that prompting models to "explain their reasoning" often leads to alarming justifications of unethical actions.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have exploded in popularity in the past few
years and have achieved undeniably impressive results on benchmarks as varied
as question answering and text summarization. We provide a simple new prompting
strategy that leads to yet another supposedly "super-human" result, this time
outperforming humans at common sense ethical reasoning (as measured by accuracy
on a subset of the ETHICS dataset). Unfortunately, we find that relying on
average performance to judge capabilities can be highly misleading. LLM errors
differ systematically from human errors in ways that make it easy to craft
adversarial examples, or even perturb existing examples to flip the output
label. We also observe signs of inverse scaling with model size on some
examples, and show that prompting models to "explain their reasoning" often
leads to alarming justifications of unethical actions. Our results highlight
how human-like performance does not necessarily imply human-like understanding
or reasoning.
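The abstract's central point, that high average accuracy can coexist with systematic, easily exploited errors, can be illustrated with a toy sketch. This is not the paper's code: the stub "LLM" below is a hypothetical keyword heuristic, and the scenarios are invented, but it shows how a model can score perfectly on a benchmark while collapsing under meaning-preserving perturbations that target its error pattern.

```python
# Toy illustration (not the paper's code): why average accuracy can mask
# systematically exploitable errors. We stub an "LLM" as a rule that labels
# a scenario unethical iff it contains an explicit harm keyword -- a shallow
# heuristic that looks strong on average but is trivial to perturb.

HARM_WORDS = {"steal", "hurt", "lie"}

def stub_llm_is_unethical(scenario: str) -> bool:
    """Hypothetical stand-in for an LLM ethics judgment."""
    return any(w in scenario.lower().split() for w in HARM_WORDS)

# A small evaluation set: (scenario, true_label) pairs.
eval_set = [
    ("I steal bread from the store", True),
    ("I hurt my brother for fun", True),
    ("I lie to my boss about being sick", True),
    ("I water my neighbor's plants", False),
    ("I return a lost wallet", False),
]

avg_acc = sum(stub_llm_is_unethical(s) == y for s, y in eval_set) / len(eval_set)

# Adversarial perturbations: same meaning, but the surface keyword is
# removed (or a benign use of it is added), exploiting the error pattern.
adversarial_set = [
    ("I take bread from the store without paying", True),  # "steal" paraphrased away
    ("I lie down on the couch all afternoon", False),      # benign use of "lie"
]

adv_acc = sum(stub_llm_is_unethical(s) == y
              for s, y in adversarial_set) / len(adversarial_set)

print(f"average accuracy:     {avg_acc:.0%}")   # 100% on the benign set
print(f"adversarial accuracy: {adv_acc:.0%}")   # 0% under targeted perturbations
```

The same averaging-vs-worst-case gap is what makes "super-human" benchmark numbers a poor basis for safety-critical deployment.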
Related papers
- Explicit and Implicit Large Language Model Personas Generate Opinions but Fail to Replicate Deeper Perceptions and Biases [14.650234624251716]
Large language models (LLMs) are increasingly being used in human-centered social scientific tasks.
These tasks are highly subjective and dependent on human factors, such as one's environment, attitudes, beliefs, and lived experiences.
We examine the role of prompting LLMs with human-like personas and ask the models to answer as if they were a specific human.
arXiv Detail & Related papers (2024-06-20T16:24:07Z)
- Relevant or Random: Can LLMs Truly Perform Analogical Reasoning? [44.158548608820624]
Analogical reasoning is a unique ability of humans to address unfamiliar challenges by transferring strategies from relevant past experiences.
The NLP community has also recently found that self-generating relevant examples in the context can help large language models (LLMs) better solve a given problem than hand-crafted prompts.
We show that self-generated random examples can surprisingly achieve comparable or even better performance.
arXiv Detail & Related papers (2024-04-19T09:15:07Z)
- Scaling Data Diversity for Fine-Tuning Language Models in Human Alignment [84.32768080422349]
Alignment with human preference prevents large language models from generating misleading or toxic content.
We propose a new formulation of prompt diversity that exhibits a linear correlation with the final performance of LLMs after fine-tuning.
arXiv Detail & Related papers (2024-03-17T07:08:55Z)
- Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models [7.779982757267302]
We investigate the generality of analogy-making abilities previously claimed for large language models (LLMs).
We show that while the performance of humans remains high for all the problems, the GPT models' performance declines sharply on the counterfactual set.
arXiv Detail & Related papers (2024-02-14T05:52:23Z)
- The ART of LLM Refinement: Ask, Refine, and Trust [85.75059530612882]
We propose a refinement strategy for reasoning called ART: Ask, Refine, and Trust.
It asks necessary questions to decide when an LLM should refine its output.
It achieves a performance gain of +5 points over self-refinement baselines.
arXiv Detail & Related papers (2023-11-14T07:26:32Z)
- Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z)
- MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks [49.60689355674541]
A rich literature in cognitive science has studied people's causal and moral intuitions.
This work has revealed a number of factors that systematically influence people's judgments.
We test whether large language models (LLMs) make causal and moral judgments about text-based scenarios that align with human participants.
arXiv Detail & Related papers (2023-10-30T15:57:32Z)
- Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis [127.85293480405082]
The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges.
Existing alignment methods usually direct LLMs toward favorable outcomes by using human-annotated, flawless instruction-response pairs.
This study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content to learn the reasons for mistakes and how to avoid them.
arXiv Detail & Related papers (2023-10-16T14:59:10Z)
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood.
We find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
- Humanly Certifying Superhuman Classifiers [8.736864280782592]
Estimating the performance of a machine learning system is a longstanding challenge in artificial intelligence research.
We develop a theory for estimating a classifier's accuracy relative to an oracle, using only imperfect human annotations as a reference.
Our analysis provides a simple recipe for detecting and certifying superhuman performance in this setting.
arXiv Detail & Related papers (2021-09-16T11:00:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.