Measuring Progress on Scalable Oversight for Large Language Models
- URL: http://arxiv.org/abs/2211.03540v1
- Date: Fri, 4 Nov 2022 17:03:49 GMT
- Title: Measuring Progress on Scalable Oversight for Large Language Models
- Authors: Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit,
Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna
Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela
Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson
Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal
Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph,
Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu,
Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy
Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac
Hatfield-Dodds, Ben Mann, Jared Kaplan
- Abstract summary: We present an experimental design centered on choosing tasks for which human specialists succeed but unaided humans and current general AI systems fail.
We find that human participants who interact with an unreliable large-language-model dialog assistant through chat substantially outperform both the model alone and their own unaided performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing safe and useful general-purpose AI systems will require us to make
progress on scalable oversight: the problem of supervising systems that
potentially outperform us on most skills relevant to the task at hand.
Empirical work on this problem is not straightforward, since we do not yet have
systems that broadly exceed our abilities. This paper discusses one of the
major ways we think about this problem, with a focus on how to turn it into one
that can be productively studied empirically. We first present an experimental
design centered on choosing tasks for which human specialists succeed but
unaided humans and current general AI systems fail. We then present a
proof-of-concept experiment meant to demonstrate a key feature of
this experimental design and show its viability with two question-answering
tasks: MMLU and time-limited QuALITY. On these tasks, we find that human
participants who interact with an unreliable large-language-model dialog
assistant through chat -- a trivial baseline strategy for scalable oversight --
substantially outperform both the model alone and their own unaided
performance. These results are an encouraging sign that scalable oversight will
be tractable to study with present models and bolster recent findings that
large language models can productively assist humans with difficult tasks.
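To make the design concrete, here is a minimal sketch (not the authors' code) of the three-condition comparison the experiment runs: the model alone, the human unaided, and the human assisted by the model through chat. The helper callables are hypothetical stand-ins.

```python
# Hedged sketch of the paper's three-condition comparison on a QA task.
# The callables ask_model, ask_human_unaided, and ask_human_with_chat
# are hypothetical stand-ins, not the authors' actual harness.

def accuracy(predictions, gold):
    """Fraction of predictions that match the gold answers."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def run_oversight_experiment(questions, gold, ask_model,
                             ask_human_unaided, ask_human_with_chat):
    return {
        # Baseline 1: the unreliable dialog assistant answers alone.
        "model_alone": accuracy([ask_model(q) for q in questions], gold),
        # Baseline 2: the human participant answers without help.
        "human_unaided": accuracy(
            [ask_human_unaided(q) for q in questions], gold),
        # Oversight condition: the human chats with the assistant
        # before committing to a final answer.
        "human_with_assistant": accuracy(
            [ask_human_with_chat(q) for q in questions], gold),
    }
```

The paper's headline result corresponds to `human_with_assistant` exceeding both baseline entries on MMLU and time-limited QuALITY.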
Related papers
- BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Bloom's-Taxonomy-Inspired Prompts [59.83547898874152]
We introduce BloomWise, a new prompting technique inspired by Bloom's taxonomy, to improve the performance of Large Language Models (LLMs).
The decision regarding the need to employ more sophisticated cognitive skills is based on self-evaluation performed by the LLM.
In extensive experiments across 4 popular math reasoning datasets, we have demonstrated the effectiveness of our proposed approach.
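Read literally, this suggests an escalation loop over Bloom's-taxonomy levels, gated by the model's own self-evaluation. A hedged sketch under that reading follows; the prompts, level names, and stopping rule are assumptions rather than the published BloomWise method.

```python
# Illustrative BloomWise-style loop: try prompts of increasing cognitive
# sophistication and let the model's self-evaluation decide when to stop.
# `llm` is an assumed text-in/text-out callable; prompt wording is invented.

BLOOM_LEVELS = ["remember", "understand", "apply",
                "analyze", "evaluate", "create"]

def bloomwise_solve(llm, problem):
    answer = ""
    for level in BLOOM_LEVELS:
        answer = llm(f"Using the '{level}' skill of Bloom's taxonomy, "
                     f"solve step by step: {problem}")
        # Self-evaluation: the model judges whether this level sufficed.
        verdict = llm(f"Problem: {problem}\nProposed solution: {answer}\n"
                      "Is the solution correct and complete? Answer yes or no.")
        if verdict.strip().lower().startswith("yes"):
            return answer  # no need to escalate further
    return answer  # fall back to the most sophisticated attempt
```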
arXiv Detail & Related papers (2024-10-05T09:27:52Z)
- SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 subproblems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z)
- Predicting and Understanding Human Action Decisions: Insights from Large Language Models and Cognitive Instance-Based Learning [0.0]
Large Language Models (LLMs) have demonstrated their capabilities across various tasks.
This paper exploits the reasoning and generative capabilities of LLMs to predict human behavior in two sequential decision-making tasks.
We compare the performance of LLMs with a cognitive instance-based learning model, which imitates human experiential decision-making.
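Since the comparison hinges on what an instance-based learning (IBL) model is, a compact sketch may help: past (timestamp, action, outcome) instances are weighted by recency-based activation and blended into a value per action. The decay and noise parameters below are illustrative defaults, not the paper's settings.

```python
import math
import random

def ibl_choose(memory, actions, t, d=0.5, noise=0.25):
    """Pick an action by blending past outcomes, IBL-style.

    memory: list of (timestamp, action, outcome); timestamps must precede t.
    """
    values = {}
    for a in actions:
        instances = [(ts, out) for ts, act, out in memory if act == a]
        if not instances:
            values[a] = float("inf")  # force exploration of untried actions
            continue
        # Activation: recent instances weigh more (power-law decay + noise).
        acts = [-d * math.log(t - ts) + random.gauss(0, noise)
                for ts, _ in instances]
        weights = [math.exp(x) for x in acts]
        total = sum(weights)
        # Blended value: activation-weighted average of observed outcomes.
        values[a] = sum(w * out
                        for w, (_, out) in zip(weights, instances)) / total
    return max(values, key=values.get)
```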
arXiv Detail & Related papers (2024-07-12T14:13:06Z)
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision [98.97575836717931]
Current AI alignment methodologies rely on human-provided demonstrations or judgments.
This raises a challenging research question: How can we keep improving the systems when their capabilities have surpassed the levels of humans?
arXiv Detail & Related papers (2024-03-14T15:12:38Z)
- On the Challenges and Opportunities in Generative AI [135.2754367149689]
We argue that current large-scale generative AI models do not sufficiently address several fundamental issues that hinder their widespread adoption across domains.
In this work, we aim to identify key unresolved challenges in modern generative AI paradigms that should be tackled to further enhance their capabilities, versatility, and reliability.
arXiv Detail & Related papers (2024-02-28T15:19:33Z)
- Solving the Right Problem is Key for Translational NLP: A Case Study in UMLS Vocabulary Insertion [12.855898113768998]
We study the case of UMLS vocabulary insertion, an important real-world task in which hundreds of thousands of new terms are added to the UMLS.
We introduce a new formulation for UMLS vocabulary insertion which mirrors the real-world task.
We also propose an effective rule-enhanced biomedical language model which enables important new model behavior.
arXiv Detail & Related papers (2023-11-25T19:35:53Z)
- Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake? [62.59699229202307]
Despite advances in AI, it remains a significant challenge to develop interactive task guidance systems.
We created a new multimodal benchmark dataset, Watch, Talk and Guide (WTaG) based on natural interaction between a human user and a human instructor.
We leveraged several foundation models to study to what extent these models can be quickly adapted to perceptually enabled task guidance.
arXiv Detail & Related papers (2023-11-01T15:13:49Z)
- Define, Evaluate, and Improve Task-Oriented Cognitive Capabilities for Instruction Generation Models [5.975913042883176]
Recent work studies the cognitive capabilities of language models through psychological tests designed for humans.
We formulate task-oriented cognitive capabilities, which are human-like cognitive capabilities that language models leverage to perform tasks.
arXiv Detail & Related papers (2022-12-21T04:43:19Z)
- Human in the loop approaches in multi-modal conversational task guidance system development [6.493148232868973]
Development of task guidance systems for aiding humans in a situated task remains a challenging problem.
We first highlight some of the challenges involved during the development of such systems.
We then provide an overview of existing datasets available and highlight their limitations.
arXiv Detail & Related papers (2022-11-03T14:05:30Z)
- Watch-And-Help: A Challenge for Social Perception and Human-AI Collaboration [116.28433607265573]
We introduce Watch-And-Help (WAH), a challenge for testing social intelligence in AI agents.
In WAH, an AI agent needs to help a human-like agent perform a complex household task efficiently.
We build VirtualHome-Social, a multi-agent household environment, and provide a benchmark including both planning and learning based baselines.
arXiv Detail & Related papers (2020-10-19T21:48:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.