Evaluating Superhuman Models with Consistency Checks
- URL: http://arxiv.org/abs/2306.09983v3
- Date: Thu, 19 Oct 2023 12:41:12 GMT
- Title: Evaluating Superhuman Models with Consistency Checks
- Authors: Lukas Fluri, Daniel Paleka, Florian Tramèr
- Abstract summary: We propose a framework for evaluating superhuman models via consistency checks.
We instantiate our framework on three tasks where correctness of decisions is hard to evaluate.
- Score: 14.04919745612553
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: If machine learning models were to achieve superhuman abilities at various
reasoning or decision-making tasks, how would we go about evaluating such
models, given that humans would necessarily be poor proxies for ground truth?
In this paper, we propose a framework for evaluating superhuman models via
consistency checks. Our premise is that while the correctness of superhuman
decisions may be impossible to evaluate, we can still surface mistakes if the
model's decisions fail to satisfy certain logical, human-interpretable rules.
We instantiate our framework on three tasks where correctness of decisions is
hard to evaluate due to either superhuman model abilities, or to otherwise
missing ground truth: evaluating chess positions, forecasting future events,
and making legal judgments. We show that regardless of a model's (possibly
superhuman) performance on these tasks, we can discover logical inconsistencies
in decision making. For example: a chess engine assigning opposing valuations
to semantically identical boards; GPT-4 forecasting that sports records will
evolve non-monotonically over time; or an AI judge assigning bail to a
defendant only after we add a felony to their criminal record.
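The chess example above can be made concrete with a simple symmetry check. Below is a minimal sketch, not the paper's exact experimental setup: mirroring the board and swapping colors yields a semantically identical position, so a consistent engine should report (approximately) the same evaluation from the side to move's perspective. The engine path, search depth, and tolerance are illustrative assumptions; the sketch requires the python-chess library and a local UCI engine binary.
```python
# Minimal consistency-check sketch (illustrative, not the paper's exact setup):
# a consistent engine should give (approximately) the same side-to-move
# evaluation for a position and for its color-flipped mirror image.
import chess
import chess.engine

ENGINE_PATH = "./stockfish"   # assumed location of a UCI engine binary
DEPTH = 12                    # illustrative fixed search depth
TOLERANCE_CP = 50             # allowed evaluation gap, in centipawns

def side_to_move_eval(engine: chess.engine.SimpleEngine, board: chess.Board) -> int:
    """Centipawn evaluation from the perspective of the side to move."""
    info = engine.analyse(board, chess.engine.Limit(depth=DEPTH))
    return info["score"].relative.score(mate_score=100_000)

def check_mirror_consistency(fen: str) -> None:
    board = chess.Board(fen)
    mirrored = board.mirror()  # flip the board vertically and swap piece colors
    with chess.engine.SimpleEngine.popen_uci(ENGINE_PATH) as engine:
        original = side_to_move_eval(engine, board)
        flipped = side_to_move_eval(engine, mirrored)
    gap = abs(original - flipped)
    verdict = "consistent" if gap <= TOLERANCE_CP else "INCONSISTENT"
    print(f"{verdict}: {original} cp vs {flipped} cp (gap {gap})")

if __name__ == "__main__":
    # Any legal position works; this is the Italian Game after 3.Bc4.
    check_mirror_consistency(
        "r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5N2/PPPP1PPP/RNBQK2R b KQkq - 3 3"
    )
```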
Related papers
- Can Language Models Learn to Skip Steps? [59.84848399905409]
We study the ability to skip steps in reasoning.
Unlike humans, who may skip steps to enhance efficiency or to reduce cognitive load, models do not possess such motivations.
Our work presents the first exploration into human-like step-skipping ability.
arXiv Detail & Related papers (2024-11-04T07:10:24Z)
- On scalable oversight with weak LLMs judging strong LLMs [67.8628575615614]
We study debate, where two AIs compete to convince a judge, and consultancy, where a single AI tries to convince a judge who asks questions.
We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models.
arXiv Detail & Related papers (2024-07-05T16:29:15Z)
- Hacking a surrogate model approach to XAI [49.1574468325115]
We show that even if a discriminated subgroup does not get a positive decision from the black-box ADM system, the corresponding question of group membership can be pushed down to an arbitrarily low level.
Our approach can be generalized easily to other surrogate models.
arXiv Detail & Related papers (2024-06-24T13:18:02Z)
- Modeling Boundedly Rational Agents with Latent Inference Budgets [56.24971011281947]
We introduce a latent inference budget model (L-IBM) that models agents' computational constraints explicitly.
L-IBMs make it possible to learn agent models using data from diverse populations of suboptimal actors.
We show that L-IBMs match or outperform Boltzmann models of decision-making under uncertainty.
arXiv Detail & Related papers (2023-12-07T03:55:51Z)
- Designing Closed-Loop Models for Task Allocation [36.04165658325371]
We exploit weak prior information on human-task similarity to bootstrap model training.
We show that the use of such a weak prior can improve task allocation accuracy, even when human decision-makers are fallible and biased.
arXiv Detail & Related papers (2023-05-31T13:57:56Z)
- Despite "super-human" performance, current LLMs are unsuited for decisions about ethics and safety [0.0]
We provide a simple new prompting strategy that leads to yet another supposedly "super-human" result.
We find that relying on average performance to judge capabilities can be highly misleading.
We also observe signs of inverse scaling with model size on some examples, and show that prompting models to "explain their reasoning" often leads to alarming justifications of unethical actions.
arXiv Detail & Related papers (2022-12-13T00:29:45Z)
- On the Sensitivity of Reward Inference to Misspecified Human Models [27.94055657571769]
Inferring reward functions from human behavior is at the center of value alignment: aligning AI objectives with what we, humans, actually want.
This raises the question of how accurate these models of human behavior need to be for the reward inference to be accurate.
We show that it is unfortunately possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward.
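As a toy illustration of this sensitivity (a sketch under simplifying assumptions, not the paper's construction): under an assumed Boltzmann-rational human model, the inferred reward gap between two options is the log-odds of the observed choice frequency divided by the assumed rationality parameter, so a behavioral quirk that only matters where the human is nearly indifferent can drive the inferred gap toward infinity.
```python
# Toy illustration (not the paper's construction): with an assumed
# Boltzmann-rational human model, the maximum-likelihood reward gap is
#   gap_hat = logit(P(choose option 1)) / beta.
# A tie-breaking habit that only fires where the human is nearly indifferent
# costs the human almost no true reward, yet it pushes the observed choice
# frequency toward 1 and the inferred reward gap toward infinity.
import math

BETA = 1.0              # assumed rationality (inverse temperature)
TRUE_REWARD_GAP = 0.01  # the human is almost indifferent between the options

def inferred_gap(choice_freq: float, beta: float = BETA) -> float:
    """Maximum-likelihood reward gap under the Boltzmann choice model."""
    return math.log(choice_freq / (1.0 - choice_freq)) / beta

# Choice frequency of a perfectly Boltzmann-rational human (about 0.5025):
boltzmann_freq = 1.0 / (1.0 + math.exp(-BETA * TRUE_REWARD_GAP))

# Increasingly deterministic tie-breaking toward option 1:
for observed_freq in (boltzmann_freq, 0.99, 0.999, 0.999999):
    print(f"P(option 1) = {observed_freq:.6f} -> "
          f"inferred gap = {inferred_gap(observed_freq):.2f} (true gap 0.01)")
```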
arXiv Detail & Related papers (2022-12-09T08:16:20Z)
- When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment [96.77970239683475]
AI systems need to be able to understand, interpret and predict human moral judgments and decisions.
A central challenge for AI safety is capturing the flexibility of the human moral mind.
We present a novel challenge set consisting of rule-breaking question-answering scenarios.
arXiv Detail & Related papers (2022-10-04T09:04:27Z)
- Humanly Certifying Superhuman Classifiers [8.736864280782592]
Estimating the performance of a machine learning system is a longstanding challenge in artificial intelligence research.
We develop a theory for estimating the accuracy compared to the oracle, using only imperfect human annotations for reference.
Our analysis provides a simple recipe for detecting and certifying superhuman performance in this setting.
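A toy numerical sketch of the underlying idea (under a binary-label, independent symmetric-noise assumption; a simplification, not the paper's actual recipe): inter-annotator agreement pins down the annotators' accuracy against the unseen oracle, and the classifier-annotator agreement then yields the classifier's oracle accuracy, which can exceed the humans' own.
```python
# Toy sketch (binary labels, independent symmetric annotator noise; a
# simplification, not the paper's recipe): estimate oracle accuracies from
# agreement rates alone, without ever observing the oracle labels.
import math
import random

random.seed(0)
N = 200_000
HUMAN_ACC = 0.85   # each annotator matches the oracle with this probability
MODEL_ACC = 0.95   # the classifier is (secretly) more accurate than the humans

def noisy_copy(labels, acc):
    """Flip each label with probability 1 - acc."""
    return [y if random.random() < acc else not y for y in labels]

def agreement(xs, ys):
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)

oracle = [random.random() < 0.5 for _ in range(N)]
human_a, human_b = noisy_copy(oracle, HUMAN_ACC), noisy_copy(oracle, HUMAN_ACC)
model = noisy_copy(oracle, MODEL_ACC)

a_hh = agreement(human_a, human_b)  # = p^2 + (1 - p)^2 in expectation
a_mh = agreement(model, human_a)    # = p*q + (1 - p)*(1 - q) in expectation

# Invert the agreement formulas (taking the accuracy-above-chance root):
p_hat = (1 + math.sqrt(2 * a_hh - 1)) / 2        # estimated human accuracy
q_hat = (a_mh - (1 - p_hat)) / (2 * p_hat - 1)   # estimated model accuracy

print(f"estimated human accuracy vs oracle: {p_hat:.3f}")
print(f"estimated model accuracy vs oracle: {q_hat:.3f}")
print("certified superhuman" if q_hat > p_hat else "not certified superhuman")
```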
arXiv Detail & Related papers (2021-09-16T11:00:05Z)
- Modeling the Mistakes of Boundedly Rational Agents Within a Bayesian Theory of Mind [32.66203057545608]
We extend the Bayesian Theory of Mind framework to model boundedly rational agents who may have mistaken goals, plans, and actions.
We present experiments eliciting human goal inferences in two domains: (i) a gridworld puzzle with gems locked behind doors, and (ii) a block-stacking domain.
arXiv Detail & Related papers (2021-06-24T18:00:03Z)
- Why do you think that? Exploring Faithful Sentence-Level Rationales Without Supervision [60.62434362997016]
We propose a differentiable training framework to create models that output faithful rationales at the sentence level.
Our model solves the task based on each rationale individually and learns to assign high scores to the rationales that solve the task best.
arXiv Detail & Related papers (2020-10-07T12:54:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.