Evaluating Superhuman Models with Consistency Checks
- URL: http://arxiv.org/abs/2306.09983v3
- Date: Thu, 19 Oct 2023 12:41:12 GMT
- Title: Evaluating Superhuman Models with Consistency Checks
- Authors: Lukas Fluri, Daniel Paleka, Florian Tramèr
- Abstract summary: We propose a framework for evaluating superhuman models via consistency checks.
We instantiate our framework on three tasks where correctness of decisions is hard to evaluate.
- Score: 14.04919745612553
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: If machine learning models were to achieve superhuman abilities at various
reasoning or decision-making tasks, how would we go about evaluating such
models, given that humans would necessarily be poor proxies for ground truth?
In this paper, we propose a framework for evaluating superhuman models via
consistency checks. Our premise is that while the correctness of superhuman
decisions may be impossible to evaluate, we can still surface mistakes if the
model's decisions fail to satisfy certain logical, human-interpretable rules.
We instantiate our framework on three tasks where correctness of decisions is
hard to evaluate due to either superhuman model abilities, or to otherwise
missing ground truth: evaluating chess positions, forecasting future events,
and making legal judgments. We show that regardless of a model's (possibly
superhuman) performance on these tasks, we can discover logical inconsistencies
in decision making. For example: a chess engine assigning opposing valuations
to semantically identical boards; GPT-4 forecasting that sports records will
evolve non-monotonically over time; or an AI judge assigning bail to a
defendant only after we add a felony to their criminal record.
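The chess example above can be made concrete with a simple symmetry check. Below is a minimal sketch, not the paper's exact experimental setup: mirroring the board and swapping colors yields a semantically identical position, so a consistent engine should report (approximately) the same evaluation from the side to move's perspective. The engine path, search depth, and tolerance are illustrative assumptions; the sketch requires the python-chess library and a local UCI engine binary.
```python
# Minimal consistency-check sketch (illustrative, not the paper's exact setup):
# a consistent engine should give (approximately) the same side-to-move
# evaluation for a position and for its color-flipped mirror image.
import chess
import chess.engine

ENGINE_PATH = "./stockfish"   # assumed location of a UCI engine binary
DEPTH = 12                    # illustrative fixed search depth
TOLERANCE_CP = 50             # allowed evaluation gap, in centipawns

def side_to_move_eval(engine: chess.engine.SimpleEngine, board: chess.Board) -> int:
    """Centipawn evaluation from the perspective of the side to move."""
    info = engine.analyse(board, chess.engine.Limit(depth=DEPTH))
    return info["score"].relative.score(mate_score=100_000)

def check_mirror_consistency(fen: str) -> None:
    board = chess.Board(fen)
    mirrored = board.mirror()  # flip the board vertically and swap piece colors
    with chess.engine.SimpleEngine.popen_uci(ENGINE_PATH) as engine:
        original = side_to_move_eval(engine, board)
        flipped = side_to_move_eval(engine, mirrored)
    gap = abs(original - flipped)
    verdict = "consistent" if gap <= TOLERANCE_CP else "INCONSISTENT"
    print(f"{verdict}: {original} cp vs {flipped} cp (gap {gap})")

if __name__ == "__main__":
    # Any legal position works; this is the Italian Game after 3.Bc4.
    check_mirror_consistency(
        "r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5N2/PPPP1PPP/RNBQK2R b KQkq - 3 3"
    )
```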
Related papers
- Can Language Models Learn to Skip Steps? [59.84848399905409]
We study the ability to skip steps in reasoning.
Unlike humans, who may skip steps to enhance efficiency or to reduce cognitive load, models do not possess such motivations.
Our work presents the first exploration into human-like step-skipping ability.
arXiv Detail & Related papers (2024-11-04T07:10:24Z)
- On scalable oversight with weak LLMs judging strong LLMs [67.8628575615614]
We study debate, where two AIs compete to convince a judge, and consultancy, where a single AI tries to convince a judge who asks questions.
We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models.
arXiv Detail & Related papers (2024-07-05T16:29:15Z)
- Hacking a surrogate model approach to XAI [49.1574468325115]
We show that even if a discriminated subgroup does not get a positive decision from the black-box ADM system, the corresponding question of group membership can be pushed down to an arbitrarily low level.
Our approach can be generalized easily to other surrogate models.
arXiv Detail & Related papers (2024-06-24T13:18:02Z)
- Modeling Boundedly Rational Agents with Latent Inference Budgets [56.24971011281947]
We introduce a latent inference budget model (L-IBM) that models agents' computational constraints explicitly.
L-IBMs make it possible to learn agent models using data from diverse populations of suboptimal actors.
We show that L-IBMs match or outperform Boltzmann models of decision-making under uncertainty.
arXiv Detail & Related papers (2023-12-07T03:55:51Z)
- Designing Closed-Loop Models for Task Allocation [36.04165658325371]
We exploit weak prior information on human-task similarity to bootstrap model training.
We show that the use of such a weak prior can improve task allocation accuracy, even when human decision-makers are fallible and biased.
arXiv Detail & Related papers (2023-05-31T13:57:56Z)
- Despite "super-human" performance, current LLMs are unsuited for decisions about ethics and safety [0.0]
We provide a simple new prompting strategy that leads to yet another supposedly "super-human" result.
We find that relying on average performance to judge capabilities can be highly misleading.
We also observe signs of inverse scaling with model size on some examples, and show that prompting models to "explain their reasoning" often leads to alarming justifications of unethical actions.
arXiv Detail & Related papers (2022-12-13T00:29:45Z)
- On the Sensitivity of Reward Inference to Misspecified Human Models [27.94055657571769]
Inferring reward functions from human behavior is at the center of value alignment: aligning AI objectives with what we, humans, actually want.
This raises the question of how accurate these models of human behavior need to be for the reward inference to be accurate.
We show that it is unfortunately possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward.
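As a toy illustration of this sensitivity (a sketch under simplifying assumptions, not the paper's construction): under an assumed Boltzmann-rational human model, the inferred reward gap between two options is the log-odds of the observed choice frequency divided by the assumed rationality parameter, so a behavioral quirk that only matters where the human is nearly indifferent can drive the inferred gap toward infinity.
```python
# Toy illustration (not the paper's construction): with an assumed
# Boltzmann-rational human model, the maximum-likelihood reward gap is
#   gap_hat = logit(P(choose option 1)) / beta.
# A tie-breaking habit that only fires where the human is nearly indifferent
# costs the human almost no true reward, yet it pushes the observed choice
# frequency toward 1 and the inferred reward gap toward infinity.
import math

BETA = 1.0              # assumed rationality (inverse temperature)
TRUE_REWARD_GAP = 0.01  # the human is almost indifferent between the options

def inferred_gap(choice_freq: float, beta: float = BETA) -> float:
    """Maximum-likelihood reward gap under the Boltzmann choice model."""
    return math.log(choice_freq / (1.0 - choice_freq)) / beta

# Choice frequency of a perfectly Boltzmann-rational human (about 0.5025):
boltzmann_freq = 1.0 / (1.0 + math.exp(-BETA * TRUE_REWARD_GAP))

# Increasingly deterministic tie-breaking toward option 1:
for observed_freq in (boltzmann_freq, 0.99, 0.999, 0.999999):
    print(f"P(option 1) = {observed_freq:.6f} -> "
          f"inferred gap = {inferred_gap(observed_freq):.2f} (true gap 0.01)")
```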
arXiv Detail & Related papers (2022-12-09T08:16:20Z)
- When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment [96.77970239683475]
AI systems need to be able to understand, interpret and predict human moral judgments and decisions.
A central challenge for AI safety is capturing the flexibility of the human moral mind.
We present a novel challenge set consisting of rule-breaking question-answering scenarios.
arXiv Detail & Related papers (2022-10-04T09:04:27Z)
- Humanly Certifying Superhuman Classifiers [8.736864280782592]
Estimating the performance of a machine learning system is a longstanding challenge in artificial intelligence research.
We develop a theory for estimating the accuracy compared to the oracle, using only imperfect human annotations for reference.
Our analysis provides a simple recipe for detecting and certifying superhuman performance in this setting.
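A toy numerical sketch of the underlying idea (under a binary-label, independent symmetric-noise assumption; a simplification, not the paper's actual recipe): inter-annotator agreement pins down the annotators' accuracy against the unseen oracle, and the classifier-annotator agreement then yields the classifier's oracle accuracy, which can exceed the humans' own.
```python
# Toy sketch (binary labels, independent symmetric annotator noise; a
# simplification, not the paper's recipe): estimate oracle accuracies from
# agreement rates alone, without ever observing the oracle labels.
import math
import random

random.seed(0)
N = 200_000
HUMAN_ACC = 0.85   # each annotator matches the oracle with this probability
MODEL_ACC = 0.95   # the classifier is (secretly) more accurate than the humans

def noisy_copy(labels, acc):
    """Flip each label with probability 1 - acc."""
    return [y if random.random() < acc else not y for y in labels]

def agreement(xs, ys):
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)

oracle = [random.random() < 0.5 for _ in range(N)]
human_a, human_b = noisy_copy(oracle, HUMAN_ACC), noisy_copy(oracle, HUMAN_ACC)
model = noisy_copy(oracle, MODEL_ACC)

a_hh = agreement(human_a, human_b)  # = p^2 + (1 - p)^2 in expectation
a_mh = agreement(model, human_a)    # = p*q + (1 - p)*(1 - q) in expectation

# Invert the agreement formulas (taking the accuracy-above-chance root):
p_hat = (1 + math.sqrt(2 * a_hh - 1)) / 2        # estimated human accuracy
q_hat = (a_mh - (1 - p_hat)) / (2 * p_hat - 1)   # estimated model accuracy

print(f"estimated human accuracy vs oracle: {p_hat:.3f}")
print(f"estimated model accuracy vs oracle: {q_hat:.3f}")
print("certified superhuman" if q_hat > p_hat else "not certified superhuman")
```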
arXiv Detail & Related papers (2021-09-16T11:00:05Z)
- Modeling the Mistakes of Boundedly Rational Agents Within a Bayesian Theory of Mind [32.66203057545608]
We extend the Bayesian Theory of Mind framework to model boundedly rational agents who may have mistaken goals, plans, and actions.
We present experiments eliciting human goal inferences in two domains: (i) a gridworld puzzle with gems locked behind doors, and (ii) a block-stacking domain.
arXiv Detail & Related papers (2021-06-24T18:00:03Z)
- Why do you think that? Exploring Faithful Sentence-Level Rationales Without Supervision [60.62434362997016]
We propose a differentiable training framework to create models that output faithful rationales at the sentence level.
Our model solves the task based on each rationale individually and learns to assign high scores to the rationales that solve the task best.
arXiv Detail & Related papers (2020-10-07T12:54:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.