Related papers: MISR: Measuring Instrumental Self-Reasoning in Frontier Models

MISR: Measuring Instrumental Self-Reasoning in Frontier Models

URL: http://arxiv.org/abs/2412.03904v1
Date: Thu, 05 Dec 2024 06:20:47 GMT
Title: MISR: Measuring Instrumental Self-Reasoning in Frontier Models
Authors: Kai Fronsdal, David Lindner,
Abstract summary: We evaluate the instrumental self-reasoning ability of large language model (LLM) agents.<n>We find that instrumental self-reasoning ability emerges only in the most capable frontier models.<n>Our evaluations can be used to measure increases in instrumental self-reasoning ability in future models.
Score: 7.414638276983446
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability could improve adaptability and enable self-modification, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has only evaluated self-reasoning in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic tasks in a wide range of scenarios, including self-modification, knowledge seeking, and opaque self-reasoning. We evaluate agents built using state-of-the-art LLMs, including commercial and open source systems. We find that instrumental self-reasoning ability emerges only in the most capable frontier models and that it is highly context-dependent. No model passes the the most difficult versions of our evaluations, hence our evaluation can be used to measure increases in instrumental self-reasoning ability in future models. We open-source our evaluations at https://github.com/kaifronsdal/Self-Reasoning-Evals.

Related papers

When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs [8.575522204707958]
Large language models (LLMs) grow in capability and autonomy, evaluating their outputs-especially in open-ended and complex tasks-has become a critical bottleneck.<n>A new paradigm is emerging: using AI agents as the evaluators themselves.<n>In this review, we define the agent-as-a-judge concept, trace its evolution from single-model judges to dynamic multi-agent debate frameworks, and critically examine their strengths and shortcomings.
arXiv Detail & Related papers (2025-08-05T01:42:25Z)
Interpretable Risk Mitigation in LLM Agent Systems [0.0]
We explore agent behaviour in a toy, game-theoretic environment based on a variation of the Iterated Prisoner's Dilemma.<n>We introduce a strategy-modification method-independent of both the game and the prompt-by steering the residual stream with interpretable features extracted from a sparse autoencoder latent space.
arXiv Detail & Related papers (2025-05-15T19:22:11Z)
Agentic Knowledgeable Self-awareness [79.25908923383776]
KnowSelf is a data-centric approach that applies agents with knowledgeable self-awareness like humans. Our experiments demonstrate that KnowSelf can outperform various strong baselines on different tasks and models with minimal use of external knowledge.
arXiv Detail & Related papers (2025-04-04T16:03:38Z)
Measuring AI agent autonomy: Towards a scalable approach with code inspection [8.344207672507334]
We introduce a code-based assessment of autonomy that eliminates the need to run an AI agent to perform specific tasks. We demonstrate this approach with the AutoGen framework and select applications.
arXiv Detail & Related papers (2025-02-21T04:58:40Z)
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models [10.449015816015566]
Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference.<n>We provide a mathematical formulation for self-improvement, which is largely governed by a quantity which we formalize as the generation-verification gap.<n>We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance.
arXiv Detail & Related papers (2024-12-03T18:47:26Z)
Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening.<n>Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training.<n>We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
arXiv Detail & Related papers (2024-12-02T20:24:17Z)
Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment. To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks [85.95607119635102]
Large language models (LLMs) can mimic human-like intelligence. WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents.
arXiv Detail & Related papers (2024-07-07T07:15:49Z)
AI Sandbagging: Language Models can Strategically Underperform on Evaluations [1.0485739694839669]
Trustlocked AI systems are crucial for ensuring the safety of AI systems. Developers of AI systems may have incentives for sandbagging evaluations. We show that capability evaluations are vulnerable to sandbagging.
arXiv Detail & Related papers (2024-06-11T15:26:57Z)
Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation [71.91287418249688]
Large language models (LLMs) often struggle with factual inaccuracies, even when they hold relevant knowledge. We leverage the self-evaluation capability of an LLM to provide training signals that steer the model towards factuality. We show that the proposed self-alignment approach substantially enhances factual accuracy over Llama family models across three key knowledge-intensive tasks.
arXiv Detail & Related papers (2024-02-14T15:52:42Z)
Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs) We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence. We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
Differential Assessment of Black-Box AI Agents [29.98710357871698]
We propose a novel approach to differentially assess black-box AI agents that have drifted from their previously known models. We leverage sparse observations of the drifted agent's current behavior and knowledge of its initial model to generate an active querying policy. Empirical evaluation shows that our approach is much more efficient than re-learning the agent model from scratch.
arXiv Detail & Related papers (2022-03-24T17:48:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.