The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution
- URL: http://arxiv.org/abs/2508.12277v1
- Date: Sun, 17 Aug 2025 07:57:58 GMT
- Title: The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution
- Authors: Elon Ezra, Ariel Weizman, Amos Azaria
- Abstract summary: Large language models (LLMs) are commonly evaluated on tasks that test their knowledge or reasoning abilities. We introduce the Self-Execution Benchmark, which measures a model's ability to anticipate properties of its output. Our experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance.
- Score: 13.62116438805314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are commonly evaluated on tasks that test their knowledge or reasoning abilities. In this paper, we explore a different type of evaluation: whether an LLM can predict aspects of its own responses. Since LLMs lack the ability to execute themselves, we introduce the Self-Execution Benchmark, which measures a model's ability to anticipate properties of its output, such as whether a question will be difficult for it, whether it will refuse to answer, or what kinds of associations it is likely to produce. Our experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance. These results suggest a fundamental limitation in how LLMs represent and reason about their own behavior.
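To make the evaluation idea concrete, here is a minimal sketch of one self-prediction trial in the spirit of the benchmark: the model is first asked to predict whether it will refuse a given question, and that prediction is then scored against its actual behaviour. The prompts, the toy `query_model` stub, and the refusal heuristic are illustrative assumptions, not the paper's released protocol or code.

```python
# Minimal sketch of one self-prediction trial, in the spirit of the
# Self-Execution Benchmark. The prompts, the toy query_model() stub, and the
# refusal heuristic are illustrative assumptions, not the paper's protocol.

def query_model(prompt: str) -> str:
    """Stand-in for the LLM under test (e.g. a chat-completion API call).
    This toy stub refuses lock-picking requests and answers everything else,
    purely so the sketch runs end to end; replace it with a real client."""
    asking_for_prediction = "would you refuse" in prompt.lower()
    if "lock" in prompt.lower():
        return "YES" if asking_for_prediction else "I can't help with that."
    return "NO" if asking_for_prediction else "17 * 23 = 391."


def looks_like_refusal(answer: str) -> bool:
    """Crude heuristic: does the answer read like a refusal?"""
    markers = ("i can't", "i cannot", "i'm unable", "i won't")
    return any(m in answer.lower() for m in markers)


def self_prediction_trial(question: str) -> bool:
    """Return True if the model correctly predicts whether it will refuse."""
    # Step 1: ask the model to anticipate a property of its own answer.
    prediction = query_model(
        "If you were asked the question below, would you refuse to answer it? "
        f"Reply with exactly YES or NO.\n\nQuestion: {question}"
    ).strip().upper().startswith("YES")

    # Step 2: actually ask the question in a fresh context.
    answer = query_model(question)

    # Step 3: score the prediction against the observed behaviour.
    return prediction == looks_like_refusal(answer)


if __name__ == "__main__":
    questions = [
        "What is 17 * 23?",
        "Please write detailed instructions for picking a lock.",
    ]
    correct = sum(self_prediction_trial(q) for q in questions)
    print(f"self-prediction accuracy: {correct}/{len(questions)}")
```

A full benchmark run would aggregate many such trials and extend the same pattern to other self-predicted properties, such as anticipated difficulty or the associations a question is likely to elicit.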
Related papers
- Quantifying construct validity in large language model evaluations [0.0]
The LLM community often reports benchmark results as if they were synonymous with general model capabilities. However, benchmarks can have problems that distort performance, such as test-set contamination and annotator error. How can we know that a benchmark is a reliable indicator of a capability we want to measure?
arXiv Detail & Related papers (2026-02-17T12:15:57Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the large language model (LLM) era is the generalization issue. We propose the Model Utilization Index (MUI), a metric enhanced with mechanism interpretability that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - Towards Reasoning Ability of Small Language Models [7.12809444398765]
This paper introduces ThinkSLM, the first benchmark to systematically evaluate and study the reasoning abilities of SLMs. We present a study evaluating 72 diverse SLMs from six major model families across 17 reasoning benchmarks. Our findings challenge the assumption that scaling is the only way to achieve strong reasoning.
arXiv Detail & Related papers (2025-02-17T08:59:16Z) - CARL-GT: Evaluating Causal Reasoning Capabilities of Large Language Models [18.975064947089805]
Causal reasoning capabilities are essential for large language models (LLMs) in a wide range of applications, such as education and healthcare. We provide a benchmark, named CARL-GT, which evaluates CAusal Reasoning capabilities of large Language models using Graphs and Tabular data.
arXiv Detail & Related papers (2024-12-23T20:34:32Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses [49.148206387394936]
We show that models are not reliably better at discriminating among previously-generated alternatives than generating initial responses.
This finding challenges the notion that LLMs can enhance their performance solely through their own judgment.
arXiv Detail & Related papers (2024-04-04T20:27:37Z) - LLMs May Perform MCQA by Selecting the Least Incorrect Option [29.202758753639078]
Large Language Models (LLMs) have markedly enhanced performance across a variety of tasks. The adoption of Multiple Choice Question Answering (MCQA) as a benchmark for assessing LLMs has gained considerable traction. However, concerns regarding the robustness of this evaluative method persist.
arXiv Detail & Related papers (2024-02-02T12:07:00Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games [14.063311955315077]
Large language models (LLMs) are effective at answering questions that are clearly asked.
When faced with ambiguous queries, however, they can act unpredictably and produce incorrect outputs.
This underscores the need for the development of intelligent agents capable of asking clarification questions to resolve ambiguities effectively.
arXiv Detail & Related papers (2023-10-02T16:55:37Z)