On the Measure of a Model: From Intelligence to Generality
- URL: http://arxiv.org/abs/2511.11773v1
- Date: Fri, 14 Nov 2025 09:46:48 GMT
- Title: On the Measure of a Model: From Intelligence to Generality
- Authors: Ruchira Dhar, Ninell Oldenburg, Anders Soegaard
- Abstract summary: Benchmarks such as ARC, Raven-inspired tests, and the Blackbird Task are widely used to evaluate the intelligence of large language models (LLMs). Yet the concept of intelligence remains elusive, lacking a stable definition and failing to predict performance on practical tasks such as question answering, summarization, or coding. Our perspective is that evaluation should be grounded in generality rather than abstract notions of intelligence.
- Score: 0.7561750463371523
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Benchmarks such as ARC, Raven-inspired tests, and the Blackbird Task are widely used to evaluate the intelligence of large language models (LLMs). Yet the concept of intelligence remains elusive, lacking a stable definition and failing to predict performance on practical tasks such as question answering, summarization, or coding. Optimizing for such benchmarks risks misaligning evaluation with real-world utility. Our perspective is that evaluation should be grounded in generality rather than abstract notions of intelligence. We identify three assumptions that often underpin intelligence-focused evaluation: generality, stability, and realism. Through conceptual and formal analysis, we show that only generality withstands conceptual and empirical scrutiny. Intelligence is not what enables generality; generality is best understood as a multitask learning problem that directly links evaluation to measurable performance breadth and reliability. This perspective reframes how progress in AI should be assessed and proposes generality as a more stable foundation for evaluating capability across diverse and evolving tasks.
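To make the abstract's framing concrete, here is a minimal sketch of a generality-first aggregate over a task suite, assuming breadth is mean performance and reliability penalizes uneven performance; the task names, the reliability term, and the equal weighting are illustrative assumptions, not the paper's formal definition.

```python
from statistics import mean, pstdev

def generality_score(task_scores: dict[str, float]) -> dict[str, float]:
    """Aggregate per-task scores into breadth and reliability terms.

    task_scores maps task name -> score in [0, 1].
    Breadth: mean performance across the suite.
    Reliability: 1 - population std dev, penalizing uneven profiles.
    Both terms and the equal weighting are illustrative assumptions.
    """
    scores = list(task_scores.values())
    breadth = mean(scores)
    reliability = 1.0 - pstdev(scores)
    return {
        "breadth": breadth,
        "reliability": reliability,
        "generality": 0.5 * breadth + 0.5 * reliability,  # assumed weights
    }

# Example: strong on average, but uneven across tasks.
print(generality_score({"qa": 0.85, "summarization": 0.80, "coding": 0.40}))
```

Under this framing, two models with the same average score can differ sharply in generality if one concentrates its performance in a few tasks.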
Related papers
- The Artificial Intelligence Cognitive Examination: A Survey on the Evolution of Multimodal Evaluation from Recognition to Reasoning [0.0]
We argue that the field is undergoing a paradigm shift, moving from simple recognition tasks to complex reasoning benchmarks. We chart the journey from the foundational "knowledge tests" of the ImageNet era to the "applied logic and comprehension" exams. We explore the uncharted territories of evaluating abstract, creative, and social intelligence.
arXiv Detail & Related papers (2025-10-05T10:41:22Z)
- Beyond Statistical Learning: Exact Learning Is Essential for General Intelligence [59.07578850674114]
Sound deductive reasoning is an indisputably desirable aspect of general intelligence. It is well documented that even the most advanced frontier systems regularly and consistently falter on easily solvable reasoning tasks. We argue that their unsound behavior is a consequence of the statistical learning approach powering their development.
arXiv Detail & Related papers (2025-06-30T14:37:50Z)
- AGITB: A Signal-Level Benchmark for Evaluating Artificial General Intelligence [0.0]
The Artificial General Intelligence Testbed (AGITB) introduces a novel benchmarking suite comprising fourteen elementary tests. AGITB evaluates models on their ability to forecast the next input in a temporal sequence, step by step, without pretraining (a sketch of such a harness follows this entry). The human cortex satisfies all tests, whereas no current AI system meets the full AGITB criteria.
arXiv Detail & Related papers (2025-04-06T10:01:15Z)
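An illustrative harness for the kind of signal-level test the AGITB entry describes: the model predicts the next input in a temporal sequence, step by step, learning only during the test with no pretraining phase. The `predict`/`observe` interface, the binary signal, and the accuracy scoring are hypothetical stand-ins, not AGITB's actual tests or API.

```python
from typing import Protocol

class OnlineModel(Protocol):
    def predict(self) -> int: ...               # forecast the next input
    def observe(self, value: int) -> None: ...  # then see the true input

def run_forecasting_test(model: OnlineModel, signal: list[int]) -> float:
    """Score step-by-step next-input prediction with no pretraining phase."""
    correct = 0
    for value in signal:
        if model.predict() == value:
            correct += 1
        model.observe(value)  # all learning happens during the test
    return correct / len(signal)

class RepeatLast:
    """Trivial baseline: predict whatever input was seen last."""
    def __init__(self) -> None:
        self.last = 0
    def predict(self) -> int:
        return self.last
    def observe(self, value: int) -> None:
        self.last = value

# A periodic binary signal; a stronger online learner should beat this baseline.
print(run_forecasting_test(RepeatLast(), [0, 0, 1, 1, 0, 0, 1, 1]))  # 0.625
```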
- General Scales Unlock AI Evaluation with Explanatory and Predictive Power [57.7995945974989]
Benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems. We introduce general scales for AI evaluation that can explain what common AI benchmarks really measure. Our fully automated methodology builds on 18 newly crafted rubrics that place instance demands on general scales that do not saturate.
arXiv Detail & Related papers (2025-03-09T01:13:56Z)
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps (a step-scoring sketch follows this entry). We show that ReasonEval consistently outperforms baseline methods on the meta-evaluation datasets. We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
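A hedged sketch of step-level reasoning evaluation in the spirit of the ReasonEval entry: each step receives validity and redundancy judgments, and the chain-level score is dominated by the weakest step. The `judge` callback, the min/mean aggregation, and the dummy judge are assumptions for illustration, not ReasonEval's published method.

```python
from typing import Callable

# A judge returns (validity, redundancy) in [0, 1] for one reasoning step,
# given the problem and the preceding steps. Hypothetical interface.
StepJudge = Callable[[str, list[str], str], tuple[float, float]]

def score_solution(problem: str, steps: list[str], judge: StepJudge) -> dict:
    """Score a reasoning chain beyond final-answer accuracy."""
    validities, redundancies = [], []
    for i, step in enumerate(steps):
        validity, redundancy = judge(problem, steps[:i], step)
        validities.append(validity)
        redundancies.append(redundancy)
    return {
        # min(): one invalid step sinks the whole chain (assumed choice).
        "validity": min(validities, default=0.0),
        "redundancy": sum(redundancies) / max(len(redundancies), 1),
    }

# Dummy judge for demonstration: flags empty steps as invalid.
def dummy_judge(problem: str, prior: list[str], step: str) -> tuple[float, float]:
    return (1.0 if step.strip() else 0.0, 0.0)

print(score_solution("2 + 2 = ?", ["2 + 2 = 4", ""], dummy_judge))
```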
- Evaluating and Improving Continual Learning in Spoken Language Understanding [58.723320551761525]
We propose an evaluation methodology that provides a unified evaluation of stability, plasticity, and generalizability in continual learning (a metric sketch follows this entry).
By employing the proposed metric, we demonstrate how introducing various knowledge distillations can improve different aspects of these three properties of the SLU model.
arXiv Detail & Related papers (2024-02-16T03:30:27Z)
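The stability/plasticity language in this entry can be made concrete with the standard accuracy-matrix view of sequential training. The definitions below (diagonal accuracy for plasticity, backward transfer for stability) are common continual-learning metrics used here as an assumed stand-in, not the paper's exact unified metric.

```python
def continual_learning_metrics(acc: list[list[float]]) -> dict[str, float]:
    """Summarize an accuracy matrix from sequential task training.

    acc[i][j] = accuracy on task j after finishing training on task i.
    Plasticity: average accuracy on each task right after learning it.
    Stability: backward transfer, i.e. how much earlier tasks drift
    between first learning them and the end of training.
    """
    T = len(acc)
    plasticity = sum(acc[j][j] for j in range(T)) / T
    stability = sum(acc[T - 1][j] - acc[j][j] for j in range(T - 1)) / (T - 1)
    return {"plasticity": plasticity, "stability_bwt": stability}

# Three tasks learned in sequence; earlier tasks degrade slightly.
acc = [
    [0.90, 0.00, 0.00],
    [0.85, 0.88, 0.00],
    [0.80, 0.84, 0.91],
]
print(continual_learning_metrics(acc))  # negative BWT means forgetting
```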
- Integration of cognitive tasks into artificial general intelligence test for large models [54.72053150920186]
We advocate for a comprehensive framework of cognitive science-inspired artificial general intelligence (AGI) tests.
The cognitive science-inspired AGI tests encompass the full spectrum of intelligence facets, including crystallized intelligence, fluid intelligence, social intelligence, and embodied intelligence.
arXiv Detail & Related papers (2024-02-04T15:50:42Z)
- Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models [83.63242931107638]
We propose four characteristics of generally intelligent agents.
We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations.
We conclude by outlining promising future research directions in the field of artificial general intelligence.
arXiv Detail & Related papers (2023-07-07T13:58:16Z)
- Beyond Interpretable Benchmarks: Contextual Learning through Cognitive and Multimodal Perception [0.0]
This study contends that the Turing Test is misinterpreted as an attempt to anthropomorphize computer systems.
It emphasizes tacit learning as a cornerstone of general-purpose intelligence, despite its lack of overt interpretability.
arXiv Detail & Related papers (2022-12-04T08:30:04Z)
- What's a Good Prediction? Challenges in evaluating an agent's knowledge [0.9281671380673306]
We show the conflict between accuracy and usefulness of general knowledge.
We propose an alternate evaluation approach that arises naturally in the online continual learning setting.
This paper contributes a first look at evaluating predictions through their use.
arXiv Detail & Related papers (2020-01-23T21:44:43Z)