Theory Trace Card: Theory-Driven Socio-Cognitive Evaluation of LLMs
- URL: http://arxiv.org/abs/2601.01878v1
- Date: Mon, 05 Jan 2026 08:06:50 GMT
- Title: Theory Trace Card: Theory-Driven Socio-Cognitive Evaluation of LLMs
- Authors: Farzan Karimi-Malekabadi, Suhaib Abdurahman, Zhivar Sourati, Jackson Trager, Morteza Dehghani
- Abstract summary: We argue that many socio-cognitive evaluations proceed without an explicit theoretical specification of the target capability. Without this theoretical grounding, benchmarks that exercise only narrow subsets of a capability are routinely misinterpreted as evidence of broad competence. We introduce the Theory Trace Card (TTC), a lightweight documentation artifact designed to accompany socio-cognitive evaluations.
- Score: 2.98033672654447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Socio-cognitive benchmarks for large language models (LLMs) often fail to predict real-world behavior, even when models achieve high benchmark scores. Prior work has attributed this evaluation-deployment gap to problems of measurement and validity. While these critiques are insightful, we argue that they overlook a more fundamental issue: many socio-cognitive evaluations proceed without an explicit theoretical specification of the target capability, leaving the assumptions linking task performance to competence implicit. Without this theoretical grounding, benchmarks that exercise only narrow subsets of a capability are routinely misinterpreted as evidence of broad competence: a gap that creates a systemic validity illusion by masking the failure to evaluate the capability's other essential dimensions. To address this gap, we make two contributions. First, we diagnose and formalize this theory gap as a foundational failure that undermines measurement and enables systematic overgeneralization of benchmark results. Second, we introduce the Theory Trace Card (TTC), a lightweight documentation artifact designed to accompany socio-cognitive evaluations, which explicitly outlines the theoretical basis of an evaluation, the components of the target capability it exercises, its operationalization, and its limitations. We argue that TTCs enhance the interpretability and reuse of socio-cognitive evaluations by making explicit the full validity chain, which links theory, task operationalization, scoring, and limitations, without modifying benchmarks or requiring agreement on a single theory.
Related papers
- Understanding Self-supervised Contrastive Learning through Supervised Objectives [2.0305676256390934]
We formulate self-supervised representation learning as an approximation to supervised representation learning objectives. Our derivation naturally introduces the concepts of prototype representation bias and a balanced contrastive loss. We empirically validate the effect of balancing positive and negative pair interactions.
arXiv Detail & Related papers (2025-10-12T12:43:03Z) - PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning [57.868248683256574]
PRISM-Physics is a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas. Results show that our evaluation framework is aligned with human experts' scoring.
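The summary above describes solutions as DAGs of formulas. A minimal sketch of that representation, with edges recording which earlier results each step depends on, might look as follows; the formulas and dependency structure here are illustrative, not taken from the benchmark.

```python
# Hypothetical physics solution as a DAG: each formula maps to the
# earlier formulas it depends on (kinematics example, names illustrative).
solution_dag = {
    "v = u + a*t":          [],
    "s = u*t + 0.5*a*t**2": [],
    "v**2 = u**2 + 2*a*s":  ["v = u + a*t", "s = u*t + 0.5*a*t**2"],
}

def topological_order(dag):
    """Return an evaluation order for the steps, or raise on a cyclic derivation."""
    order, done, visiting = [], set(), set()

    def visit(node):
        if node in done:
            return
        if node in visiting:
            raise ValueError("cyclic derivation")
        visiting.add(node)
        for dep in dag[node]:
            visit(dep)
        visiting.discard(node)
        done.add(node)
        order.append(node)

    for node in dag:
        visit(node)
    return order

order = topological_order(solution_dag)
# Every dependency precedes the step that uses it.
assert order.index("v = u + a*t") < order.index("v**2 = u**2 + 2*a*s")
```

A DAG rather than a linear chain lets a process-level scorer credit each step against exactly the prior results it uses, instead of only checking the final answer.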
arXiv Detail & Related papers (2025-10-03T17:09:03Z) - What Expressivity Theory Misses: Message Passing Complexity for GNNs [51.20749443004513]
We argue that higher expressivity is not necessary for most real-world tasks, as they rarely require expressivity beyond the basic WL test. We propose Message Passing Complexity (MPC), a measure that quantifies the difficulty for a GNN architecture to solve a given task through message passing. MPC captures practical limitations like over-squashing while preserving the theoretical impossibility results from expressivity theory.
arXiv Detail & Related papers (2025-09-01T08:44:49Z) - When Is Prior Knowledge Helpful? Exploring the Evaluation and Selection of Unsupervised Pretext Tasks from a Neuro-Symbolic Perspective [45.419765404078724]
We extend NeSy (neuro-symbolic) theory, which assumes reliable knowledge, to the scenario of unreliable knowledge. We propose schemes to operationalize these theoretical metrics, and thereby develop a method that can predict the effectiveness of pretext tasks in advance.
arXiv Detail & Related papers (2025-08-10T11:23:36Z) - Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective [6.963986923957048]
VAPO is a framework for reinforcement learning for large language models. It addresses challenges such as value model bias, heterogeneous sequence lengths, and sparse reward signals. This paper explores VAPO from a theoretical perspective, highlighting areas where its assumptions might be challenged.
arXiv Detail & Related papers (2025-05-23T15:03:41Z) - Measurement to Meaning: A Validity-Centered Framework for AI Evaluation [12.55408229639344]
We provide a structured approach for reasoning about the types of evaluative claims that can be made given the available evidence. Our framework is well-suited for the contemporary paradigm in machine learning.
arXiv Detail & Related papers (2025-05-13T20:36:22Z) - Computational Reasoning of Large Language Models [51.629694188014064]
We introduce Turing Machine Bench (TMBench), a benchmark to assess the ability of Large Language Models (LLMs) to execute reasoning processes. TMBench incorporates four key features: self-contained and knowledge-agnostic reasoning, a minimalistic multi-step structure, controllable difficulty, and a theoretical foundation based on Turing machines.
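The summary above describes self-contained, knowledge-agnostic, multi-step execution grounded in Turing machines. The benchmark's actual machines and tasks are not given here, but the flavor can be sketched with a minimal step simulator; the unary-increment machine below is purely illustrative.

```python
# Minimal Turing-machine step simulator (illustrative; not TMBench's format).
# transitions maps (state, symbol) -> (new_state, symbol_to_write, move L/R).
def run_tm(tape, transitions, state="q0", accept="halt", max_steps=100):
    tape, head = list(tape), 0
    for _ in range(max_steps):
        if state == accept:
            return "".join(tape)
        if head >= len(tape):
            tape.append("_")          # extend the tape with blanks on demand
        symbol = tape[head]
        state, tape[head], move = transitions[(state, symbol)]
        head += 1 if move == "R" else -1
    raise RuntimeError("step limit exceeded")

# Append one '1' to a unary string: scan right over 1s, write 1 at the blank.
transitions = {
    ("q0", "1"): ("q0", "1", "R"),
    ("q0", "_"): ("halt", "1", "R"),
}
```

Each step depends only on the explicit state and tape, which is what makes this kind of task knowledge-agnostic: no world knowledge can substitute for carrying out the steps.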
arXiv Detail & Related papers (2025-04-29T13:52:47Z) - The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning [56.574829311863446]
Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs). We demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. Our analysis uncovers a fundamental hybrid mechanism of explicit-implicit reasoning driving CoT's performance in pattern-based ICL.
arXiv Detail & Related papers (2025-04-07T13:51:06Z) - Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning [88.68573198200698]
We introduce ExploreToM, the first framework to allow large-scale generation of diverse and challenging theory of mind data. Our approach leverages an A* search over a custom domain-specific language to produce complex story structures and novel, diverse, yet plausible scenarios. Our evaluation reveals that state-of-the-art LLMs, such as Llama-3.1-70B and GPT-4o, show accuracies as low as 0% and 9% on ExploreToM-generated data.
arXiv Detail & Related papers (2024-12-12T21:29:00Z) - Kolmogorov-Arnold Networks: A Critical Assessment of Claims, Performance, and Practical Viability [5.871394981352996]
Kolmogorov-Arnold Networks (KANs) have gained significant attention as an alternative to traditional multilayer perceptrons. However, recent systematic evaluations reveal substantial discrepancies between theoretical claims and empirical evidence.
arXiv Detail & Related papers (2024-07-13T04:29:36Z) - MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs).
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z) - Advancing Counterfactual Inference through Nonlinear Quantile Regression [77.28323341329461]
We propose a framework for efficient and effective counterfactual inference implemented with neural networks.
The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data.
Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions.
arXiv Detail & Related papers (2023-06-09T08:30:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.