Related papers: CogToM: A Comprehensive Theory of Mind Benchmark inspired by Human Cognition for Large Language Models

CogToM: A Comprehensive Theory of Mind Benchmark inspired by Human Cognition for Large Language Models

URL: http://arxiv.org/abs/2601.15628v1
Date: Thu, 22 Jan 2026 03:59:19 GMT
Title: CogToM: A Comprehensive Theory of Mind Benchmark inspired by Human Cognition for Large Language Models
Authors: Haibo Tong, Zeyang Yue, Feifei Zhao, Erliang Lin, Lu Jia, Ruolin Chen, Yinqian Sun, Qian Zhang, Yi Zeng,
Abstract summary: We introduce CogToM, a comprehensive, theoretically grounded benchmark comprising over 8000 bilingual instances across 46 paradigms.<n>A systematic evaluation of 22 representative models, including frontier models like GPT-5.1 and Qwen3-Max, reveals significant performance heterogeneities and highlights persistent bottlenecks in specific dimensions.<n>CogToM offers a robust instrument and perspective for investigating the evolving cognitive boundaries of Large Language Models.
Score: 8.120889327955032
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Whether Large Language Models (LLMs) truly possess human-like Theory of Mind (ToM) capabilities has garnered increasing attention. However, existing benchmarks remain largely restricted to narrow paradigms like false belief tasks, failing to capture the full spectrum of human cognitive mechanisms. We introduce CogToM, a comprehensive, theoretically grounded benchmark comprising over 8000 bilingual instances across 46 paradigms, validated by 49 human annotator.A systematic evaluation of 22 representative models, including frontier models like GPT-5.1 and Qwen3-Max, reveals significant performance heterogeneities and highlights persistent bottlenecks in specific dimensions. Further analysis based on human cognitive patterns suggests potential divergences between LLM and human cognitive structures. CogToM offers a robust instrument and perspective for investigating the evolving cognitive boundaries of LLMs.

Related papers

UniCog: Uncovering Cognitive Abilities of LLMs through Latent Mind Space Analysis [69.50752734049985]
A growing body of research suggests that the cognitive processes of large language models (LLMs) differ fundamentally from those of humans.<n>We propose UniCog, a unified framework that analyzes LLM cognition via a latent mind space.
arXiv Detail & Related papers (2026-01-25T16:19:00Z)
HUMANLLM: Benchmarking and Reinforcing LLM Anthropomorphism via Human Cognitive Patterns [59.17423586203706]
We present HUMANLLM, a framework treating psychological patterns as interacting causal forces.<n>We construct 244 patterns from 12,000 academic papers and synthesize 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other.<n>Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment.
arXiv Detail & Related papers (2026-01-15T08:56:53Z)
Cognitive Foundations for Reasoning and Their Manifestation in LLMs [63.12951576410617]
Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning.<n>We synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning reasoning invariants, meta-cognitive controls, representations for organizing reasoning & knowledge, and transformation operations.<n>We develop test-time reasoning guidance that automatically scaffold successful structures, improving performance by up to 66.7% on complex problems.
arXiv Detail & Related papers (2025-11-20T18:59:00Z)
MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity [28.797461492275488]
MME-CC is a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information.<n>Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs.<n>We identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions.
arXiv Detail & Related papers (2025-11-05T03:09:16Z)
Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models [118.44328586173556]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks.<n>Human-MME is a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding.<n>Our benchmark extends the single-target understanding to the multi-person and multi-image mutual understanding.
arXiv Detail & Related papers (2025-09-30T12:20:57Z)
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories.<n>These types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives.<n>Most models score below 30% accuracy-only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z)
Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment.<n>We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from GPT, Gemini, Claude, LLaMA, Qwen, and SEED families.<n>The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z)
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models [5.02953506943752]
MM-IQ is a comprehensive evaluation framework that comprises a large-scale training set with 4,776 visual reasoning problems and 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms.<n>Our benchmark reveals striking limitations: even state-of-the-art architectures achieve only marginally superior performance to random chance.<n>Inspired by the recent surge of large reasoning models, we also release a multimodal reasoning model as the baseline that is trained via reinforcement learning with verifiable reward functions.
arXiv Detail & Related papers (2025-02-02T07:12:03Z)
Core Knowledge Deficits in Multi-Modal Language Models [41.422258645731276]
Multi-modal Large Language Models (MLLMs) demonstrate impressive abilities over high-level perception and reasoning.<n>But their robustness in the wild remains limited, often falling short on tasks that are intuitive and effortless for humans.<n>We examine the hypothesis that these deficiencies stem from the absence of core knowledge--rudimentary cognitive abilities innate to humans from early childhood.
arXiv Detail & Related papers (2024-10-06T20:13:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.