AI Models Exceed Individual Human Accuracy in Predicting Everyday Social Norms
- URL: http://arxiv.org/abs/2508.19004v1
- Date: Tue, 26 Aug 2025 13:03:56 GMT
- Title: AI Models Exceed Individual Human Accuracy in Predicting Everyday Social Norms
- Authors: Pontus Strimling, Simon Karlsson, Irina Vartanova, Kimmo Eriksson,
- Abstract summary: We investigate whether large language models can achieve sophisticated norm understanding through statistical learning alone. Across two studies, we evaluate multiple AI systems' ability to predict human social appropriateness judgments. Despite this predictive power, all models showed systematic, correlated errors.
- Score: 0.4666493857924357
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A fundamental question in cognitive science concerns how social norms are acquired and represented. While humans typically learn norms through embodied social experience, we investigated whether large language models can achieve sophisticated norm understanding through statistical learning alone. Across two studies, we systematically evaluated multiple AI systems' ability to predict human social appropriateness judgments for 555 everyday scenarios by examining how closely they predicted the average judgment compared to each human participant. In Study 1, GPT-4.5's accuracy in predicting the collective judgment on a continuous scale exceeded that of every human participant (100th percentile). Study 2 replicated this, with Gemini 2.5 Pro outperforming 98.7% of humans, GPT-5 97.8%, and Claude Sonnet 4 96.0%. Despite this predictive power, all models showed systematic, correlated errors. These findings demonstrate that sophisticated models of social cognition can emerge from statistical learning over linguistic data alone, challenging strong versions of theories emphasizing the exclusive necessity of embodied experience for cultural competence. The systematic nature of AI limitations across different architectures indicates potential boundaries of pattern-based social understanding, while the models' ability to outperform nearly all individual humans in this predictive task suggests that language serves as a remarkably rich repository for cultural knowledge transmission.
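The comparison described in the abstract (how closely a model predicts the average judgment, relative to each human participant) can be illustrated with a minimal sketch. It is not the authors' code. It assumes mean absolute error as the distance measure and a plain per-scenario mean as the collective judgment. The abstract specifies neither, and the function names and toy ratings below are hypothetical.

```python
from statistics import mean


def mean_abs_error(predictions, targets):
    """Average absolute gap between two equal-length rating lists."""
    return mean(abs(p - t) for p, t in zip(predictions, targets))


def model_percentile(human_ratings, model_ratings):
    """Percentage of participants whose error against the collective
    judgment is strictly larger than the model's error.

    human_ratings: one list per participant, one rating per scenario.
    model_ratings: the model's predicted rating per scenario.
    Assumption: the collective judgment is the simple mean over all
    participants (no leave-one-out correction).
    """
    n_scenarios = len(model_ratings)
    collective = [mean(r[i] for r in human_ratings) for i in range(n_scenarios)]
    model_err = mean_abs_error(model_ratings, collective)
    human_errs = [mean_abs_error(r, collective) for r in human_ratings]
    beaten = sum(1 for e in human_errs if e > model_err)
    return 100.0 * beaten / len(human_errs)


# Toy data (hypothetical): 4 participants rating 3 scenarios.
humans = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [1, 4, 2]]
model = [1.75, 3.25, 3.5]
print(model_percentile(humans, model))
```

In this sketch a model that matches the per-scenario means exactly outperforms every participant, which corresponds to the "100th percentile" case reported for Study 1.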
Related papers
- HumanLLM: Towards Personalized Understanding and Simulation of Human Nature [72.55730315685837]
HumanLLM is a foundation model designed for personalized understanding and simulation of individuals. We first construct the Cognitive Genome, a large-scale corpus curated from real-world user data on platforms like Reddit, Twitter, Blogger, and Amazon. We then formulate diverse learning tasks and perform supervised fine-tuning to empower the model to predict a wide range of individualized human behaviors, thoughts, and experiences.
arXiv Detail & Related papers (2026-01-22T09:27:27Z)
- HUMANLLM: Benchmarking and Reinforcing LLM Anthropomorphism via Human Cognitive Patterns [59.17423586203706]
We present HUMANLLM, a framework treating psychological patterns as interacting causal forces. We construct 244 patterns from 12,000 academic papers and synthesize 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other. Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment.
arXiv Detail & Related papers (2026-01-15T08:56:53Z)
- Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test [62.17144846428715]
We introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow,wo,val). Wow-wo-val examines five core abilities, including perception, planning, prediction, generalization and execution. For the Inverse Dynamic Model Turing Test, we first use an IDM to evaluate the video foundation models' execution accuracy in the real world.
arXiv Detail & Related papers (2026-01-07T17:50:37Z)
- The Catastrophic Paradox of Human Cognitive Frameworks in Large Language Model Evaluation: A Comprehensive Empirical Analysis of the CHC-LLM Incompatibility [0.0]
Models achieving above-average human IQ scores simultaneously exhibit binary accuracy rates approaching zero on crystallized knowledge tasks. This disconnect appears most strongly in the crystallized intelligence domain. We propose a framework for developing native machine cognition assessments that recognize the non-human nature of artificial intelligence.
arXiv Detail & Related papers (2025-11-23T05:49:57Z)
- Empirically evaluating commonsense intelligence in large language models with large-scale human judgments [4.7206754497888035]
We propose a novel method for evaluating common sense in artificial intelligence. We measure the correspondence between a model's judgment and that of a human population. Our framework contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.
arXiv Detail & Related papers (2025-05-15T13:55:27Z)
- Deterministic AI Agent Personality Expression through Standard Psychological Diagnostics [0.0]
We show that AI models can express deterministic and consistent personalities when instructed using established psychological frameworks. More advanced models like GPT-4o and o1 demonstrate the highest accuracy in expressing specified personalities. These findings establish a foundation for creating AI agents with diverse and consistent personalities.
arXiv Detail & Related papers (2025-03-21T12:12:05Z)
- Social Genome: Grounded Social Reasoning Abilities of Multimodal Models [61.88413918026431]
Social reasoning abilities are crucial for AI systems to interpret and respond to multimodal human communication and interaction within social contexts. We introduce SOCIAL GENOME, the first benchmark for fine-grained, grounded social reasoning abilities of multimodal models.
arXiv Detail & Related papers (2025-02-21T00:05:40Z)
- The Cognitive Capabilities of Generative AI: A Comparative Analysis with Human Benchmarks [17.5336703613751]
This study benchmarks leading large language models and vision language models against human performance on the Wechsler Adult Intelligence Scale (WAIS-IV).
Most models demonstrated exceptional capabilities in the storage, retrieval, and manipulation of tokens such as arbitrary sequences of letters and numbers.
Despite these broad strengths, we observed consistently poor performance on the Perceptual Reasoning Index (PRI) from multimodal models.
arXiv Detail & Related papers (2024-10-09T19:22:26Z)
- Evaluating and Modeling Social Intelligence: A Comparative Study of Human and AI Capabilities [29.18360187129556]
This study introduces a benchmark for evaluating social intelligence, one of the most distinctive aspects of human cognition.
We developed a comprehensive theoretical framework for social dynamics and introduced two evaluation tasks: Inverse Reasoning (IR) and Inverse Inverse Planning (IIP).
Extensive experiments and analyses revealed that humans surpassed the latest GPT models in overall performance, zero-shot learning, one-shot generalization, and adaptability to multi-modalities.
arXiv Detail & Related papers (2024-05-20T07:34:48Z)
- Large Language Models Can Infer Psychological Dispositions of Social Media Users [1.0923877073891446]
We test whether GPT-3.5 and GPT-4 can derive the Big Five personality traits from users' Facebook status updates in a zero-shot learning scenario.
Our results show an average correlation of r = .29 (range = [.22, .33]) between LLM-inferred and self-reported trait scores.
Predictions were found to be more accurate for women and younger individuals on several traits, suggesting a potential bias stemming from the underlying training data or differences in online self-expression.
arXiv Detail & Related papers (2023-09-13T01:27:48Z)
- Training Socially Aligned Language Models on Simulated Social Interactions [99.39979111807388]
Social alignment in AI systems aims to ensure that these models behave according to established societal values.
Current language models (LMs) are trained to rigidly replicate their training corpus in isolation.
This work presents a novel training paradigm that permits LMs to learn from simulated social interactions.
arXiv Detail & Related papers (2023-05-26T14:17:36Z)
- Empirical Estimates on Hand Manipulation are Recoverable: A Step Towards Individualized and Explainable Robotic Support in Everyday Activities [80.37857025201036]
A key challenge for robotic systems is to figure out the behavior of another agent.
Processing correct inferences is especially challenging when (confounding) factors are not controlled experimentally.
We propose equipping robots with the necessary tools to conduct observational studies on people.
arXiv Detail & Related papers (2022-01-27T22:15:56Z)
- Machine Common Sense [77.34726150561087]
Machine common sense remains a broad, potentially unbounded problem in artificial intelligence (AI).
This article deals with aspects of modeling commonsense reasoning, focusing on domains such as interpersonal interactions.
arXiv Detail & Related papers (2020-06-15T13:59:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.