Empirically evaluating commonsense intelligence in large language models with large-scale human judgments
- URL: http://arxiv.org/abs/2505.10309v1
- Date: Thu, 15 May 2025 13:55:27 GMT
- Title: Empirically evaluating commonsense intelligence in large language models with large-scale human judgments
- Authors: Tuan Dung Nguyen, Duncan J. Watts, Mark E. Whiting
- Abstract summary: We propose a novel method for evaluating common sense in artificial intelligence. We measure the correspondence between a model's judgment and that of a human population. Our framework contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.
- Score: 4.7206754497888035
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Commonsense intelligence in machines is often assessed by static benchmarks that compare a model's output against human-prescribed correct labels. An important, albeit implicit, assumption of these labels is that they accurately capture what any human would think, effectively treating human common sense as homogeneous. However, recent empirical work has shown that humans vary enormously in what they consider commonsensical; thus what appears self-evident to one benchmark designer may not be so to another. Here, we propose a novel method for evaluating common sense in artificial intelligence (AI), specifically in large language models (LLMs), that incorporates empirically observed heterogeneity among humans by measuring the correspondence between a model's judgment and that of a human population. We first find that, when treated as independent survey respondents, most LLMs remain below the human median in their individual commonsense competence. Second, when used as simulators of a hypothetical population, LLMs correlate with real humans only modestly in the extent to which they agree on the same set of statements. In both cases, smaller, open-weight models are surprisingly more competitive than larger, proprietary frontier models. Our evaluation framework, which ties commonsense intelligence to its cultural basis, contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.
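As a rough illustration of the population-level comparison described in the abstract, the sketch below computes per-statement agreement rates from a matrix of human survey responses and from a matrix of LLM-sampled judgments, then reports their rank correlation. The paper's exact metric, data, and sampling protocol are not given here, so all names and numbers are illustrative assumptions.

```python
# Hypothetical sketch: correlate the fraction of humans who call each statement
# commonsensical with the same fraction estimated from LLM-sampled "respondents".
# The data and the choice of Spearman correlation are assumptions, not the
# paper's actual protocol.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Binary judgment matrices: rows = respondents, columns = statements,
# 1 = "this statement is common sense", 0 = "it is not".
human_judgments = rng.integers(0, 2, size=(1000, 50))  # stand-in for real survey data
llm_judgments = rng.integers(0, 2, size=(200, 50))     # stand-in for sampled LLM answers

# Per-statement agreement rate in each population.
human_agreement = human_judgments.mean(axis=0)
llm_agreement = llm_judgments.mean(axis=0)

# Population-level correspondence: rank correlation of the two agreement profiles.
rho, p_value = spearmanr(human_agreement, llm_agreement)
print(f"Spearman rho between human and LLM agreement rates: {rho:.2f} (p = {p_value:.3f})")
```

With real data, human_judgments would hold the survey responses and llm_judgments repeated model queries (e.g., one sampled answer per simulated respondent); the individual-level evaluation mentioned in the abstract would instead score each LLM as a single respondent against the human distribution.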
Related papers
- Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models [93.1043186636177]
We explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea -- a "Model Synthesis Architecture" (MSA). We evaluate our MSA as a model of human judgments on a novel reasoning dataset.
arXiv Detail & Related papers (2025-07-16T18:01:03Z)
- From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning [52.32745233116143]
Humans organize knowledge into compact categories through semantic compression. Large Language Models (LLMs) demonstrate remarkable linguistic abilities. But whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear.
arXiv Detail & Related papers (2025-05-21T16:29:00Z)
- HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation [38.614841553065766]
Large multimodal models (LMMs) have been widely tested on tasks like visual question answering (VQA), image captioning, and grounding. HumaniBench is a novel benchmark of 32,000 real-world image-question pairs and an evaluation suite. HumaniBench assesses LMMs across seven key alignment principles: fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality.
arXiv Detail & Related papers (2025-05-16T17:09:44Z)
- Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans [9.315735862658244]
We propose Human-Aligned Bench, a benchmark for alignment of multimodal reasoning with human performance. We collected 9,794 multimodal questions that solely rely on contextual reasoning, including bilingual (Chinese and English) multimodal questions and pure text-based questions. Extensive experiments on the Human-Aligned Bench reveal notable differences between the performance of current MLLMs in multimodal reasoning and human performance.
arXiv Detail & Related papers (2025-05-16T11:41:19Z)
- AI-enhanced semantic feature norms for 786 concepts [8.68405554675708]
We introduce a novel approach that augments a dataset of human-generated feature norms with responses from large language models (LLMs). We find that our AI-enhanced feature norm dataset, NOVA: Norms Optimized Via AI, shows much higher feature density and overlap among concepts.
arXiv Detail & Related papers (2025-05-15T21:43:34Z)
- How do Humans and Language Models Reason About Creativity? A Comparative Analysis [12.398832289718703]
We conducted two experiments examining how including example solutions with ratings impacts creativity evaluation. In Study 1, we analyzed creativity ratings from 72 experts with formal science or engineering training. In Study 2, parallel analyses with state-of-the-art LLMs revealed that models prioritized uncommonness and remoteness of ideas when rating originality.
arXiv Detail & Related papers (2025-02-05T15:08:43Z)
- Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge [51.93909886542317]
We show how *relying on a single aggregate correlation score* can obscure fundamental differences between human labels and those from automatic evaluation. We propose stratifying data by human label uncertainty to provide a more robust analysis of automatic evaluation performance.
arXiv Detail & Related papers (2024-10-03T03:08:29Z)
- Scaling Laws Do Not Scale [54.72120385955072]
Recent work has argued that as the size of a dataset increases, the performance of a model trained on that dataset will increase.
We argue that this scaling law relationship depends on metrics used to measure performance that may not correspond with how different groups of people perceive the quality of models' output.
Different communities may also have values in tension with each other, leading to difficult, potentially irreconcilable choices about metrics used for model evaluations.
arXiv Detail & Related papers (2023-07-05T15:32:21Z)
- Training Socially Aligned Language Models on Simulated Social Interactions [99.39979111807388]
Social alignment in AI systems aims to ensure that these models behave according to established societal values.
Current language models (LMs) are trained to rigidly replicate their training corpus in isolation.
This work presents a novel training paradigm that permits LMs to learn from simulated social interactions.
arXiv Detail & Related papers (2023-05-26T14:17:36Z)
- Human and AI Perceptual Differences in Image Classification Errors [13.045020949359621]
This study first analyzes the statistical distributions of mistakes from the two sources and then explores how task difficulty level affects these distributions. We find that even when AI learns an excellent model from the training data, one that outperforms humans in overall accuracy, these AI models have significant and consistent differences from human perception.
arXiv Detail & Related papers (2023-04-18T05:09:07Z)
- Out of One, Many: Using Language Models to Simulate Human Samples [3.278541277919869]
We show that the "algorithmic bias" within one such tool -- the GPT-3 language model -- is both fine-grained and demographically correlated.
We create "silicon samples" by conditioning the model on thousands of socio-demographic backstories from real human participants.
arXiv Detail & Related papers (2022-09-14T19:53:32Z)
- WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model [74.4875156387271]
We develop a novel foundation model pre-trained with huge multimodal (visual and textual) data.
We show that state-of-the-art results can be obtained on a wide range of downstream tasks.
arXiv Detail & Related papers (2021-10-27T12:25:21Z)
- What's the best place for an AI conference, Vancouver or ______: Why completing comparative questions is difficult [22.04829832439774]
We study the ability of neural LMs to ask (not answer) reasonable questions.
We show that accuracy in this fill-in-the-blank task is well-correlated with human judgements of whether a question is reasonable.
arXiv Detail & Related papers (2021-04-05T14:56:09Z)