Can the capability of Large Language Models be described by human ability? A Meta Study
- URL: http://arxiv.org/abs/2504.12332v1
- Date: Sun, 13 Apr 2025 08:34:11 GMT
- Title: Can the capability of Large Language Models be described by human ability? A Meta Study
- Authors: Mingrui Zan, Yunquan Zhang, Boyang Zhang, Fangming Liu, Daning Cheng
- Abstract summary: We collected performance data from over 80 models across 37 evaluation benchmarks. We have confirmed that certain capabilities of LLMs with fewer than 10 billion parameters can indeed be described using human ability metrics. While some abilities are considered interrelated in humans, they appear nearly uncorrelated in LLMs.
- Score: 10.516198272048488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Users of Large Language Models (LLMs) often perceive these models as intelligent entities with human-like capabilities. However, the extent to which LLMs' capabilities truly approximate human abilities remains a topic of debate. In this paper, to characterize the capabilities of LLMs in relation to human capabilities, we collected performance data from over 80 models across 37 evaluation benchmarks. The benchmarks are categorized into 6 primary abilities and 11 sub-abilities from a human-ability perspective. We then clustered the performance rankings into several categories and compared these clusters with classifications based on human ability aspects. Our findings lead to the following conclusions: 1. We have confirmed that certain capabilities of LLMs with fewer than 10 billion parameters can indeed be described using human ability metrics; 2. While some abilities are considered interrelated in humans, they appear nearly uncorrelated in LLMs; 3. The capabilities possessed by LLMs vary significantly with the parameter scale of the model.
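A minimal sketch of that cluster-and-compare step is shown below, assuming Spearman rank correlation between benchmarks, average-linkage hierarchical clustering, and the adjusted Rand index as the agreement measure; the benchmark names, model scores, and human-ability labels are placeholders, not the paper's data or confirmed procedure.

```python
# Sketch: cluster benchmarks by how similarly they rank models, then compare the
# clusters with a human-ability taxonomy. All names and numbers are placeholders.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
benchmarks = ["gsm8k", "mmlu", "hellaswag", "humaneval", "bbh", "arc"]  # hypothetical
scores = rng.random((80, len(benchmarks)))  # rows: models, cols: benchmark scores

# Spearman correlation between benchmarks measures how similarly they rank the models.
rho, _ = spearmanr(scores)
dist = squareform(1.0 - rho, checks=False)  # condensed dissimilarity matrix

# Hierarchically cluster benchmarks by ranking dissimilarity, then cut into 3 clusters.
clusters = fcluster(linkage(dist, method="average"), t=3, criterion="maxclust")

# Hypothetical mapping from each benchmark to a human-ability category.
human_labels = [0, 1, 1, 2, 0, 1]

# Agreement between the data-driven clusters and the human-ability taxonomy.
print("Adjusted Rand index:", adjusted_rand_score(human_labels, clusters))
```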
Related papers
- VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories.
These question types can be used to assess the visual reasoning capabilities of MLLMs from multiple perspectives.
Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z) - The Cognitive Capabilities of Generative AI: A Comparative Analysis with Human Benchmarks [17.5336703613751]
This study benchmarks leading large language models and vision language models against human performance on the Wechsler Adult Intelligence Scale (WAIS-IV).
Most models demonstrated exceptional capabilities in the storage, retrieval, and manipulation of tokens such as arbitrary sequences of letters and numbers.
Despite these broad strengths, we observed consistently poor performance on the Perceptual Reasoning Index (PRI) from multimodal models.
arXiv Detail & Related papers (2024-10-09T19:22:26Z) - Law of the Weakest Link: Cross Capabilities of Large Language Models [102.91861246827797]
We show that Large Language Models (LLMs) exhibit the "Law of the Weakest Link," where cross-capability performance is significantly constrained by the weakest component.
These results highlight the under-performance of LLMs in cross-capability tasks.
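As a toy illustration of the weakest-link effect (not the paper's actual scoring scheme), a cross-capability score bounded by its worst component stays low even when the average of the components is high:

```python
# Hypothetical capability scores; the names and values are made up for the example.
capabilities = {"reasoning": 0.82, "coding": 0.78, "tool_use": 0.35}

mean_score = sum(capabilities.values()) / len(capabilities)  # 0.65
weakest_link = min(capabilities.values())                    # 0.35, the binding constraint

print(f"mean={mean_score:.2f}, weakest-link bound={weakest_link:.2f}")
```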
arXiv Detail & Related papers (2024-09-30T05:12:01Z) - Human-like object concept representations emerge naturally in multimodal large language models [24.003766123531545]
We combined behavioral and neuroimaging analysis methods to uncover how the object concept representations in Large Language Models correlate with those of humans.
The resulting 66-dimensional embeddings were found to be highly stable and predictive, and exhibited semantic clustering akin to human mental representations.
This study advances our understanding of machine intelligence and informs the development of more human-like artificial cognitive systems.
arXiv Detail & Related papers (2024-07-01T08:17:19Z) - LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks [106.09361690937618]
There is an increasing trend towards evaluating NLP models with LLMs instead of human judgments. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations covering a broad range of evaluated properties and types of data. We evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations.
arXiv Detail & Related papers (2024-06-26T14:56:13Z) - Evidence of interrelated cognitive-like capabilities in large language models: Indications of artificial general intelligence or achievement? [0.0]
Large language models (LLMs) are advanced artificial intelligence (AI) systems that can perform a variety of tasks commonly found in human intelligence tests.
We investigated whether test scores may also exhibit positive intercorrelations.
We found strong empirical evidence for a positive manifold and a general factor of ability.
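In a simplified form, a positive manifold and a general factor can be probed by checking that pairwise score correlations are positive and that the first principal component of the correlation matrix explains a large share of variance; the scores below are simulated placeholders, not data from the cited study.

```python
# Simulated check for a positive manifold and a general factor of ability.
import numpy as np

rng = np.random.default_rng(1)
g = rng.normal(size=(100, 1))                       # latent "general ability" per model
tests = 0.7 * g + 0.3 * rng.normal(size=(100, 8))   # 8 test scores loading on g

corr = np.corrcoef(tests, rowvar=False)             # 8 x 8 correlation matrix
eigvals = np.linalg.eigvalsh(corr)[::-1]            # eigenvalues, largest first

print("all off-diagonal correlations positive:", (corr[~np.eye(8, dtype=bool)] > 0).all())
print("variance explained by first component:", eigvals[0] / eigvals.sum())
```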
arXiv Detail & Related papers (2023-10-17T22:42:12Z) - Understanding Social Reasoning in Language Models with Language Models [34.068368860882586]
We present a novel framework for generating evaluations with Large Language Models (LLMs) by populating causal templates.
We create a new social reasoning benchmark (BigToM) for LLMs, which consists of 25 controls and 5,000 model-written evaluations.
We find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations.
arXiv Detail & Related papers (2023-06-21T16:42:15Z) - Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z) - Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large Language Models with SocKET Benchmark [14.922083834969323]
Large language models (LLMs) have been shown to perform well at a variety of syntactic, discourse, and reasoning tasks.
We introduce a new theory-driven benchmark, SocKET, that contains 58 NLP tasks testing social knowledge.
arXiv Detail & Related papers (2023-05-24T09:21:06Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z) - Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood.
We find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z) - Dissociating language and thought in large language models [52.39241645471213]
Large Language Models (LLMs) have come closest among all models to date to mastering human language.
We ground this distinction in human neuroscience, which has shown that formal and functional competence rely on different neural mechanisms.
Although LLMs are surprisingly good at formal competence, their performance on functional competence tasks remains spotty.
arXiv Detail & Related papers (2023-01-16T22:41:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.