A Framework for Robust Cognitive Evaluation of LLMs
- URL: http://arxiv.org/abs/2504.02789v1
- Date: Thu, 03 Apr 2025 17:35:54 GMT
- Title: A Framework for Robust Cognitive Evaluation of LLMs
- Authors: Karin de Langis, Jong Inn Park, Bin Hu, Khanh Chi Le, Andreas Schramm, Michael C. Mensink, Andrew Elfenbein, Dongyeop Kang,
- Abstract summary: Emergent cognitive abilities in large language models (LLMs) have been widely observed, but their nature and underlying mechanisms remain poorly understood.<n>We develop CognitivEval, a framework for systematically evaluating the artificial cognitive capabilities of LLMs.
- Score: 13.822169295436177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emergent cognitive abilities in large language models (LLMs) have been widely observed, but their nature and underlying mechanisms remain poorly understood. A growing body of research draws on cognitive science to investigate LLM cognition, but standard methodologies and experimen-tal pipelines have not yet been established. To address this gap we develop CognitivEval, a framework for systematically evaluating the artificial cognitive capabilities of LLMs, with a particular emphasis on robustness in response collection. The key features of CognitivEval include: (i) automatic prompt permutations, and (ii) testing that gathers both generations and model probability estimates. Our experiments demonstrate that these features lead to more robust experimental outcomes. Using CognitivEval, we replicate five classic experiments in cognitive science, illustrating the framework's generalizability across various experimental tasks and obtaining a cognitive profile of several state of the art LLMs. CognitivEval will be released publicly to foster broader collaboration within the cognitive science community.
Related papers
- How Metacognitive Architectures Remember Their Own Thoughts: A Systematic Review [16.35521789216079]
We review how Computational Metacognitive Architectures (CMAs) model, store, remember and process their metacognitive experiences.
We consider different aspects - ranging from the underlying psychological theories to the content and structure of collected data, to the algorithms used and evaluation results.
arXiv Detail & Related papers (2025-02-28T08:48:41Z) - VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models [62.667142971664575]
We introduce VisFactor, a novel benchmark derived from the Factor-Referenced Cognitive Test (FRCT)<n>VisFactor digitalizes vision-related FRCT subtests to systematically evaluate MLLMs across essential visual cognitive tasks.<n>We present a comprehensive evaluation of state-of-the-art MLLMs, such as GPT-4o, Gemini-Pro, and Qwen-VL.
arXiv Detail & Related papers (2025-02-23T04:21:32Z) - Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs [64.9693406713216]
Internal mechanisms that contribute to the effectiveness of RAG systems remain underexplored.
Our experiments reveal that several core groups of experts are primarily responsible for RAG-related behaviors.
We propose several strategies to enhance RAG's efficiency and effectiveness through expert activation.
arXiv Detail & Related papers (2024-10-20T16:08:54Z) - Large Language Models and Cognitive Science: A Comprehensive Review of Similarities, Differences, and Challenges [14.739357670600102]
This comprehensive review explores the intersection of Large Language Models (LLMs) and cognitive science.
We analyze methods for evaluating LLMs cognitive abilities and discuss their potential as cognitive models.
We assess cognitive biases and limitations of LLMs, along with proposed methods for improving their performance.
arXiv Detail & Related papers (2024-09-04T02:30:12Z) - Cognitive LLMs: Towards Integrating Cognitive Architectures and Large Language Models for Manufacturing Decision-making [51.737762570776006]
LLM-ACTR is a novel neuro-symbolic architecture that provides human-aligned and versatile decision-making.
Our framework extracts and embeds knowledge of ACT-R's internal decision-making process as latent neural representations.
Our experiments on novel Design for Manufacturing tasks show both improved task performance as well as improved grounded decision-making capability.
arXiv Detail & Related papers (2024-08-17T11:49:53Z) - CogLM: Tracking Cognitive Development of Large Language Models [20.138831477848615]
We construct a benchmark CogLM based on Piaget's Theory of Cognitive Development.<n>CogLM comprises 1,220 questions spanning 10 cognitive abilities crafted by more than 20 human experts.<n>We find that advanced LLMs have demonstrated human-like cognitive abilities, comparable to those of a 20-year-old human.
arXiv Detail & Related papers (2024-08-17T09:49:40Z) - LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery [141.39722070734737]
We propose to enhance the knowledge-driven, abstract reasoning abilities of Large Language Models with the computational strength of simulations.
We introduce Scientific Generative Agent (SGA), a bilevel optimization framework.
We conduct experiments to demonstrate our framework's efficacy in law discovery and molecular design.
arXiv Detail & Related papers (2024-05-16T03:04:10Z) - CogGPT: Unleashing the Power of Cognitive Dynamics on Large Language Models [24.079412787914993]
We propose the concept of the cognitive dynamics of large language models (LLMs) and present a corresponding task with the inspiration of longitudinal studies.
Towards the task, we develop CogBench, a novel benchmark to assess the cognitive dynamics of LLMs and validate it through participant surveys.
We introduce CogGPT for the task, which features an innovative iterative cognitive mechanism aimed at enhancing lifelong cognitive dynamics.
arXiv Detail & Related papers (2024-01-06T03:59:59Z) - A Neuro-mimetic Realization of the Common Model of Cognition via Hebbian
Learning and Free Energy Minimization [55.11642177631929]
Large neural generative models are capable of synthesizing semantically rich passages of text or producing complex images.
We discuss the COGnitive Neural GENerative system, such an architecture that casts the Common Model of Cognition.
arXiv Detail & Related papers (2023-10-14T23:28:48Z) - Exploring the Cognitive Knowledge Structure of Large Language Models: An
Educational Diagnostic Assessment Approach [50.125704610228254]
Large Language Models (LLMs) have not only exhibited exceptional performance across various tasks, but also demonstrated sparks of intelligence.
Recent studies have focused on assessing their capabilities on human exams and revealed their impressive competence in different domains.
We conduct an evaluation using MoocRadar, a meticulously annotated human test dataset based on Bloom taxonomy.
arXiv Detail & Related papers (2023-10-12T09:55:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.