Reflection-Bench: Evaluating Epistemic Agency in Large Language Models
- URL: http://arxiv.org/abs/2410.16270v3
- Date: Wed, 04 Jun 2025 11:45:19 GMT
- Title: Reflection-Bench: Evaluating Epistemic Agency in Large Language Models
- Authors: Lingyu Li, Yixu Wang, Haiquan Zhao, Shuqi Kong, Yan Teng, Chunbo Li, Yingchun Wang,
- Abstract summary: Epistemic agency is the ability to flexibly construct, adapt, and monitor beliefs about dynamic environments.<n>We propose Reflection-Bench, a benchmark consisting of seven tasks with long-term relevance and minimization of data leakage.<n>Our findings suggest several promising research directions, including enhancing core cognitive functions, improving cross-functional coordination, and developing adaptive processing mechanisms.
- Score: 10.801745760525838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With large language models (LLMs) increasingly deployed as cognitive engines for AI agents, the reliability and effectiveness critically hinge on their intrinsic epistemic agency, which remains understudied. Epistemic agency, the ability to flexibly construct, adapt, and monitor beliefs about dynamic environments, represents a base-model-level capacity independent of specific tools, modules, or applications. We characterize the holistic process underlying epistemic agency, which unfolds in seven interrelated dimensions: prediction, decision-making, perception, memory, counterfactual thinking, belief updating, and meta-reflection. Correspondingly, we propose Reflection-Bench, a cognitive-psychology-inspired benchmark consisting of seven tasks with long-term relevance and minimization of data leakage. Through a comprehensive evaluation of 16 models using three prompting strategies, we identify a clear three-tier performance hierarchy and significant limitations of current LLMs, particularly in meta-reflection capabilities. While state-of-the-art LLMs demonstrate rudimentary signs of epistemic agency, our findings suggest several promising research directions, including enhancing core cognitive functions, improving cross-functional coordination, and developing adaptive processing mechanisms. Our code and data are available at https://github.com/AI45Lab/ReflectionBench.
Related papers
- A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence [87.08051686357206]
Large Language Models (LLMs) have demonstrated strong capabilities but remain fundamentally static.<n>As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck.<n>This survey provides the first systematic and comprehensive review of self-evolving agents.
arXiv Detail & Related papers (2025-07-28T17:59:05Z) - Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research [32.92036657863354]
Language agents powered by large language models (LLMs) have demonstrated remarkable capabilities in understanding, reasoning, and executing complex tasks.<n>However, developing robust agents presents significant challenges: substantial engineering overhead, lack of standardized components, and insufficient evaluation frameworks for fair comparison.<n>We introduce Agent Graph-based Orchestration for Reasoning and Assessment (AGORA), a flexible and abstraction framework that addresses these challenges.
arXiv Detail & Related papers (2025-05-30T08:46:23Z) - Review of Case-Based Reasoning for LLM Agents: Theoretical Foundations, Architectural Components, and Cognitive Integration [0.0]
Case-Based Reasoning (CBR) is a strategy that solves new problems by referencing past experiences.<n>This paper explores how Case-Based Reasoning (CBR), a strategy that solves new problems by referencing past experiences, can be integrated into Large Language Models.
arXiv Detail & Related papers (2025-04-09T14:51:02Z) - Measurement of LLM's Philosophies of Human Nature [113.47929131143766]
We design the standardized psychological scale specifically targeting large language models (LLM)
We show that current LLMs exhibit a systemic lack of trust in humans.
We propose a mental loop learning framework, which enables LLM to continuously optimize its value system.
arXiv Detail & Related papers (2025-04-03T06:22:19Z) - Large Language Model Agent: A Survey on Methodology, Applications and Challenges [88.3032929492409]
Large Language Model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence.<n>This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy.<n>Our work provides a unified architectural perspective, examining how agents are constructed, how they collaborate, and how they evolve over time.
arXiv Detail & Related papers (2025-03-27T12:50:17Z) - ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning [53.817538122688944]
We introduce Reinforced Meta-thinking Agents (ReMA) to elicit meta-thinking behaviors from Reasoning of Large Language Models (LLMs)<n>ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions.<n> Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks.
arXiv Detail & Related papers (2025-03-12T16:05:31Z) - Analyzing Advanced AI Systems Against Definitions of Life and Consciousness [0.0]
We propose a number of metrics for examining whether an advanced AI system has gained consciousness.
We suggest that sufficiently advanced architectures exhibiting immune like sabotage defenses, mirror self-recognition analogs, or meta-cognitive updates may cross key thresholds akin to life-like or consciousness-like traits.
arXiv Detail & Related papers (2025-02-07T15:27:34Z) - Meta-Reflection: A Feedback-Free Reflection Learning Framework [57.14485943991588]
We propose Meta-Reflection, a feedback-free reflection mechanism that requires only a single inference pass without external feedback.
Motivated by the human ability to remember and retrieve reflections from past experiences, Meta-Reflection integrates reflective insights into a codebook.
To thoroughly investigate and evaluate the practicality of Meta-Reflection in real-world scenarios, we introduce an industrial e-commerce benchmark named E-commerce Customer Intent Detection.
arXiv Detail & Related papers (2024-12-18T12:20:04Z) - Imagining and building wise machines: The centrality of AI metacognition [78.76893632793497]
We argue that shortcomings stem from one overarching failure: AI systems lack wisdom.
While AI research has focused on task-level strategies, metacognition is underdeveloped in AI systems.
We propose that integrating metacognitive capabilities into AI systems is crucial for enhancing their robustness, explainability, cooperation, and safety.
arXiv Detail & Related papers (2024-11-04T18:10:10Z) - Probing Ranking LLMs: A Mechanistic Analysis for Information Retrieval [20.353393773305672]
We employ a probing-based analysis to examine neuron activations in ranking LLMs.<n>Our study spans a broad range of feature categories, including lexical signals, document structure, query-document interactions, and complex semantic representations.<n>Our findings offer crucial insights for developing more transparent and reliable retrieval systems.
arXiv Detail & Related papers (2024-10-24T08:20:10Z) - X2-DFD: A framework for eXplainable and eXtendable Deepfake Detection [55.77552681618732]
X2-DFD is an eXplainable and eXtendable framework based on multimodal large-language models (MLLMs) for deepfake detection.<n>The first stage, Model Feature Assessment, systematically evaluates the detectability of forgery-related features for the MLLM.<n>The second stage, Explainable dataset Construction, consists of two key modules: Strong Feature Strengthening and Weak Feature Supplementing.<n>The third stage, Fine-tuning and Inference, involves fine-tuning the MLLM on the constructed dataset and deploying it for final detection and explanation.
arXiv Detail & Related papers (2024-10-08T15:28:33Z) - RAG-Modulo: Solving Sequential Tasks using Experience, Critics, and Language Models [5.0741409008225755]
Large language models (LLMs) have emerged as promising tools for solving challenging robotic tasks.
Most existing LLM-based agents lack the ability to retain and learn from past interactions.
We propose RAG-Modulo, a framework that enhances LLM-based agents with a memory of past interactions and incorporates critics to evaluate the agents' decisions.
arXiv Detail & Related papers (2024-09-18T20:03:32Z) - Cognitive LLMs: Towards Integrating Cognitive Architectures and Large Language Models for Manufacturing Decision-making [51.737762570776006]
LLM-ACTR is a novel neuro-symbolic architecture that provides human-aligned and versatile decision-making.
Our framework extracts and embeds knowledge of ACT-R's internal decision-making process as latent neural representations.
Our experiments on novel Design for Manufacturing tasks show both improved task performance as well as improved grounded decision-making capability.
arXiv Detail & Related papers (2024-08-17T11:49:53Z) - Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models [57.518784855080334]
Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants.
This paper presents a framework for investigating psychology dimension in LLMs, including psychological identification, assessment dataset curation, and assessment with results validation.
We introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence.
arXiv Detail & Related papers (2024-06-25T16:09:08Z) - MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [98.18244218156492]
Large Language Models (LLMs) have significantly advanced natural language processing.<n>As their applications expand into multi-agent environments, there arises a need for a comprehensive evaluation framework.<n>This work introduces a novel competition-based benchmark framework to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z) - Enabling High-Level Machine Reasoning with Cognitive Neuro-Symbolic
Systems [67.01132165581667]
We propose to enable high-level reasoning in AI systems by integrating cognitive architectures with external neuro-symbolic components.
We illustrate a hybrid framework centered on ACT-R and we discuss the role of generative models in recent and future applications.
arXiv Detail & Related papers (2023-11-13T21:20:17Z) - Large Language Models Understand and Can be Enhanced by Emotional
Stimuli [53.53886609012119]
We take the first step towards exploring the ability of Large Language Models to understand emotional stimuli.
Our experiments show that LLMs have a grasp of emotional intelligence, and their performance can be improved with emotional prompts.
Our human study results demonstrate that EmotionPrompt significantly boosts the performance of generative tasks.
arXiv Detail & Related papers (2023-07-14T00:57:12Z) - Brain in a Vat: On Missing Pieces Towards Artificial General
Intelligence in Large Language Models [83.63242931107638]
We propose four characteristics of generally intelligent agents.
We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations.
We conclude by outlining promising future research directions in the field of artificial general intelligence.
arXiv Detail & Related papers (2023-07-07T13:58:16Z) - OlaGPT: Empowering LLMs With Human-like Problem-Solving Abilities [19.83434949066066]
This paper introduces a novel intelligent framework, referred to as OlaGPT.
OlaGPT carefully studied a cognitive architecture framework, and propose to simulate certain aspects of human cognition.
The framework involves approximating different cognitive modules, including attention, memory, reasoning, learning, and corresponding scheduling and decision-making mechanisms.
arXiv Detail & Related papers (2023-05-23T09:36:51Z) - Machine Psychology [54.287802134327485]
We argue that a fruitful direction for research is engaging large language models in behavioral experiments inspired by psychology.
We highlight theoretical perspectives, experimental paradigms, and computational analysis techniques that this approach brings to the table.
It paves the way for a "machine psychology" for generative artificial intelligence (AI) that goes beyond performance benchmarks.
arXiv Detail & Related papers (2023-03-24T13:24:41Z) - Reflective Artificial Intelligence [2.7412662946127755]
Many important qualities that a human mind would have previously brought to the activity are utterly absent in AI.
One core feature that humans bring to tasks is reflection.
Yet this capability is utterly missing from current mainstream AI.
In this paper we ask what reflective AI might look like.
arXiv Detail & Related papers (2023-01-25T20:50:26Z) - Beyond Interpretable Benchmarks: Contextual Learning through Cognitive
and Multimodal Perception [0.0]
This study contends that the Turing Test is misinterpreted as an attempt to anthropomorphize computer systems.
It emphasizes tacit learning as a cornerstone of general-purpose intelligence, despite its lack of overt interpretability.
arXiv Detail & Related papers (2022-12-04T08:30:04Z) - Building Human-like Communicative Intelligence: A Grounded Perspective [1.0152838128195465]
After making astounding progress in language learning, AI systems seem to approach the ceiling that does not reflect important aspects of human communicative capacities.
This paper suggests that the dominant cognitively-inspired AI directions, based on nativist and symbolic paradigms, lack necessary substantiation and concreteness to guide progress in modern AI.
I propose a list of concrete, implementable components for building "grounded" linguistic intelligence.
arXiv Detail & Related papers (2022-01-02T01:43:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.