AI, Metacognition, and the Verification Bottleneck: A Three-Wave Longitudinal Study of Human Problem-Solving
- URL: http://arxiv.org/abs/2601.17055v1
- Date: Wed, 21 Jan 2026 15:49:04 GMT
- Title: AI, Metacognition, and the Verification Bottleneck: A Three-Wave Longitudinal Study of Human Problem-Solving
- Authors: Matthias Huemmer, Franziska Durner, Theophile Shyiramunda, Michelle J. Cummings-Koether,
- Abstract summary: This pilot study tracked how generative AI reshapes problem-solving over six months in an academic setting. Results generalize primarily to early-adopter, academically affiliated populations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This longitudinal pilot study tracked how generative AI reshapes problem-solving over six months across three waves in an academic setting. AI integration reached saturation by Wave 3, with daily use rising from 52.4% to 95.7% and ChatGPT adoption from 85.7% to 100%. A dominant hybrid workflow increased 2.7-fold, adopted by 39.1% of participants. The verification paradox emerged: participants relied most heavily on AI for difficult tasks (73.9%) yet showed declining verification confidence (68.1%) where performance was worst (47.8% accuracy on complex tasks). Objective performance declined systematically: 95.2% to 81.0% to 66.7% to 47.8% across problem difficulty, with belief-performance gaps widening to 34.6 percentage points. This indicates a fundamental shift where verification, not solution generation, became the bottleneck in human-AI problem-solving. The ACTIVE Framework synthesizes findings grounded in cognitive load theory: Awareness and task-AI alignment, Critical verification protocols, Transparent human-in-the-loop integration, Iterative skill development countering cognitive offloading, Verification confidence calibration, and Ethical evaluation. The authors provide implementation pathways for institutions and practitioners. Key limitations include sample homogeneity (academic cohort only, convenience sampling) limiting generalizability to corporate, clinical, or regulated professional contexts; self-report bias in confidence measures (32.2 percentage point divergence from objective performance); lack of control conditions; restriction to mathematical/analytical problems; and insufficient timeframe to assess long-term skill trajectories. Results generalize primarily to early-adopter, academically affiliated populations. Causal validation requires randomized controlled trials.
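The belief-performance gap reported in the abstract is simply the difference between self-reported confidence and objective accuracy per difficulty tier. A minimal sketch of that calculation follows; only the accuracy figures come from the abstract, while the confidence values are hypothetical placeholders chosen so the hardest tier reproduces the reported 34.6 percentage-point gap.

```python
# Illustrative calibration-gap calculation. Accuracy figures are those
# reported in the abstract (95.2% -> 47.8% across difficulty tiers);
# the confidence figures are hypothetical placeholders for illustration.
accuracy = {"easy": 95.2, "moderate": 81.0, "hard": 66.7, "complex": 47.8}
confidence = {"easy": 96.0, "moderate": 88.0, "hard": 80.0, "complex": 82.4}  # hypothetical

# Gap = stated confidence minus objective accuracy, in percentage points.
gaps = {tier: round(confidence[tier] - accuracy[tier], 1) for tier in accuracy}
for tier, gap in gaps.items():
    print(f"{tier}: {gap:+.1f} pp")
```

A positive gap indicates overconfidence; under these assumed confidence values the gap widens monotonically with task difficulty, matching the pattern the study describes.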
Related papers
- Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance [4.424336158797069]
This paper compares five popular AI-powered coding assistants (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code). Devin exhibits the only consistent positive trend in acceptance rate (+0.77% per week over 32 weeks). Our analysis suggests that the PR task type is a dominant factor influencing acceptance rates.
arXiv Detail & Related papers (2026-02-09T17:14:46Z) - Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models [108.26461635308796]
We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model's reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models. We introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training.
arXiv Detail & Related papers (2026-02-04T15:24:52Z) - ReasoningBomb: A Stealthy Denial-of-Service Attack by Inducing Pathologically Long Reasoning in Large Reasoning Models [67.15960154375131]
Large reasoning models (LRMs) extend large language models with explicit multi-step reasoning traces. This capability introduces a new class of prompt-induced inference-time denial-of-service (PI-DoS) attacks that exploit the high computational cost of reasoning. We present ReasoningBomb, a reinforcement-learning-based PI-DoS framework that is guided by a constant-time surrogate reward.
arXiv Detail & Related papers (2026-01-29T18:53:01Z) - Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model capability, and task properties. We derive a predictive model from coordination metrics (cross-validated R2), enabling prediction on unseen task domains. We identify three effects, including (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z) - On the Influence of Artificial Intelligence on Human Problem-Solving: Empirical Insights for the Third Wave in a Multinational Longitudinal Pilot Study [0.0]
This article investigates the evolving paradigm of human-AI collaboration in problem-solving contexts. Building upon previous waves, our findings reveal the consolidation of a hybrid problem-solving culture. The study concludes that educational and technological interventions must prioritize verification scaffolds.
arXiv Detail & Related papers (2025-11-13T10:20:07Z) - Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z) - Bias in the Loop: How Humans Evaluate AI-Generated Suggestions [9.578382668831988]
Human-AI collaboration increasingly drives decision-making across industries, from medical diagnosis to content moderation. We know little about the psychological factors that determine when these collaborations succeed or fail. We conducted a randomized experiment with 2,784 participants to examine how task design and individual characteristics shape human responses to AI-generated suggestions.
arXiv Detail & Related papers (2025-09-10T11:43:29Z) - A Confidence-Diversity Framework for Calibrating AI Judgement in Accessible Qualitative Coding Tasks [0.0]
Confidence-diversity calibration is a quality assessment framework for accessible coding tasks. Analysing 5,680 coding decisions from eight state-of-the-art LLMs, we find that mean self-confidence closely tracks inter-model agreement.
arXiv Detail & Related papers (2025-08-04T03:47:10Z) - Person Recognition at Altitude and Range: Fusion of Face, Body Shape and Gait [70.00430652562012]
FarSight is an end-to-end system for person recognition that integrates biometric cues across face, gait, and body shape modalities. FarSight incorporates novel algorithms across four core modules: multi-subject detection and tracking, recognition-aware video restoration, modality-specific biometric feature encoding, and quality-guided multi-modal fusion.
arXiv Detail & Related papers (2025-05-07T17:58:25Z) - Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces Math-RoB, a novel benchmark that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data. These findings highlight a reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z) - Generative AI for Requirements Engineering: A Systematic Literature Review [1.6986294649170766]
Generative pretrained transformer models dominate current applications. Industrial adoption remains nascent, with over 90% of studies corresponding to early-stage development. Despite the transformative potential of GenAI-based RE, several barriers hinder practical adoption.
arXiv Detail & Related papers (2024-09-10T02:44:39Z) - Biomedical image analysis competitions: The state of current
participation practice [143.52578599912326]
We designed a survey to shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis.
The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics.
Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures.
arXiv Detail & Related papers (2022-12-16T16:44:46Z) - Detecting cognitive decline using speech only: The ADReSSo Challenge [10.497861245133086]
The ADReSSo Challenge targets three difficult automatic prediction problems of societal and medical relevance.
This paper presents these prediction tasks in detail, describes the datasets used, and reports the results of the baseline classification and regression models we developed for each task.
arXiv Detail & Related papers (2021-03-23T01:09:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.