Related papers: ChatGPT and Gemini participated in the Korean College Scholastic Ability Test -- Earth Science I

ChatGPT and Gemini participated in the Korean College Scholastic Ability Test -- Earth Science I

URL: http://arxiv.org/abs/2512.15298v1
Date: Wed, 17 Dec 2025 10:46:41 GMT
Title: ChatGPT and Gemini participated in the Korean College Scholastic Ability Test -- Earth Science I
Authors: Seok-Hyun Ga, Chun-Yen Chang,
Abstract summary: This study utilizes the Earth Science I section of the 2025 Korean College Scholastic Ability Test (CSAT) to analyze the multimodal scientific reasoning capabilities and cognitive limitations of state-of-the-art Large Language Models (LLMs)<n> Quantitative results indicated that unstructured inputs led to significant performance degradation due to segmentation and Optical Character Recognition (OCR) failures.<n>By exploiting AI's weaknesses, educators can distinguish genuine student competency from AI-generated responses, thereby ensuring assessment fairness.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid development of Generative AI is bringing innovative changes to education and assessment. As the prevalence of students utilizing AI for assignments increases, concerns regarding academic integrity and the validity of assessments are growing. This study utilizes the Earth Science I section of the 2025 Korean College Scholastic Ability Test (CSAT) to deeply analyze the multimodal scientific reasoning capabilities and cognitive limitations of state-of-the-art Large Language Models (LLMs), including GPT-4o, Gemini 2.5 Flash, and Gemini 2.5 Pro. Three experimental conditions (full-page input, individual item input, and optimized multimodal input) were designed to evaluate model performance across different data structures. Quantitative results indicated that unstructured inputs led to significant performance degradation due to segmentation and Optical Character Recognition (OCR) failures. Even under optimized conditions, models exhibited fundamental reasoning flaws. Qualitative analysis revealed that "Perception Errors" were dominant, highlighting a "Perception-Cognition Gap" where models failed to interpret symbolic meanings in schematic diagrams despite recognizing visual data. Furthermore, models demonstrated a "Calculation-Conceptualization Discrepancy," successfully performing calculations while failing to apply the underlying scientific concepts, and "Process Hallucination," where models skipped visual verification in favor of plausible but unfounded background knowledge. Addressing the challenge of unauthorized AI use in coursework, this study provides actionable cues for designing "AI-resistant questions" that target these specific cognitive vulnerabilities. By exploiting AI's weaknesses, such as the gap between perception and cognition, educators can distinguish genuine student competency from AI-generated responses, thereby ensuring assessment fairness.

Related papers

Designing AI-Resilient Assessments Using Interconnected Problems: A Theoretically Grounded and Empirically Validated Framework [0.0]
The rapid adoption of generative AI has undermined traditional modular assessments in computing education.<n>This paper presents a theoretically grounded framework for designing AI-resilient assessments.
arXiv Detail & Related papers (2025-12-11T15:53:19Z)
Assessment Twins: A Protocol for AI-Vulnerable Summative Assessment [0.0]
We introduce assessment twins as an accessible approach for redesigning assessment tasks to enhance validity.<n>We use Messick's unified validity framework to systematically map the ways in which GenAI threaten content, structural, consequential, generalisability, and external validity.<n>We argue that the twin approach helps mitigate validity threats by triangulating evidence across complementary formats.
arXiv Detail & Related papers (2025-10-03T12:05:34Z)
RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark [71.3555284685426]
We introduce RealUnify, a benchmark designed to evaluate bidirectional capability synergy.<n>RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks.<n>We find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient.
arXiv Detail & Related papers (2025-09-29T15:07:28Z)
When Models Know More Than They Can Explain: Quantifying Knowledge Transfer in Human-AI Collaboration [79.69935257008467]
We introduce Knowledge Integration and Transfer Evaluation (KITE), a conceptual and experimental framework for Human-AI knowledge transfer capabilities.<n>We conduct the first large-scale human study (N=118) explicitly designed to measure it.<n>In our two-phase setup, humans first ideate with an AI on problem-solving strategies, then independently implement solutions, isolating model explanations' influence on human understanding.
arXiv Detail & Related papers (2025-06-05T20:48:16Z)
Dynamic Programming Techniques for Enhancing Cognitive Representation in Knowledge Tracing [125.75923987618977]
We propose the Cognitive Representation Dynamic Programming based Knowledge Tracing (CRDP-KT) model.<n>It is a dynamic programming algorithm to optimize cognitive representations based on the difficulty of the questions and the performance intervals between them.<n>It provides more accurate and systematic input features for subsequent model training, thereby minimizing distortion in the simulation of cognitive states.
arXiv Detail & Related papers (2025-06-03T14:44:48Z)
Distinguishing Fact from Fiction: Student Traits, Attitudes, and AI Hallucination Detection in Business School Assessment [2.3359837623080613]
We examine how academic skills, cognitive traits, and AI scepticism influence students' ability to detect factually incorrect AI-generated responses (hallucinations) in a high-stakes assessment at a UK business school.<n>We find that only 20% successfully identified the hallucination, with strong academic performance, interpretive skills thinking, writing proficiency, and AI scepticism emerging as key predictors.
arXiv Detail & Related papers (2025-05-28T18:39:57Z)
The Imitation Game for Educational AI [23.71250100390303]
We present a novel evaluation framework based on a two-phase Turing-like test.<n>In Phase 1, students provide open-ended responses to questions, revealing natural misconceptions.<n>In Phase 2, both AI and human experts, conditioned on each student's specific mistakes, generate distractors for new related questions.
arXiv Detail & Related papers (2025-02-21T01:14:55Z)
Learning to Generate and Evaluate Fact-checking Explanations with Transformers [10.970249299147866]
Research contributes to the field of Explainable Artificial Antelligence (XAI) We develop transformer-based fact-checking models that contextualise and justify their decisions by generating human-accessible explanations. We emphasise the need for aligning Artificial Intelligence (AI)-generated explanations with human judgements.
arXiv Detail & Related papers (2024-10-21T06:22:51Z)
Evaluation of OpenAI o1: Opportunities and Challenges of AGI [100.85218639544654]
o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance.<n>The model excelled in tasks requiring intricate reasoning and knowledge integration across various fields.<n>Overall results indicate significant progress towards artificial general intelligence.
arXiv Detail & Related papers (2024-09-27T06:57:00Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
Neural Causal Models for Counterfactual Identification and Estimation [62.30444687707919]
We study the evaluation of counterfactual statements through neural models. First, we show that neural causal models (NCMs) are expressive enough. Second, we develop an algorithm for simultaneously identifying and estimating counterfactual distributions.
arXiv Detail & Related papers (2022-09-30T18:29:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.