Related papers: Can generative AI and ChatGPT outperform humans on cognitive-demanding problem-solving tasks in science?

URL: http://arxiv.org/abs/2401.15081v1
Date: Sun, 7 Jan 2024 12:36:31 GMT
Title: Can generative AI and ChatGPT outperform humans on cognitive-demanding problem-solving tasks in science?
Authors: Xiaoming Zhai, Matthew Nyaaba, and Wenchao Ma
Abstract summary: This study compared the performance of ChatGPT and GPT-4 with that of students on the 2019 NAEP science assessments, broken down by the cognitive demands of the items. Results showed that both ChatGPT and GPT-4 consistently outperformed most students who answered the NAEP science assessments.
Score: 1.1172147007388977
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This study aimed to examine an assumption that generative artificial intelligence (GAI) tools can overcome the cognitive intensity that humans suffer when solving problems. We compared the performance of ChatGPT and GPT-4 on 2019 NAEP science assessments with students by cognitive demands of the items. Fifty-four tasks were coded by experts using a two-dimensional cognitive load framework, including task cognitive complexity and dimensionality. ChatGPT and GPT-4 responses were scored using the scoring keys of NAEP. The analysis of the available data was based on the average student ability scores for students who answered each item correctly and the percentage of students who responded to individual items. Results showed that both ChatGPT and GPT-4 consistently outperformed most students who answered the NAEP science assessments. As the cognitive demand for NAEP tasks increases, statistically higher average student ability scores are required to correctly address the questions. This pattern was observed for students in grades 4, 8, and 12, respectively. However, ChatGPT and GPT-4 were not statistically sensitive to the increase in cognitive demands of the tasks, except for Grade 4. As the first study focusing on comparing GAI and K-12 students in problem-solving in science, this finding implies the need for changes to educational objectives to prepare students with competence to work with GAI tools in the future. Education ought to emphasize the cultivation of advanced cognitive skills rather than depending solely on tasks that demand cognitive intensity. This approach would foster critical thinking, analytical skills, and the application of knowledge in novel contexts. Findings also suggest the need for innovative assessment practices by moving away from cognitive intensity tasks toward creativity and analytical skills to avoid the negative effects of GAI on testing more efficiently.

Related papers

Investigating Large Language Models in Diagnosing Students' Cognitive Skills in Math Problem-solving [23.811625065982486]
We investigate how state-of-the-art large language models diagnose students' cognitive skills in mathematics. We constructed MathCog, a novel benchmark dataset comprising 639 student responses to 110 middle school math problems. Our evaluation reveals that even state-of-the-art LLMs struggle with the task, with all F1 scores below 0.5, and tend to exhibit strong false confidence in incorrect cases.
arXiv Detail & Related papers (2025-04-01T14:29:41Z)
Beyond Detection: Designing AI-Resilient Assessments with Automated Feedback Tool to Foster Critical Thinking [0.0]
This research proposes a proactive, AI-resilient solution based on assessment design rather than detection. It introduces a web-based Python tool that integrates Bloom's taxonomy with advanced natural language processing techniques. It helps educators determine whether a task targets lower-order thinking such as recall and summarization or higher-order skills such as analysis, evaluation, and creation.
arXiv Detail & Related papers (2025-03-30T23:13:00Z)
Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams [48.99818550820575]
We leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams. Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques.
arXiv Detail & Related papers (2024-11-07T22:51:47Z)
ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning [78.42927884000673]
ExACT is an approach that combines test-time search and self-learning to build o1-like models for agentic applications. We first introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance AI agents' ability to explore the decision space on the fly. Next, we introduce Exploratory Learning, a novel learning strategy to teach agents to search at inference time without relying on any external search algorithms.
arXiv Detail & Related papers (2024-10-02T21:42:35Z)
Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants [175.9723801486487]
We evaluate whether two AI assistants, GPT-3.5 and GPT-4, can adequately answer assessment questions. GPT-4 answers an average of 65.8% of questions correctly, and can even produce the correct answer across at least one prompting strategy for 85.1% of questions. Our results call for revising program-level assessment design in higher education in light of advances in generative AI.
arXiv Detail & Related papers (2024-08-07T12:11:49Z)
GPT as Psychologist? Preliminary Evaluations for GPT-4V on Visual Affective Computing [74.68232970965595]
Multimodal large language models (MLLMs) are designed to process and integrate information from multiple sources, such as text, speech, images, and videos. This paper assesses the application of MLLMs with 5 crucial abilities for affective computing, spanning visual affective tasks and reasoning tasks.
arXiv Detail & Related papers (2024-03-09T13:56:25Z)
Student Mastery or AI Deception? Analyzing ChatGPT's Assessment Proficiency and Evaluating Detection Strategies [1.633179643849375]
Generative AI systems such as ChatGPT have a disruptive effect on learning and assessment. This work investigates the performance of ChatGPT by evaluating it across three courses.
arXiv Detail & Related papers (2023-11-27T20:10:13Z)
Large Language Models Understand and Can be Enhanced by Emotional Stimuli [53.53886609012119]
We take the first step towards exploring the ability of Large Language Models to understand emotional stimuli. Our experiments show that LLMs have a grasp of emotional intelligence, and their performance can be improved with emotional prompts. Our human study results demonstrate that EmotionPrompt significantly boosts the performance of generative tasks.
arXiv Detail & Related papers (2023-07-14T00:57:12Z)
Comparative Analysis of GPT-4 and Human Graders in Evaluating Praise Given to Students in Synthetic Dialogues [2.3361634876233817]
Large language models, such as the AI-chatbot ChatGPT, hold potential for offering constructive feedback to tutors in practical settings. The accuracy of AI-generated feedback remains uncertain, with scant research investigating the ability of models like ChatGPT to deliver effective feedback.
arXiv Detail & Related papers (2023-07-05T04:14:01Z)
Game of Tones: Faculty detection of GPT-4 generated content in university assessments [0.0]
This study explores the robustness of university assessments against the use of OpenAI's Generative Pre-trained Transformer 4 (GPT-4). It evaluates the ability of academic staff to detect its use when supported by an Artificial Intelligence (AI) detection tool.
arXiv Detail & Related papers (2023-05-29T13:31:58Z)
Sparks of Artificial General Intelligence: Early experiments with GPT-4 [66.1188263570629]
GPT-4, developed by OpenAI, was trained using an unprecedented scale of compute and data. We demonstrate that GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more. We believe GPT-4 could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.
arXiv Detail & Related papers (2023-03-22T16:51:28Z)
Mind meets machine: Unravelling GPT-4's cognitive psychology [0.7302002320865727]
Large language models (LLMs) are emerging as potent tools increasingly capable of performing human-level tasks. This study focuses on the evaluation of GPT-4's performance on datasets such as CommonsenseQA, SuperGLUE, MATH and HANS. We show that GPT-4 exhibits a high level of accuracy in cognitive psychology tasks relative to the prior state-of-the-art models.
arXiv Detail & Related papers (2023-03-20T20:28:26Z)
ChatGPT: Jack of all trades, master of none [4.693597927153063]
OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT). We examined ChatGPT's capabilities on 25 diverse analytical NLP tasks. We automated the ChatGPT and GPT-4 prompting process and analyzed more than 49k responses.
arXiv Detail & Related papers (2023-02-21T15:20:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.