Assessing Student Errors in Experimentation Using Artificial
Intelligence and Large Language Models: A Comparative Study with Human Raters
- URL: http://arxiv.org/abs/2308.06088v1
- Date: Fri, 11 Aug 2023 12:03:12 GMT
- Title: Assessing Student Errors in Experimentation Using Artificial
Intelligence and Large Language Models: A Comparative Study with Human Raters
- Authors: Arne Bewersdorff, Kathrin Seßler, Armin Baur, Enkelejda Kasneci,
Claudia Nerdel
- Abstract summary: We investigate the potential of Large Language Models (LLMs) for automatically identifying student errors.
An AI system based on the GPT-3.5 and GPT-4 series was developed and tested against human raters.
Our results indicate varying levels of accuracy in error detection between the AI system and human raters.
- Score: 9.899633398596672
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Identifying logical errors in complex, incomplete, or even
contradictory and overall heterogeneous data such as students' experimentation
protocols is
challenging. Recognizing the limitations of current evaluation methods, we
investigate the potential of Large Language Models (LLMs) for automatically
identifying student errors and streamlining teacher assessments. Our aim is to
provide a foundation for productive, personalized feedback. Using a dataset of
65 student protocols, an Artificial Intelligence (AI) system based on the
GPT-3.5 and GPT-4 series was developed and tested against human raters. Our
results indicate varying levels of accuracy in error detection between the AI
system and human raters. The AI system can accurately identify many fundamental
student errors: for instance, it reliably detects when a student focuses the
hypothesis not on the dependent variable but solely on an expected observation
(acc. = 0.90), when a student modifies the trials in an ongoing investigation
(acc. = 1.00), and whether a student conducts valid test trials (acc. = 0.82).
The identification of other, usually more complex errors, such as whether a
student conducts a valid control trial (acc. = 0.60),
poses a greater challenge. This research explores not only the utility of AI in
educational settings, but also contributes to the understanding of the
capabilities of LLMs in error detection in inquiry-based learning like
experimentation.
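The comparison reported above, scoring an AI system's error labels against human raters, reduces to per-category agreement. A minimal sketch of that calculation (the label data and category setup here are hypothetical illustrations, not the paper's dataset):

```python
# Per-error-category accuracy of AI labels against human rater labels.
# Labels are booleans: True = "this error is present in the protocol".

def category_accuracy(ai_labels, human_labels):
    """Fraction of protocols on which the AI agrees with the human rater."""
    if len(ai_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(a == h for a, h in zip(ai_labels, human_labels))
    return matches / len(ai_labels)

# Hypothetical ratings for 10 student protocols in one error category.
ai    = [True, False, True, True, False, True, False, False, True, True]
human = [True, False, True, False, False, True, False, True, True, True]

print(category_accuracy(ai, human))  # 0.8
```

Computing this per error category, rather than pooled, is what makes the reported pattern visible: simple errors score near 1.00 while complex ones like control-trial validity drop toward 0.60.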
Related papers
- Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors [78.53699244846285]
Large language models (LLMs) present an opportunity to scale high-quality personalized education to all.
LLMs struggle to precisely detect a student's errors and tailor their feedback to those errors.
Inspired by real-world teaching practice where teachers identify student errors and customize their response based on them, we focus on verifying student solutions.
arXiv Detail & Related papers (2024-07-12T10:11:40Z) - Beyond human subjectivity and error: a novel AI grading system [67.410870290301]
The grading of open-ended questions is a high-effort, high-impact task in education.
Recent breakthroughs in AI technology might facilitate such automation, but this has not been demonstrated at scale.
We introduce a novel automatic short answer grading (ASAG) system.
arXiv Detail & Related papers (2024-05-07T13:49:59Z) - Determining the Difficulties of Students With Dyslexia via Virtual
Reality and Artificial Intelligence: An Exploratory Analysis [0.0]
The VRAIlexia project has been created to tackle this issue by proposing two different tools.
The first one has been created and is being distributed among dyslexic students in Higher Education Institutions for conducting specific psychological and psychometric tests.
The second tool applies specific artificial intelligence algorithms to the data gathered via the application and other surveys.
arXiv Detail & Related papers (2024-01-15T20:26:09Z) - Student Mastery or AI Deception? Analyzing ChatGPT's Assessment
Proficiency and Evaluating Detection Strategies [1.633179643849375]
Generative AI systems such as ChatGPT have a disruptive effect on learning and assessment.
This work investigates the performance of ChatGPT by evaluating it across three courses.
arXiv Detail & Related papers (2023-11-27T20:10:13Z) - Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing
Perspective [63.92197404447808]
Large language models (LLMs) have shown some human-like cognitive abilities.
We propose an adaptive testing framework for LLM evaluation.
This approach dynamically adjusts the characteristics of the test questions, such as difficulty, based on the model's performance.
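The adaptive scheme summarized above can be sketched as a simple staircase loop that raises question difficulty after a correct answer and lowers it after an incorrect one (a hypothetical illustration of the idea, not the paper's actual framework):

```python
# Minimal adaptive-testing loop: difficulty steps up after a correct
# answer and down after an incorrect one, staircase-style.

def run_adaptive_test(answer_fn, n_questions=10, min_d=1, max_d=10):
    """answer_fn(difficulty) -> bool, True if the model answers correctly.
    Returns the sequence of difficulties presented."""
    difficulty = (min_d + max_d) // 2
    history = []
    for _ in range(n_questions):
        history.append(difficulty)
        if answer_fn(difficulty):
            difficulty = min(max_d, difficulty + 1)
        else:
            difficulty = max(min_d, difficulty - 1)
    return history

# Hypothetical model that succeeds on questions up to difficulty 7.
trace = run_adaptive_test(lambda d: d <= 7)
print(trace)  # converges to, then oscillates around, the ability level
```

The point of the adaptive design is efficiency: the test spends most of its questions near the model's ability boundary instead of wasting them on items that are far too easy or too hard.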
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Neural Causal Models for Counterfactual Identification and Estimation [62.30444687707919]
We study the evaluation of counterfactual statements through neural models.
First, we show that neural causal models (NCMs) are expressive enough.
Second, we develop an algorithm for simultaneously identifying and estimating counterfactual distributions.
arXiv Detail & Related papers (2022-09-30T18:29:09Z) - Cognitive Diagnosis with Explicit Student Vector Estimation and
Unsupervised Question Matrix Learning [53.79108239032941]
We propose an explicit student vector estimation (ESVE) method to estimate the student vectors of DINA.
We also propose an unsupervised method called bidirectional calibration algorithm (HBCA) to label the Q-matrix automatically.
The experimental results on two real-world datasets show that ESVE-DINA outperforms the DINA model on accuracy and that the Q-matrix labeled automatically by HBCA can achieve performance comparable to that obtained with the manually labeled Q-matrix.
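The DINA model referenced above predicts a correct response only when a student has mastered every skill the question requires (as specified by that question's Q-matrix row), softened by slip and guess probabilities. A minimal sketch of that response rule (parameter values here are illustrative, not taken from the paper):

```python
# DINA response rule: P(correct) = 1 - slip if the student has mastered
# every skill the question requires (per its Q-matrix row), else guess.

def dina_p_correct(skills, q_row, slip=0.1, guess=0.2):
    """skills: student mastery vector (0/1 per skill);
    q_row: Q-matrix row (1 where the question requires that skill)."""
    mastered_all = all(s == 1 for s, q in zip(skills, q_row) if q == 1)
    return (1 - slip) if mastered_all else guess

student = [1, 0, 1]                        # mastered skills 0 and 2
print(dina_p_correct(student, [1, 0, 1]))  # 0.9: all required skills mastered
print(dina_p_correct(student, [0, 1, 0]))  # 0.2: missing required skill 1
```

This conjunctive ("all-or-nothing") structure is why estimating the student vectors and labeling the Q-matrix are the two inputs the summarized paper targets with ESVE and HBCA.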
arXiv Detail & Related papers (2022-03-01T03:53:19Z) - Autonomous Reinforcement Learning: Formalism and Benchmarking [106.25788536376007]
Real-world embodied learning, such as that performed by humans and animals, is situated in a continual, non-episodic world.
Common benchmark tasks in RL are episodic, with the environment resetting between trials to provide the agent with multiple attempts.
This discrepancy presents a major challenge when attempting to take RL algorithms developed for episodic simulated environments and run them on real-world platforms.
arXiv Detail & Related papers (2021-12-17T16:28:06Z) - KANDINSKYPatterns -- An experimental exploration environment for Pattern
Analysis and Machine Intelligence [0.0]
We present KANDINSKYPatterns, named after the Russian artist Wassily Kandinsky, who made theoretical contributions to compositivity, i.e. the idea that all perceptions consist of geometrically elementary individual components.
KANDINSKYPatterns have computationally controllable properties, providing ground truth; at the same time, they are easily distinguishable by human observers, i.e., controlled patterns can be described by both humans and algorithms.
arXiv Detail & Related papers (2021-02-28T14:09:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.