How Adding Metacognitive Requirements in Support of AI Feedback in Practice Exams Transforms Student Learning Behaviors
- URL: http://arxiv.org/abs/2505.13381v1
- Date: Mon, 19 May 2025 17:25:07 GMT
- Title: How Adding Metacognitive Requirements in Support of AI Feedback in Practice Exams Transforms Student Learning Behaviors
- Authors: Mak Ahmad, Prerna Ravi, David Karger, Marc Facciotti
- Abstract summary: We present an empirically evaluated practice exam system that integrates AI-generated feedback with targeted textbook references. Our system encourages metacognitive behavior by asking students to explain their answers and declare their confidence. It uses OpenAI's GPT-4o to generate personalized feedback based on this information while directing students to relevant textbook sections.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Providing personalized, detailed feedback at scale in large undergraduate STEM courses remains a persistent challenge. We present an empirically evaluated practice exam system that integrates AI-generated feedback with targeted textbook references, deployed in a large introductory biology course. Our system encourages metacognitive behavior by asking students to explain their answers and declare their confidence. It uses OpenAI's GPT-4o to generate personalized feedback based on this information while directing students to relevant textbook sections. Through interaction logs from consenting participants across three midterms (541, 342, and 413 students, respectively), totaling 28,313 question-student interactions across 146 learning objectives, along with 279 surveys and 23 interviews, we examined the system's impact on learning outcomes and engagement. Across all midterms, feedback types showed no statistically significant performance differences, though some trends suggested potential benefits. The most substantial impact came from the required confidence ratings and explanations, which students reported transferring to their actual exam strategies. About 40 percent of students engaged with textbook references when prompted by feedback -- far higher than traditional reading rates. Survey data revealed high satisfaction (mean rating 4.1 of 5), with 82.1 percent reporting increased confidence on practiced midterm topics and 73.4 percent indicating they could recall and apply specific concepts. Our findings suggest that embedding structured reflection requirements may be more impactful than sophisticated feedback mechanisms.
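The feedback step the abstract describes maps naturally onto a single chat-completion call. Below is a minimal sketch, assuming hypothetical field names (question, student_answer, explanation, confidence, textbook_refs) and a plain-text prompt; the paper's actual prompt and schema are not published in this summary.

```python
# Minimal sketch of the feedback step described above. The prompt structure,
# field names, and textbook-reference lookup are assumptions, not the authors'
# published implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_feedback(question: str, student_answer: str,
                      explanation: str, confidence: int,
                      textbook_refs: list[str]) -> str:
    """Ask GPT-4o for feedback grounded in the student's own reasoning."""
    prompt = (
        f"Exam question: {question}\n"
        f"Student answer: {student_answer}\n"
        f"Student's explanation of their reasoning: {explanation}\n"
        f"Student's self-reported confidence (1-5): {confidence}\n"
        f"Relevant textbook sections: {', '.join(textbook_refs)}\n\n"
        "Give personalized feedback: address the explanation directly, "
        "note whether the confidence level was warranted, and point the "
        "student to the listed textbook sections for review."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Note that in this reading, `explanation` and `confidence` are required inputs rather than optional ones; that structured-reflection requirement is what the abstract credits with the largest impact on student behavior.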
Related papers
- Teaching Language Models To Gather Information Proactively [53.85419549904644]
Large language models (LLMs) are increasingly expected to function as collaborative partners. In this work, we introduce a new task paradigm: proactive information gathering. We design a scalable framework that generates partially specified, real-world tasks, masking key information. Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information.
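One way to read that reward: score a question by how much previously unstated user information its answer reveals. A toy sketch follows; the fact extractor here is a deliberately crude stand-in, since the paper's actual reward model is not specified in this summary.

```python
# Toy sketch of a reward that favors questions eliciting genuinely new user
# information. extract_facts() is a hypothetical stand-in for whatever
# extractor or LLM judge the actual method uses.
def extract_facts(text: str) -> set[str]:
    # Crude illustrative stand-in: treat each normalized clause as a "fact".
    return {c.strip().lower() for c in text.replace("\n", ". ").split(".") if c.strip()}


def information_gain_reward(dialogue_so_far: str, user_reply: str) -> float:
    """Reward only the information the user reveals for the first time."""
    known = extract_facts(dialogue_so_far)
    revealed = extract_facts(user_reply)
    new_facts = revealed - known  # credit only genuinely new information
    return float(len(new_facts))
```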
arXiv Detail & Related papers (2025-07-28T23:50:09Z)
- Existing Large Language Model Unlearning Evaluations Are Inconclusive [105.55899615056573]
We show that some evaluations introduce substantial new information into the model, potentially masking true unlearning performance. We demonstrate that evaluation outcomes vary significantly across tasks, undermining the generalizability of current evaluation routines. We propose two principles for future unlearning evaluations: minimal information injection and downstream task awareness.
arXiv Detail & Related papers (2025-05-31T19:43:00Z)
- The Potential of Answer Classes in Large-scale Written Computer-Science Exams -- Vol. 2 [0.0]
In teacher training for secondary education, assessment guidelines are mandatory for every exam. We apply this concept to a university exam with 462 students and 41 tasks. For each task, instructors developed answer classes -- classes of expected responses.
arXiv Detail & Related papers (2024-12-12T10:20:39Z)
- Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants [176.39275404745098]
We evaluate whether two AI assistants, GPT-3.5 and GPT-4, can adequately answer assessment questions. GPT-4 answers an average of 65.8% of questions correctly, and can even produce the correct answer across at least one prompting strategy for 85.1% of questions. Our results call for revising program-level assessment design in higher education in light of advances in generative AI.
arXiv Detail & Related papers (2024-08-07T12:11:49Z)
- "I understand why I got this grade": Automatic Short Answer Grading with Feedback [33.63970664152288]
We introduce Engineering Short Answer Feedback (EngSAF), a dataset designed for automatic short-answer grading with feedback. We incorporate feedback into our dataset by leveraging the generative capabilities of state-of-the-art large language models (LLMs) using our Label-Aware Synthetic Feedback Generation (LASFG) strategy. The best-performing model (Mistral-7B) achieves an overall accuracy of 75.4% and 58.7% on unseen-answer and unseen-question test sets, respectively.
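One plausible reading of the LASFG summary is that feedback generation is conditioned on the gold grade so the synthetic feedback must agree with the label. A hedged sketch of that idea follows; the prompt wording, label vocabulary, and choice of model are illustrative assumptions, not the EngSAF authors' published setup.

```python
# Sketch of label-aware synthetic feedback generation in the spirit of LASFG.
# Prompt text and model choice are illustrative assumptions.
from openai import OpenAI

client = OpenAI()


def synthesize_feedback(question: str, reference_answer: str,
                        student_answer: str, label: str) -> str:
    """Generate feedback constrained to be consistent with the gold label."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Student answer: {student_answer}\n"
        f"Gold grade for this answer: {label}\n\n"
        "Write brief feedback for the student that justifies this grade: "
        "state what is correct, what is missing or wrong, and why."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```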
arXiv Detail & Related papers (2024-06-30T15:42:18Z)
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of turns in a conversation has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z)
- Automated Assessment of Encouragement and Warmth in Classrooms Leveraging Multimodal Emotional Features and ChatGPT [7.273857543125784]
Our work explores a multimodal approach to automatically estimating encouragement and warmth in classrooms.
We employed facial and speech emotion recognition with sentiment analysis to extract interpretable features from video, audio, and transcript data.
We demonstrated our approach on the GTI dataset, comprising 367 16-minute video segments from 92 authentic lesson recordings.
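As summarized, the approach extracts interpretable per-modality features and maps them to classroom ratings. A minimal late-fusion sketch follows; the feature names, dimensions, and the plain ridge regressor are illustrative assumptions rather than the paper's actual pipeline.

```python
# Sketch of late fusion over interpretable multimodal features. Feature
# shapes and the ridge regressor are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Ridge


def fuse_features(face_emotions: np.ndarray,        # per-segment emotion scores from video
                  speech_emotions: np.ndarray,      # per-segment emotion scores from audio
                  transcript_sentiment: np.ndarray  # per-segment sentiment from transcripts
                  ) -> np.ndarray:
    """Concatenate per-segment features from the three modalities."""
    return np.concatenate([face_emotions, speech_emotions, transcript_sentiment], axis=1)


# X: one row of fused features per 16-minute segment (367 in the GTI dataset);
# y: placeholder warmth ratings -- real labels come from trained human observers.
rng = np.random.default_rng(0)
X = fuse_features(rng.random((367, 7)), rng.random((367, 7)), rng.random((367, 2)))
y = rng.random(367)
model = Ridge().fit(X, y)
```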
arXiv Detail & Related papers (2024-04-01T16:58:09Z)
- Investigating Fairness Disparities in Peer Review: A Language Model Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs).
We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) conference from 2017 to date.
We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, and author and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z)
- The Study of Peer Assessment Impact on Group Learning Activities [2.045495982086173]
This research work provides a complete and systematic review intended to improve the practice and quality of the peer assessment process.
The results show that 37% of students would choose individual work over group work if given the choice.
In the case study, 82.1% of the 28 students enjoyed working in a group using Facebook as a communication tool.
arXiv Detail & Related papers (2022-01-07T06:54:18Z)
- SIFN: A Sentiment-aware Interactive Fusion Network for Review-based Item Recommendation [48.1799451277808]
We propose a Sentiment-aware Interactive Fusion Network (SIFN) for review-based item recommendation.
We first encode user/item reviews via BERT and propose a lightweight sentiment learner to extract semantic features of each review.
Then, we propose a sentiment prediction task that guides the sentiment learner to extract sentiment-aware features via explicit sentiment labels.
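As summarized, the architecture pairs a BERT review encoder with a small sentiment head trained against explicit sentiment labels as an auxiliary task. A minimal PyTorch sketch in that spirit follows; layer sizes, the label set, and the single-head design are assumptions, not the SIFN authors' exact network.

```python
# Minimal sketch of a BERT encoder plus a lightweight sentiment learner with
# an auxiliary sentiment-prediction loss, in the spirit of SIFN. Dimensions
# and the two-class label set are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class ReviewEncoder(nn.Module):
    def __init__(self, num_sentiments: int = 2):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        self.sentiment_head = nn.Linear(hidden, num_sentiments)  # lightweight learner

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] vector as the review feature
        return cls, self.sentiment_head(cls)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = ReviewEncoder()
batch = tokenizer(["great battery life", "arrived broken"],
                  padding=True, return_tensors="pt")
features, logits = model(batch["input_ids"], batch["attention_mask"])
# The auxiliary sentiment loss guides the features to be sentiment-aware;
# the review features would then feed a downstream recommendation module.
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1, 0]))
```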
arXiv Detail & Related papers (2021-08-18T08:04:38Z)
- Supporting Cognitive and Emotional Empathic Writing of Students [15.125096968712063]
We present an annotation approach to capturing emotional and cognitive empathy in student-written peer reviews on business models in German.
We trained predictive models to detect the annotated empathy structures and embedded them in an adaptive writing support system.
We evaluated our tool in a peer learning exercise with 58 students and found promising results for perceived empathy skill learning, perceived feedback accuracy, and intention to use.
arXiv Detail & Related papers (2021-05-31T09:18:50Z)
- Deep Discourse Analysis for Generating Personalized Feedback in Intelligent Tutor Systems [4.716555240531893]
We explore creating automated, personalized feedback in an intelligent tutoring system (ITS).
Our goal is to pinpoint correct and incorrect concepts in student answers in order to achieve better student learning gains.
arXiv Detail & Related papers (2021-03-13T20:33:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.