LLM Prompt Evaluation for Educational Applications
- URL: http://arxiv.org/abs/2601.16134v1
- Date: Thu, 22 Jan 2026 17:31:25 GMT
- Title: LLM Prompt Evaluation for Educational Applications
- Authors: Langdon Holmes, Adam Coscia, Scott Crossley, Joon Suh Choi, Wesley Morris,
- Abstract summary: Large language models (LLMs) are increasingly common in educational applications. There is a growing need for evidence-based methods to design and evaluate LLM prompts. This study presents a generalizable, systematic approach for evaluating prompts.
- Score: 2.1883807277376754
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) become increasingly common in educational applications, there is a growing need for evidence-based methods to design and evaluate LLM prompts that produce personalized and pedagogically aligned outputs. This study presents a generalizable, systematic approach for evaluating prompts, demonstrated through an analysis of LLM-generated follow-up questions in a structured dialogue activity. Six prompt templates were designed and tested. The templates incorporated established prompt engineering patterns, with each prompt emphasizing distinct pedagogical strategies. The prompt templates were compared through a tournament-style evaluation framework that can be adapted for other educational applications. The tournament employed the Glicko2 rating system with eight judges evaluating question pairs across three dimensions: format, dialogue support, and appropriateness for learners. Data was sourced from 120 authentic user interactions across three distinct educational deployments. Results showed that a single prompt related to strategic reading outperformed other templates with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies such as self-directed learning. The methodology showcases how educational technology researchers can systematically evaluate and improve prompt designs, moving beyond ad-hoc prompt engineering toward evidence-based prompt development for educational applications.
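To make the tournament mechanics concrete, below is a minimal Python sketch of tournament-style pairwise rating. It uses the simpler Glicko-1 update rather than the paper's Glicko-2 (which additionally tracks a volatility parameter), and the template names, judge outcomes, and starting values (rating 1500, deviation 350) are illustrative assumptions rather than the study's data.

```python
import math

Q = math.log(10) / 400  # Glicko scaling constant

def g(rd):
    # Attenuation factor: discounts games against uncertain opponents.
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)

def expected(r, r_j, rd_j):
    # Expected score of a prompt rated r against an opponent (r_j, rd_j).
    return 1.0 / (1.0 + 10.0 ** (-g(rd_j) * (r - r_j) / 400.0))

def update(r, rd, results):
    # One Glicko-1 update; results is a list of
    # (opponent_rating, opponent_rd, score) with score 1.0 = win, 0.0 = loss.
    d2_inv = Q**2 * sum(
        g(rd_j) ** 2 * expected(r, r_j, rd_j) * (1.0 - expected(r, r_j, rd_j))
        for r_j, rd_j, _ in results
    )
    denom = 1.0 / rd**2 + d2_inv
    r_new = r + (Q / denom) * sum(
        g(rd_j) * (s - expected(r, r_j, rd_j)) for r_j, rd_j, s in results
    )
    return r_new, math.sqrt(1.0 / denom)

# Hypothetical tournament over six prompt templates; each judge verdict on a
# question pair is reduced to a win/loss between the generating templates.
ratings = {f"template_{i}": (1500.0, 350.0) for i in range(1, 7)}
matches = [("template_1", "template_2", 1.0),
           ("template_1", "template_3", 1.0),
           ("template_2", "template_3", 0.0)]  # made-up outcomes

for a, b, score_a in matches:  # sequential updates for simplicity
    (ra, rda), (rb, rdb) = ratings[a], ratings[b]
    ratings[a] = update(ra, rda, [(rb, rdb, score_a)])
    ratings[b] = update(rb, rdb, [(ra, rda, 1.0 - score_a)])

# Pairwise win probability implied by the final ratings:
(ra, rda), (rb, rdb) = ratings["template_1"], ratings["template_2"]
p = 1.0 / (1.0 + 10.0 ** (-g(math.sqrt(rda**2 + rdb**2)) * (ra - rb) / 400.0))
print(f"P(template_1 beats template_2) = {p:.2f}")
```

Win probabilities such as the 81% to 100% range reported in the abstract can then be read off from the final ratings, as in the last two lines.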
Related papers
- EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education [11.130206904690745]
We introduce EduEval, a comprehensive hierarchical benchmark for evaluating large language models (LLMs) in Chinese K-12 education. EduEval comprises 24 distinct task types with over 11,000 questions spanning primary to high school levels. We evaluate 14 leading LLMs under both zero-shot and few-shot settings, revealing that while models perform well on factual tasks, they struggle with classroom dialogue classification and exhibit inconsistent results in creative content generation.
arXiv Detail & Related papers (2025-11-29T03:09:50Z)
- Can Large Language Models Help Students Prove Software Correctness? An Experimental Study with Dafny [75.55915044740566]
Students in computing education increasingly use large language models (LLMs) such as ChatGPT. This paper investigates how students interact with an LLM when solving formal verification exercises in Dafny.
arXiv Detail & Related papers (2025-06-27T16:34:13Z)
- An Empirical Study of Federated Prompt Learning for Vision Language Model [89.2963764404892]
This paper systematically investigates the behavioral differences between language prompt learning and vision prompt learning (VPT) in vision language models (VLMs). We evaluate the impact of various federated learning (FL) and prompt configurations, such as client scale, aggregation strategies, and prompt length, to assess the robustness of Federated Prompt Learning (FPL).
arXiv Detail & Related papers (2025-05-29T03:09:15Z)
- CoTAL: Human-in-the-Loop Prompt Engineering for Generalizable Formative Assessment Scoring [2.249916681499244]
Chain-of-Thought Prompting + Active Learning (CoTAL) is an Evidence-Centered Design (ECD)-based approach to formative assessment scoring. Our findings demonstrate that CoTAL improves GPT-4's scoring performance across domains.
arXiv Detail & Related papers (2025-04-03T06:53:34Z)
- From Prompts to Templates: A Systematic Prompt Template Analysis for Real-world LLMapps [20.549178260624043]
Large Language Models (LLMs) have revolutionized human-AI interaction by enabling intuitive task execution through natural language prompts. Small variations in structure or wording can result in substantial differences in output. This paper presents a comprehensive analysis of prompt templates in practical LLMapps.
arXiv Detail & Related papers (2025-04-02T18:20:06Z)
- Use Me Wisely: AI-Driven Assessment for LLM Prompting Skills Development [5.559706293891474]
Large language model (LLM)-powered chatbots have become popular across various domains, supporting a range of tasks and processes. Yet, prompting is highly task- and domain-dependent, limiting the effectiveness of generic approaches. In this study, we explore whether LLM-based methods can facilitate learning assessments by using ad-hoc guidelines and a minimal number of annotated prompt samples.
arXiv Detail & Related papers (2025-03-04T11:56:33Z)
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- Exploring Knowledge Tracing in Tutor-Student Dialogues using LLMs [49.18567856499736]
We investigate whether large language models (LLMs) can be supportive of open-ended dialogue tutoring. We apply a range of knowledge tracing (KT) methods on the resulting labeled data to track student knowledge levels over an entire dialogue. We conduct experiments on two tutoring dialogue datasets, and show that a novel yet simple LLM-based method, LLMKT, significantly outperforms existing KT methods in predicting student response correctness in dialogues.
arXiv Detail & Related papers (2024-09-24T22:31:39Z)
- Thinking Fair and Slow: On the Efficacy of Structured Prompts for Debiasing Language Models [14.405446719317291]
Existing debiasing techniques are typically training-based or require access to the model's internals and output distributions.
We evaluate a comprehensive, end-user-focused iterative debiasing framework that applies System 2 thinking processes to prompts to induce logical, reflective, and critical text generation.
arXiv Detail & Related papers (2024-05-16T20:27:58Z)
- Efficient Prompting Methods for Large Language Models: A Survey [50.82812214830023]
Efficient Prompting Methods have attracted a wide range of attention. We discuss Automatic Prompt Engineering for different prompt components and Prompt Compression in continuous and discrete spaces.
arXiv Detail & Related papers (2024-04-01T12:19:08Z)
- Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z)
- Re-Reading Improves Reasoning in Large Language Models [87.46256176508376]
We introduce a simple, yet general and effective prompting method, Re2, to enhance the reasoning capabilities of off-the-shelf Large Language Models (LLMs).
Unlike most thought-eliciting prompting methods, such as Chain-of-Thought (CoT), Re2 shifts the focus to the input by processing questions twice, thereby enhancing the understanding process.
We evaluate Re2 on extensive reasoning benchmarks across 14 datasets, spanning 112 experiments, to validate its effectiveness and generality.
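Because Re2 operates purely on the prompt text, it is easy to sketch. The Python snippet below follows the summary's description (present the question, have the model read it again, then answer); the exact wording of the paper's template may differ, and the step-by-step trigger on the last line is an illustrative addition.

```python
def re2_prompt(question: str) -> str:
    # Re2-style prompt: the question appears twice so the model
    # "re-reads" the input before answering (illustrative wording).
    return (
        f"Q: {question}\n"
        f"Read the question again: {question}\n"
        "A: Let's think step by step."
    )

print(re2_prompt("A bat and a ball cost $1.10 in total. The bat costs "
                 "$1.00 more than the ball. How much does the ball cost?"))
```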
arXiv Detail & Related papers (2023-09-12T14:36:23Z)
- TEMPERA: Test-Time Prompting via Reinforcement Learning [57.48657629588436]
We propose Test-time Prompt Editing using Reinforcement Learning (TEMPERA).
In contrast to prior prompt generation methods, TEMPERA can efficiently leverage prior knowledge.
Our method achieves an average 5.33x improvement in sample efficiency compared to traditional fine-tuning methods.
arXiv Detail & Related papers (2022-11-21T22:38:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.