ELI-Why: Evaluating the Pedagogical Utility of Language Model Explanations
- URL: http://arxiv.org/abs/2506.14200v1
- Date: Tue, 17 Jun 2025 05:36:39 GMT
- Title: ELI-Why: Evaluating the Pedagogical Utility of Language Model Explanations
- Authors: Brihi Joshi, Keyu He, Sahana Ramnath, Sadra Sabouri, Kaitlyn Zhou, Souti Chattopadhyay, Swabha Swayamdipta, Xiang Ren
- Abstract summary: We introduce ELI-Why, a benchmark of 13.4K "Why" questions to evaluate the pedagogical capabilities of language models. In our first study, human raters assume the role of an "educator" to assess model explanations' fit to different educational grades. We find that GPT-4-generated explanations match their intended educational background only 50% of the time, compared to 79% for lay human-curated explanations.
- Score: 38.73656006445607
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models today are widely used in education, yet their ability to tailor responses for learners with varied informational needs and knowledge backgrounds remains under-explored. To this end, we introduce ELI-Why, a benchmark of 13.4K "Why" questions to evaluate the pedagogical capabilities of language models. We then conduct two extensive human studies to assess the utility of language model-generated explanatory answers (explanations) on our benchmark, tailored to three distinct educational grades: elementary, high-school and graduate school. In our first study, human raters assume the role of an "educator" to assess model explanations' fit to different educational grades. We find that GPT-4-generated explanations match their intended educational background only 50% of the time, compared to 79% for lay human-curated explanations. In our second study, human raters assume the role of a learner to assess if an explanation fits their own informational needs. Across all educational backgrounds, users deemed GPT-4-generated explanations 20% less suited on average to their informational needs, when compared to explanations curated by lay people. Additionally, automated evaluation metrics reveal that explanations generated across different language model families for different informational needs remain indistinguishable in their grade-level, limiting their pedagogical effectiveness.
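As a rough illustration of the automated grade-level analysis mentioned in the abstract, the sketch below scores explanations intended for different audiences with a Flesch-Kincaid readability proxy (via the textstat package). The example texts and the choice of metric are assumptions for illustration, not the paper's actual setup; overlapping scores across intended audiences would correspond to the "indistinguishable in grade-level" finding described above.

```python
# A minimal sketch, assuming a Flesch-Kincaid readability proxy and
# made-up example explanations; the paper's actual automated metrics
# and prompts are not reproduced here.
from statistics import mean

import textstat  # pip install textstat

# Hypothetical model-generated explanations, keyed by intended grade level.
explanations = {
    "elementary": [
        "The sky looks blue because sunlight bounces off tiny bits of air."
    ],
    "high_school": [
        "Air molecules scatter shorter (blue) wavelengths of sunlight more "
        "strongly than longer ones, a process called Rayleigh scattering."
    ],
    "graduate": [
        "Rayleigh scattering intensity scales with the inverse fourth power "
        "of wavelength, so shorter wavelengths dominate diffuse skylight."
    ],
}

# If the mean readability grades overlap across intended audiences, the
# explanations are indistinguishable in grade-level.
for grade, texts in explanations.items():
    scores = [textstat.flesch_kincaid_grade(t) for t in texts]
    print(f"{grade:>12}: mean Flesch-Kincaid grade = {mean(scores):.1f}")
```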
Related papers
- Benchmarking the Pedagogical Knowledge of Large Language Models [4.417539128489408]
This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their pedagogical knowledge. These benchmarks are built on a carefully curated set of questions sourced from professional development exams for teachers. We report results for 97 models, with accuracies spanning a range from 28% to 89% on the pedagogical knowledge questions.
arXiv Detail & Related papers (2025-06-23T14:49:01Z)
- Scenarios and Approaches for Situated Natural Language Explanations [18.022428746019582]
We collect a benchmarking dataset, Situation-Based Explanation.
This dataset contains 100 explanandums.
For each "explanandum paired with an audience" situation, we include a human-written explanation.
We examine three categories of prompting methods: rule-based prompting, meta-prompting, and in-context learning prompting.
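The three prompting categories named above can be contrasted with a small sketch. The templates, explanandum, and audience below are illustrative assumptions, not the paper's actual prompts.

```python
# Illustrative templates for the three prompting categories; the wording,
# explanandum, and audience are assumptions, not the paper's prompts.
explanandum = "Why does ice float on water?"
audience = "a high-school student"

# Rule-based prompting: hard-coded instructions on how to explain.
rule_based = (
    f"Explain to {audience}: {explanandum}\n"
    "Rules: use short sentences, avoid jargon, include one everyday analogy."
)

# Meta-prompting: have the model reason about the audience first,
# then condition the explanation on that reasoning.
meta_prompt = (
    f"First, describe what {audience} likely already knows about this topic. "
    f"Then, using that description, explain: {explanandum}"
)

# In-context learning prompting: prepend audience-matched exemplars.
exemplar = (
    "Q (for a high-school student): Why is the sky blue?\n"
    "A: Air scatters blue light more than red light, so blue light reaches "
    "your eyes from every direction in the sky.\n\n"
)
in_context = exemplar + f"Q (for {audience}): {explanandum}\nA:"

for name, prompt in [
    ("rule-based", rule_based),
    ("meta-prompting", meta_prompt),
    ("in-context learning", in_context),
]:
    print(f"--- {name} ---\n{prompt}\n")
```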
arXiv Detail & Related papers (2024-06-07T15:56:32Z)
- Evaluating and Optimizing Educational Content with Large Language Model Judgments [52.33701672559594]
We use Language Models (LMs) as educational experts to assess the impact of various instructions on learning outcomes.
We introduce an instruction optimization approach in which one LM generates instructional materials using the judgments of another LM as a reward function.
Human teachers' evaluations of these LM-generated worksheets show a significant alignment between the LM judgments and human teacher preferences.
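A minimal sketch of the judgment-as-reward idea described above, assuming two generic callables standing in for the generator LM and the judge LM (neither reflects the paper's actual implementation):

```python
# A minimal sketch, not the paper's implementation: rank candidate
# instructions by how highly a judge LM rates the materials a generator
# LM produces from them, using the judge's rating as the reward.
from typing import Callable, Sequence


def rank_instructions(
    candidates: Sequence[str],
    generate_materials: Callable[[str], str],  # generator LM (hypothetical)
    judge_score: Callable[[str], float],       # judge LM as reward (hypothetical)
) -> str:
    """Return the candidate instruction whose generated materials score highest."""
    scored = []
    for instruction in candidates:
        materials = generate_materials(instruction)
        scored.append((judge_score(materials), instruction))
    scored.sort(reverse=True)
    return scored[0][1]
```

In practice the candidate instructions would themselves be proposed and revised by an LM over multiple rounds; the ranking step above is only the core of using one model's judgments as a reward signal for another.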
arXiv Detail & Related papers (2024-03-05T09:09:15Z)
- Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development.
To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps.
These findings suggest caution about the usefulness of saliency-based explanations and their potential to be misunderstood.
arXiv Detail & Related papers (2023-12-10T23:13:23Z)
- Assertion Enhanced Few-Shot Learning: Instructive Technique for Large Language Models to Generate Educational Explanations [0.0]
Human educators possess an intrinsic ability to anticipate and seek educational explanations from students.
We aim to imbue Intelligent Tutoring Systems with this ability using few-shot learning capability of Large Language Models.
arXiv Detail & Related papers (2023-12-05T20:41:34Z)
- Exploring Iterative Enhancement for Improving Learnersourced Multiple-Choice Question Explanations with Large Language Models [22.376741676039398]
We present and evaluate a framework called "ILearner-LLM" to scaffold the task of automated explanation generation. The framework generates high-quality student-aligned explanations by iteratively feeding the quality rating score from the evaluation model back into the instruction prompt. Our findings represent a promising path to enrich the learnersourcing experience for students.
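A minimal sketch of the iterative feedback loop described above, assuming hypothetical `generate` and `rate_quality` callables in place of the paper's generation and evaluation models; prompts and thresholds are illustrative.

```python
# A minimal sketch of the iterative feedback idea; `generate` and
# `rate_quality` are hypothetical stand-ins, and the prompt wording
# and score threshold are assumptions.
from typing import Callable


def iterative_explanation(
    question: str,
    generate: Callable[[str], str],
    rate_quality: Callable[[str], float],
    max_iters: int = 3,
    target_score: float = 4.5,
) -> str:
    prompt = f"Write an explanation of the correct answer for this question:\n{question}"
    explanation = generate(prompt)
    for _ in range(max_iters):
        score = rate_quality(explanation)
        if score >= target_score:
            break
        # Feed the quality rating back into the instruction prompt and retry.
        prompt = (
            f"Your previous explanation was rated {score:.1f}/5 for quality. "
            f"Improve it.\nQuestion:\n{question}\n"
            f"Previous explanation:\n{explanation}"
        )
        explanation = generate(prompt)
    return explanation
```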
arXiv Detail & Related papers (2023-09-19T09:04:15Z)
- Assessing the efficacy of large language models in generating accurate teacher responses [0.5774786149181391]
This study attempts to assess the generative abilities of large language models in providing informative and helpful insights to students.
We present an extensive evaluation of several benchmarking generative models, including GPT-4 (few-shot, in-context learning), fine-tuned GPT-2, and fine-tuned DialoGPT.
Our experimental findings on the Teacher-Student Chatroom subset indicate the efficacy of GPT-4 over other fine-tuned models, measured using BERTScore and DialogRPT.
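For reference, scoring a generated teacher response against a reference response with BERTScore can be done with the bert-score package, as in the sketch below; the candidate and reference strings are placeholders, and the DialogRPT ranking step is omitted.

```python
# A minimal sketch of BERTScore-based scoring with the bert-score package;
# the candidate and reference strings are placeholders.
from bert_score import score  # pip install bert-score

candidates = [
    "Great question! Check whether the verb agrees with its subject."
]
references = [
    "Good thinking. Remember that the verb must agree with the subject of the sentence."
]

P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```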
arXiv Detail & Related papers (2023-07-09T22:32:46Z)
- Explanations from Large Language Models Make Small Reasoners Better [61.991772773700006]
We show that our method can consistently and significantly outperform finetuning baselines across different settings.
As a side benefit, human evaluation shows that our method can generate high-quality explanations to justify its predictions.
arXiv Detail & Related papers (2022-10-13T04:50:02Z)
- Human Interpretation of Saliency-based Explanation Over Text [65.29015910991261]
We study saliency-based explanations over textual data.
We find that people often misinterpret the explanations.
We propose a method to adjust saliencies based on model estimates of over- and under-perception.
arXiv Detail & Related papers (2022-01-27T15:20:32Z)
- Evaluating Explanations: How much do explanations from the teacher aid students? [103.05037537415811]
We formalize the value of explanations using a student-teacher paradigm that measures the extent to which explanations improve student models in learning.
Unlike many prior proposals to evaluate explanations, our approach cannot be easily gamed, enabling principled, scalable, and automatic evaluation of attributions.
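A minimal sketch of this student-teacher measurement, assuming hypothetical `train_student` and `evaluate` callables: the value of the teacher's explanations is the accuracy gain of a student trained with them over one trained without.

```python
# A minimal sketch of the student-teacher measurement; `train_student`
# and `evaluate` are hypothetical stand-ins, and the way explanations
# are attached to inputs is illustrative.
from typing import Callable, Sequence


def explanation_value(
    train_inputs: Sequence[str],
    train_labels: Sequence[int],
    teacher_explanations: Sequence[str],
    train_student: Callable[[Sequence[str], Sequence[int]], object],
    evaluate: Callable[[object], float],  # accuracy on a held-out test set
) -> float:
    """Accuracy gain of a student trained with teacher explanations over one without."""
    student_plain = train_student(train_inputs, train_labels)
    augmented = [
        f"{x}\nTeacher explanation: {e}"
        for x, e in zip(train_inputs, teacher_explanations)
    ]
    student_explained = train_student(augmented, train_labels)
    return evaluate(student_explained) - evaluate(student_plain)
```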
arXiv Detail & Related papers (2020-12-01T23:40:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.