Reflecting in the Reflection: Integrating a Socratic Questioning Framework into Automated AI-Based Question Generation
- URL: http://arxiv.org/abs/2601.14798v1
- Date: Wed, 21 Jan 2026 09:23:11 GMT
- Title: Reflecting in the Reflection: Integrating a Socratic Questioning Framework into Automated AI-Based Question Generation
- Authors: Ondřej Holub, Essi Ryymin, Rodrigo Alves
- Abstract summary: This paper introduces a reflection-in-reflection framework for automated generation of reflection questions with large language models (LLMs). Our approach coordinates two role-specialized agents, a Student-Teacher and a Teacher-Educator, that engage in a Socratic multi-turn dialogue to iteratively refine a single question. We show that our two-agent protocol produces questions that are judged substantially more relevant, deeper, and better overall than a one-shot baseline.
- Score: 10.05797775116765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Designing good reflection questions is pedagogically important but time-consuming and unevenly supported across teachers. This paper introduces a reflection-in-reflection framework for automated generation of reflection questions with large language models (LLMs). Our approach coordinates two role-specialized agents, a Student-Teacher and a Teacher-Educator, that engage in a Socratic multi-turn dialogue to iteratively refine a single question given a teacher-specified topic, key concepts, student level, and optional instructional materials. The Student-Teacher proposes candidate questions with brief rationales, while the Teacher-Educator evaluates them along clarity, depth, relevance, engagement, and conceptual interconnections, responding only with targeted coaching questions or a fixed signal to stop the dialogue. We evaluate the framework in an authentic lower-secondary ICT setting on the topic, using GPT-4o-mini as the backbone model and a stronger GPT-4-class LLM as an external evaluator in pairwise comparisons of clarity, relevance, depth, and overall quality. First, we study how interaction design and context (dynamic vs. fixed iteration counts; presence or absence of student level and materials) affect question quality. Dynamic stopping combined with contextual information consistently outperforms fixed 5- or 10-step refinement, with very long dialogues prone to drift or over-complication. Second, we show that our two-agent protocol produces questions that are judged substantially more relevant, deeper, and better overall than a one-shot baseline using the same backbone model.
Related papers
- EduDial: Constructing a Large-scale Multi-turn Teacher-Student Dialogue Corpus
We present EduDial, a comprehensive multi-turn teacher-student dialogue dataset. EduDial covers 345 core knowledge points and consists of 34,250 dialogue sessions generated through interactions between teacher and student agents.
arXiv Detail & Related papers (2025-10-14T18:18:43Z)
- How Real Is AI Tutoring? Comparing Simulated and Human Dialogues in One-on-One Instruction
This study systematically investigates the structural and behavioral differences between AI-simulated and authentic human tutoring dialogues. Results show that human dialogues are significantly superior to their AI counterparts in utterance length, as well as in questioning (I-Q) and general feedback (F-F) behaviors.
arXiv Detail & Related papers (2025-09-02T03:18:39Z)
- Automatic Question & Answer Generation Using Generative Large Language Model (LLM)
This research proposes to leverage unsupervised learning methods in NLP, primarily focusing on the English language. A customized model will offer efficient solutions for educators, instructors, and individuals engaged in text-based evaluations.
arXiv Detail & Related papers (2025-08-26T23:36:13Z)
- SimInstruct: A Responsible Tool for Collecting Scaffolding Dialogues Between Experts and LLM-Simulated Novices
SimInstruct is a scalable, expert-in-the-loop tool for collecting scaffolding dialogues. Using teaching development coaching as an example domain, SimInstruct simulates novice instructors via LLMs. Our results reveal that persona traits, such as extroversion and introversion, meaningfully influence how experts engage.
arXiv Detail & Related papers (2025-08-06T13:16:10Z)
- "Did my figure do justice to the answer?": Towards Multimodal Short Answer Grading with Feedback (MMSAF)
We propose the Multimodal Short Answer grading with Feedback (MMSAF) problem along with a dataset of 2,197 data points. As per our evaluations, existing Multimodal Large Language Models (MLLMs) could predict whether an answer is correct, incorrect, or partially correct with an accuracy of 55%. Similarly, they could predict whether the image provided in the student's answer is relevant or not with an accuracy of 75%.
arXiv Detail & Related papers (2024-12-27T17:33:39Z)
- "My Grade is Wrong!": A Contestable AI Framework for Interactive Feedback in Evaluating Student Essays
This paper introduces CAELF, a Contestable AI Empowered LLM Framework for automating interactive feedback.
CAELF allows students to query, challenge, and clarify their feedback by integrating a multi-agent system with computational argumentation.
A case study on 500 critical thinking essays with user studies demonstrates that CAELF significantly improves interactive feedback.
arXiv Detail & Related papers (2024-09-11T17:59:01Z)
- Automated Distractor and Feedback Generation for Math Multiple-choice Questions via In-context Learning
Multiple-choice questions (MCQs) are ubiquitous in almost all levels of education since they are easy to administer and grade, and are a reliable form of assessment.
To date, the task of crafting high-quality distractors has largely remained a labor-intensive process for teachers and learning content designers.
We propose a simple, in-context learning-based solution for automated distractor and corresponding feedback message generation.
arXiv Detail & Related papers (2023-08-07T01:03:04Z)
- Covering Uncommon Ground: Gap-Focused Question Generation for Answer Assessment
We focus on the problem of generating such gap-focused questions (GFQs) automatically.
We define the task, highlight key desired aspects of a good GFQ, and propose a model that satisfies these.
arXiv Detail & Related papers (2023-07-06T22:21:42Z)
- FCC: Fusing Conversation History and Candidate Provenance for Contextual Response Ranking in Dialogue Systems
We present a flexible neural framework that can integrate contextual information from multiple channels.
We evaluate our model on the MSDialog dataset widely used for evaluating conversational response ranking tasks.
arXiv Detail & Related papers (2023-03-31T23:58:28Z)
- Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue
Embodied dialogue instruction following requires an agent to complete a complex sequence of tasks from a natural language exchange.
This paper argues that imitation learning (IL) and related low-level metrics are actually misleading and do not align with the goals of embodied dialogue research.
arXiv Detail & Related papers (2022-10-10T05:51:40Z)
- Learning an Effective Context-Response Matching Model with Self-Supervised Tasks for Retrieval-based Dialogues
We introduce four self-supervised tasks including next session prediction, utterance restoration, incoherence detection and consistency discrimination.
We jointly train the PLM-based response selection model with these auxiliary tasks in a multi-task manner.
Experiment results indicate that the proposed auxiliary self-supervised tasks bring significant improvement for multi-turn response selection.
arXiv Detail & Related papers (2020-09-14T08:44:46Z)
- Neural Multi-Task Learning for Teacher Question Detection in Online Classrooms
We build an end-to-end neural framework that automatically detects questions from teachers' audio recordings.
By incorporating multi-task learning techniques, we are able to strengthen the understanding of semantic relations among different types of questions.
arXiv Detail & Related papers (2020-05-16T02:17:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.