From Answers to Questions: EQGBench for Evaluating LLMs' Educational Question Generation
- URL: http://arxiv.org/abs/2508.10005v1
- Date: Tue, 05 Aug 2025 14:16:42 GMT
- Title: From Answers to Questions: EQGBench for Evaluating LLMs' Educational Question Generation
- Authors: Chengliang Zhou, Mei Wang, Ting Zhang, Qiannan Zhu, Jian Li, Hua Huang
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities in mathematical problem-solving. We introduce EQGBench, a benchmark specifically designed for evaluating LLMs' performance in Chinese Educational Question Generation. The dataset incorporates user queries with varying knowledge points, difficulty gradients, and question type specifications to simulate realistic educational scenarios.
- Score: 30.57730587890455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in mathematical problem-solving. However, the transition from providing answers to generating high-quality educational questions presents significant challenges that remain underexplored. To advance Educational Question Generation (EQG) and facilitate LLMs in generating pedagogically valuable and educationally effective questions, we introduce EQGBench, a comprehensive benchmark specifically designed for evaluating LLMs' performance in Chinese EQG. EQGBench establishes a five-dimensional evaluation framework supported by a dataset of 900 evaluation samples spanning three fundamental middle school disciplines: mathematics, physics, and chemistry. The dataset incorporates user queries with varying knowledge points, difficulty gradients, and question type specifications to simulate realistic educational scenarios. Through systematic evaluation of 46 mainstream large models, we reveal significant room for development in generating questions that reflect educational value and foster students' comprehensive abilities.
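To make the setup concrete, here is a minimal sketch of what one EQGBench-style evaluation sample and scoring pass could look like. The field names, the five dimension labels, and the `generate`/`score` helpers are illustrative assumptions; the abstract does not enumerate them.

```python
from dataclasses import dataclass

# Hypothetical schema for one EQGBench-style sample; field and dimension
# names are illustrative assumptions, not taken from the paper.
@dataclass
class EQGSample:
    subject: str          # "mathematics" | "physics" | "chemistry"
    knowledge_point: str  # e.g. "linear equations in one variable"
    difficulty: str       # e.g. "easy" | "medium" | "hard"
    question_type: str    # e.g. "multiple-choice" | "open-ended"
    query: str            # the user request handed to the model

# Placeholder labels for a five-dimensional framework.
DIMENSIONS = ["validity", "relevance", "difficulty_fit",
              "educational_value", "clarity"]

def evaluate(model, judge, samples):
    """Generate one question per sample, then score it on each dimension."""
    results = []
    for s in samples:
        question = model.generate(s.query)        # model under evaluation
        scores = {d: judge.score(question, s, d)  # LLM/rubric-based judge
                  for d in DIMENSIONS}
        results.append((s, question, scores))
    return results
```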
Related papers
- MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams [50.293164501645975]
Multimodal large language models (MLLMs) integrate language and visual cues for problem-solving. Current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge. We introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K-12 exams spanning six disciplines.
arXiv Detail & Related papers (2025-08-09T06:21:10Z)
- From Objectives to Questions: A Planning-based Framework for Educational Mathematical Question Generation [32.76585750014007]
We propose the Educational Question Planning with self-Reflection (EQPR) method for educational mathematical question generation. By combining a planning algorithm based on Monte Carlo Tree Search with the generative capabilities of Large Language Models, we continuously optimize questions. We have demonstrated that EQPR achieves significant improvements in generating questions that meet multi-dimensional educational objectives.
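As a rough illustration of planning-guided generation in this spirit, the sketch below replaces full Monte Carlo Tree Search with simple best-of-branch hill climbing; `generate`, `revise`, and `score_against` are hypothetical helpers, not the paper's API.

```python
def eqpr_style_refine(llm, objectives, seed_query, iters=10, branch=3):
    """Greatly simplified stand-in for MCTS-based question planning:
    expand a few candidate revisions per step, keep the best, repeat."""
    best = llm.generate(seed_query)                  # initial question draft
    best_score = llm.score_against(best, objectives)
    for _ in range(iters):
        for _ in range(branch):                      # expand candidate revisions
            cand = llm.revise(best, objectives)      # self-reflective rewrite
            score = llm.score_against(cand, objectives)
            if score > best_score:                   # greedy selection step
                best, best_score = cand, score
    return best
```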
arXiv Detail & Related papers (2025-06-01T11:23:18Z)
- MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models [50.43793764203352]
We introduce MDK12-Bench, a multi-disciplinary benchmark assessing the reasoning capabilities of MLLMs via real-world K-12 examinations. Our benchmark comprises 140K reasoning instances across diverse difficulty levels from primary school to 12th grade. It features 6,827 instance-level knowledge point annotations based on a well-organized knowledge structure, detailed answer explanations, difficulty labels, and cross-year partitions.
arXiv Detail & Related papers (2025-04-08T08:06:53Z)
- AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage.
CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers.
Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
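A compact sketch of such a two-stage pipeline, with a panel of LLM judges standing in for crowdworkers; `clarify` and `rate` are assumed helper methods, not AGENT-CQ's actual interface.

```python
def two_stage_clarify(llm, judge_panel, context, n_candidates=5):
    """Stage 1: generate candidate clarifying questions.
    Stage 2: keep the one preferred by simulated crowd judgments."""
    candidates = [llm.clarify(context) for _ in range(n_candidates)]

    def crowd_score(question):
        # Average ratings from several LLM "judges" (CrowdLLM-style).
        return sum(j.rate(question, context) for j in judge_panel) / len(judge_panel)

    return max(candidates, key=crowd_score)
```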
arXiv Detail & Related papers (2024-10-25T17:06:27Z)
- CJEval: A Benchmark for Assessing Large Language Models Using Chinese Junior High School Exam Data [31.324617466692754]
CJEval is a benchmark based on Chinese Junior High School Exam Evaluations.
It consists of 26,136 samples across four application-level educational tasks covering ten subjects.
By utilizing this benchmark, we assessed LLMs' potential applications and conducted a comprehensive analysis of their performance.
arXiv Detail & Related papers (2024-09-24T16:00:28Z)
- Research on the Application of Large Language Models in Automatic Question Generation: A Case Study of ChatGLM in the Context of High School Information Technology Curriculum [3.0753648264454547]
The model is guided to generate diverse questions, which are then comprehensively evaluated by domain experts.
The results indicate that the questions generated by ChatGLM outperform human-written questions in clarity and in teachers' willingness to use them.
arXiv Detail & Related papers (2024-08-21T11:38:32Z)
- Application of Large Language Models in Automated Question Generation: A Case Study on ChatGLM's Structured Questions for National Teacher Certification Exams [2.7363336723930756]
This study explores the application potential of the large language model (LLM) ChatGLM in the automatic generation of structured questions for National Teacher Certification Exams (NTCE).
We guided ChatGLM to generate a series of simulated questions and comprehensively compared them with questions recalled by past examinees.
The research results indicate that the questions generated by ChatGLM exhibit a level of rationality, scientific soundness, and practicality similar to that of real exam questions.
arXiv Detail & Related papers (2024-08-19T13:32:14Z)
- Automated Educational Question Generation at Different Bloom's Skill Levels using Large Language Models: Strategies and Evaluation [0.0]
We examine the ability of five state-of-the-art large language models to generate diverse and high-quality questions of different cognitive levels.
Our findings suggest that LLMs can generate relevant and high-quality educational questions of different cognitive levels when prompted with adequate information.
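For instance, prompting by cognitive level can be as simple as templating the level into the instruction; the wording below is an illustrative guess, not the study's actual prompt.

```python
BLOOM_LEVELS = ["remember", "understand", "apply",
                "analyze", "evaluate", "create"]

def bloom_prompt(topic: str, level: str) -> str:
    """Build a prompt requesting a question at a given Bloom's level."""
    if level not in BLOOM_LEVELS:
        raise ValueError(f"unknown Bloom's level: {level}")
    return (f"You are an experienced teacher. Write one educational question "
            f"about '{topic}' targeting the '{level}' level of Bloom's "
            f"taxonomy, and include the intended answer.")

# Example: bloom_prompt("photosynthesis", "analyze")
```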
arXiv Detail & Related papers (2024-08-08T11:56:57Z)
- LOVA3: Learning to Visual Question Answering, Asking and Assessment [61.51687164769517]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills. We introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z)
- Adapting Large Language Models for Education: Foundational Capabilities, Potentials, and Challenges [60.62904929065257]
Large language models (LLMs) offer a possible way to resolve this issue by comprehending individual requests.
This paper reviews recent LLM research related to educational capabilities, including mathematics, writing, programming, reasoning, and knowledge-based question answering.
arXiv Detail & Related papers (2023-12-27T14:37:32Z)
- Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering various aspects such as the problems' release time, difficulty levels, and the types of errors encountered.
Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline on problems released after September 2021, consistently across all difficulties and problem types.
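One way to reproduce this kind of contamination check is to partition problems by release date relative to the model's training cutoff and compare solve rates; the `released`/`solved` attributes below are assumed fields, not the paper's code.

```python
from datetime import date

CUTOFF = date(2021, 9, 30)  # approximate GPT-4 training-data cutoff

def solve_rates_by_release(problems):
    """Compare solve rates on problems released before vs. after the cutoff;
    a large gap hints at training-data contamination rather than skill."""
    pre  = [p for p in problems if p.released <= CUTOFF]
    post = [p for p in problems if p.released > CUTOFF]
    rate = lambda ps: sum(p.solved for p in ps) / max(len(ps), 1)
    return rate(pre), rate(post)
```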
arXiv Detail & Related papers (2023-12-04T18:58:57Z)