Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models
- URL: http://arxiv.org/abs/2408.10947v1
- Date: Tue, 20 Aug 2024 15:36:30 GMT
- Title: Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models
- Authors: Yuyan Chen, Chenwei Wu, Songzhou Yan, Panjun Liu, Haoyu Zhou, Yanghua Xiao
- Abstract summary: This study introduces a benchmark for evaluating the questioning capability of large language models (LLMs) as teachers in education.
We apply four metrics, including relevance, coverage, representativeness, and consistency, to evaluate the educational quality of LLMs' outputs.
Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses; Claude2 appears more apt as an interdisciplinary teacher.
- Score: 30.759154473275043
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Teachers are important for imparting knowledge and guiding learners, and the role of large language models (LLMs) as potential educators is emerging as an important area of study. Recognizing LLMs' capability to generate educational content can lead to advances in automated and personalized learning. While LLMs have been tested for their comprehension and problem-solving skills, their capability in teaching remains largely unexplored. In teaching, questioning is a key skill that guides students to analyze, evaluate, and synthesize core concepts and principles. Our research therefore introduces a benchmark that evaluates LLMs' questioning capability as teachers by assessing the educational questions they generate, using Anderson and Krathwohl's taxonomy across general, monodisciplinary, and interdisciplinary domains. We shift the focus from LLMs as learners to LLMs as educators, assessing their teaching capability by guiding them to generate questions. We apply four metrics, namely relevance, coverage, representativeness, and consistency, to evaluate the educational quality of LLMs' outputs. Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses, while Claude2 appears more apt as an interdisciplinary teacher. Furthermore, the automatic scores align with human perspectives.
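Below is a minimal Python sketch, not the authors' released code, of the two stages the abstract describes: prompting an LLM (acting as a teacher) for one question per Anderson and Krathwohl cognitive level, then having an LLM judge rate the set on the four metrics. The `query_llm` wrapper, the prompt wording, and the 1-5 rating scale are all assumptions for illustration.

```python
# Minimal sketch of the benchmark's two stages (assumed interfaces, not the
# paper's implementation): taxonomy-leveled question generation + LLM judging.

ANDERSON_KRATHWOHL_LEVELS = [
    "remember", "understand", "apply", "analyze", "evaluate", "create",
]

METRICS = ["relevance", "coverage", "representativeness", "consistency"]


def query_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM API (e.g., GPT-4 or Claude2)."""
    raise NotImplementedError("plug in your provider's client here")


def generate_questions(context: str) -> dict[str, str]:
    """Ask the model, acting as a teacher, for one question per level."""
    questions = {}
    for level in ANDERSON_KRATHWOHL_LEVELS:
        prompt = (
            "You are a teacher preparing an exam.\n"
            f"Course material:\n{context}\n\n"
            f"Write one question at the '{level}' level of "
            "Anderson and Krathwohl's taxonomy."
        )
        questions[level] = query_llm(prompt)
    return questions


def score_questions(context: str, questions: dict[str, str]) -> dict[str, float]:
    """Use an LLM judge to rate the question set on each metric (1-5)."""
    joined = "\n".join(f"[{lvl}] {q}" for lvl, q in questions.items())
    scores = {}
    for metric in METRICS:
        prompt = (
            f"Material:\n{context}\n\nQuestions:\n{joined}\n\n"
            f"Rate the {metric} of these questions from 1 to 5. "
            "Reply with the number only."
        )
        scores[metric] = float(query_llm(prompt))
    return scores
```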
Related papers
- Students Rather Than Experts: A New AI For Education Pipeline To Model More Human-Like And Personalised Early Adolescences [11.576679362717478]
This study focuses on language learning as a context for modeling virtual student agents.
By curating a dataset of personalized teacher-student interactions with various personality traits, we conduct multi-dimensional evaluation experiments.
arXiv Detail & Related papers (2024-10-21T07:18:24Z)
- Edu-Values: Towards Evaluating the Chinese Education Values of Large Language Models [9.761584874383873]
We present Edu-Values, the first Chinese education values evaluation benchmark designed to measure large language models' alignment ability.
We meticulously design and compile 1,418 questions, including multiple-choice, multi-modal question answering, subjective analysis, adversarial prompts, and questions on traditional Chinese culture.
Due to differences in educational culture, Chinese LLMs significantly outperform English LLMs, with Qwen 2 ranking first with a score of 81.37.
arXiv Detail & Related papers (2024-09-19T13:02:54Z)
- Automated Educational Question Generation at Different Bloom's Skill Levels using Large Language Models: Strategies and Evaluation [0.0]
We examine the ability of five state-of-the-art large language models to generate diverse and high-quality questions of different cognitive levels.
Our findings suggest that LLMs can generate relevant and high-quality educational questions at different cognitive levels when prompted with adequate information.
arXiv Detail & Related papers (2024-08-08T11:56:57Z)
- LOVA3: Learning to Visual Question Answering, Asking and Assessment [61.51687164769517]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge.
Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills.
We introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z)
- Enhancing Instructional Quality: Leveraging Computer-Assisted Textual Analysis to Generate In-Depth Insights from Educational Artifacts [13.617709093240231]
We examine how artificial intelligence (AI) and machine learning (ML) methods can analyze educational content, teacher discourse, and student responses to foster instructional improvement.
We identify key areas where AI/ML integration offers significant advantages, including teacher coaching, student support, and content development.
This paper emphasizes the importance of aligning AI/ML technologies with pedagogical goals to realize their full potential in educational settings.
arXiv Detail & Related papers (2024-03-06T18:29:18Z)
- Evaluating and Optimizing Educational Content with Large Language Model Judgments [52.33701672559594]
We use Language Models (LMs) as educational experts to assess the impact of various instructions on learning outcomes.
We introduce an instruction optimization approach in which one LM generates instructional materials using the judgments of another LM as a reward function.
Human teachers' evaluations of these LM-generated worksheets show a significant alignment between the LM judgments and human teacher preferences.
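A minimal sketch, assuming two hypothetical chat endpoints (`generator_llm`, `judge_llm`), of how a judge LM's rating can serve as a reward signal. The paper's actual optimization procedure may differ; this sketch simplifies it to best-of-n selection.

```python
# Best-of-n instruction optimization with an LM judge as the reward function
# (assumed interfaces, not the paper's implementation).

def generator_llm(prompt: str) -> str:
    raise NotImplementedError  # e.g., a call to a text-generation API


def judge_llm(prompt: str) -> str:
    raise NotImplementedError  # e.g., a call to a second model used as judge


def optimize_worksheet(topic: str, n_candidates: int = 4) -> str:
    """Generate several candidate worksheets and keep the judge's favorite."""
    best, best_reward = "", float("-inf")
    for i in range(n_candidates):
        worksheet = generator_llm(
            f"Draft worksheet #{i + 1} teaching '{topic}' to novices."
        )
        # The judge's rating acts as the reward signal for selection.
        reward = float(judge_llm(
            f"Rate 1-10 how well this worksheet would help a student "
            f"learn '{topic}'. Reply with a number only.\n\n{worksheet}"
        ))
        if reward > best_reward:
            best, best_reward = worksheet, reward
    return best
```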
arXiv Detail & Related papers (2024-03-05T09:09:15Z)
- Adapting Large Language Models for Education: Foundational Capabilities, Potentials, and Challenges [60.62904929065257]
Large language models (LLMs) offer the possibility of addressing learners' individual needs by comprehending their requests.
This paper reviews the recently emerged LLM research related to educational capabilities, including mathematics, writing, programming, reasoning, and knowledge-based question answering.
arXiv Detail & Related papers (2023-12-27T14:37:32Z)
- Impact of Guidance and Interaction Strategies for LLM Use on Learner Performance and Perception [19.335003380399527]
Large language models (LLMs) offer a promising avenue, with increasing research exploring their educational utility.
Our work highlights the role that teachers can play in shaping LLM-supported learning environments.
arXiv Detail & Related papers (2023-10-13T01:21:52Z)
- Exploring the Cognitive Knowledge Structure of Large Language Models: An Educational Diagnostic Assessment Approach [50.125704610228254]
Large Language Models (LLMs) have not only exhibited exceptional performance across various tasks, but also demonstrated sparks of intelligence.
Recent studies have focused on assessing their capabilities on human exams and revealed their impressive competence in different domains.
We conduct an evaluation using MoocRadar, a meticulously annotated human test dataset based on Bloom's taxonomy.
arXiv Detail & Related papers (2023-10-12T09:55:45Z)
- A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
- Do Large Language Models Know What They Don't Know? [74.65014158544011]
Large language models (LLMs) have a wealth of knowledge that allows them to excel in various Natural Language Processing (NLP) tasks.
Despite their vast knowledge, LLMs are still limited by the amount of information they can accommodate and comprehend.
This study aims to evaluate LLMs' self-knowledge by assessing their ability to identify unanswerable or unknowable questions.
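A minimal sketch, not the paper's protocol, of one way to measure this: pose questions with known answerability labels and count how often the model correctly declines the unanswerable ones. The `query_llm` wrapper and the refusal markers are illustrative assumptions.

```python
# Self-knowledge check over labeled (question, is_answerable) pairs
# (assumed interface, not the paper's protocol).

UNKNOWN_MARKERS = ("i don't know", "cannot be answered", "unknowable")


def query_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM API wrapper


def self_knowledge_rate(items: list[tuple[str, bool]]) -> float:
    """Return the fraction of items the model handles correctly."""
    correct = 0
    for question, is_answerable in items:
        reply = query_llm(
            f"{question}\nIf this cannot be answered, say \"I don't know\"."
        ).lower()
        declined = any(m in reply for m in UNKNOWN_MARKERS)
        # Correct = answered when answerable, declined when not.
        correct += (not declined) if is_answerable else declined
    return correct / len(items)
```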
arXiv Detail & Related papers (2023-05-29T15:30:13Z)