Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions
- URL: http://arxiv.org/abs/2405.20526v1
- Date: Thu, 30 May 2024 22:57:49 GMT
- Title: Automated Generation and Tagging of Knowledge Components from Multiple-Choice Questions
- Authors: Steven Moore, Robin Schmucker, Tom Mitchell, John Stamper
- Abstract summary: We employ GPT-4 to generate KCs for multiple-choice questions (MCQs) in Chemistry and E-Learning.
We analyzed discrepancies between the KCs generated by the Large Language Model (LLM) and those made by humans.
We also developed an ontology induction algorithm to cluster questions that assess similar KCs based on their content.
- Score: 2.6644846626273457
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge Components (KCs) linked to assessments enhance the measurement of student learning, enrich analytics, and facilitate adaptivity. However, generating and linking KCs to assessment items requires significant effort and domain-specific knowledge. To streamline this process for higher-education courses, we employed GPT-4 to generate KCs for multiple-choice questions (MCQs) in Chemistry and E-Learning. We analyzed discrepancies between the KCs generated by the Large Language Model (LLM) and those made by humans through evaluation from three domain experts in each subject area. This evaluation aimed to determine whether, in instances of non-matching KCs, evaluators showed a preference for the LLM-generated KCs over their human-created counterparts. We also developed an ontology induction algorithm to cluster questions that assess similar KCs based on their content. Our most effective LLM strategy accurately matched KCs for 56% of Chemistry and 35% of E-Learning MCQs, with even higher success when considering the top five KC suggestions. Human evaluators favored LLM-generated KCs, choosing them over human-assigned ones approximately two-thirds of the time, a preference that was statistically significant across both domains. Our clustering algorithm successfully grouped questions by their underlying KCs without needing explicit labels or contextual information. This research advances the automation of KC generation and classification for assessment items, alleviating the need for student data or predefined KC labels.
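To make the clustering idea concrete, here is a minimal sketch of grouping MCQs by text similarity; it is an illustration under our own assumptions (TF-IDF features, agglomerative clustering, invented question stems), not the paper's ontology induction algorithm.

```python
# Illustration only: group MCQs that likely assess the same KC by text
# similarity. Question stems are made up; this is not the paper's
# ontology induction algorithm.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2 for `metric=`

questions = [
    "Balance the equation: H2 + O2 -> H2O",
    "How many moles of O2 react with 4 moles of H2?",
    "Which learning theory emphasizes reinforcement and punishment?",
    "In operant conditioning, what role does reinforcement play?",
]

# Represent each question by its TF-IDF vector.
X = TfidfVectorizer(stop_words="english").fit_transform(questions).toarray()

# Agglomerative clustering with cosine distance; no cluster count (and no
# KC labels) is specified, only a distance threshold chosen for illustration.
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.9,
    metric="cosine", linkage="average",
).fit_predict(X)

for question, cluster in zip(questions, labels):
    print(f"cluster {cluster}: {question}")
```

Questions assessing the same KC tend to share vocabulary, so even this crude sketch recovers coarse groupings without explicit KC labels or contextual information, which is the property the paper's clustering result highlights.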
Related papers
- MRCEval: A Comprehensive, Challenging and Accessible Machine Reading Comprehension Benchmark [51.73839215956791]
We introduce a novel taxonomy that categorizes the key capabilities required for reading comprehension (RC)
Based on this taxonomy, we construct MRCEval, an MRC benchmark that leverages advanced Large Language Models (LLMs) as sample generators and selection judges.
MRCEval is a comprehensive, challenging and accessible benchmark, covering 13 distinct RC skills with a total of 2.1K high-quality multi-choice questions.
arXiv Detail & Related papers (2025-03-10T10:20:05Z) - CritiQ: Mining Data Quality Criteria from Human Preferences [70.35346554179036]
We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality.
CritiQ Flow employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments.
We demonstrate the effectiveness of our method in the code, math, and logic domains.
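As a rough illustration of the criteria-mining loop (not CritiQ's implementation; the LLM worker agent is replaced here by a hypothetical toy judge), one can score candidate criteria by how well their pairwise judgments agree with human preferences and keep the best ones:

```python
# Toy illustration of mining criteria from human preferences (not CritiQ's
# implementation): keep the candidate criteria whose pairwise judgments
# agree best with human preference labels. The worker judgment is a
# hypothetical stub; CritiQ uses LLM worker agents for this step.
from typing import Callable, List, Tuple

Pair = Tuple[str, str]              # (text_a, text_b)
Judge = Callable[[str, Pair], int]  # returns 0 if text_a is preferred, 1 if text_b

def criterion_agreement(criterion: str, pairs: List[Pair],
                        human_prefs: List[int], judge: Judge) -> float:
    """Fraction of pairs where the judgment under this criterion matches humans."""
    hits = sum(judge(criterion, p) == h for p, h in zip(pairs, human_prefs))
    return hits / len(pairs)

def select_criteria(candidates: List[str], pairs: List[Pair],
                    human_prefs: List[int], judge: Judge, k: int = 3) -> List[str]:
    """Rank candidate criteria by agreement with human preferences, keep top k."""
    ranked = sorted(candidates, reverse=True,
                    key=lambda c: criterion_agreement(c, pairs, human_prefs, judge))
    return ranked[:k]

# Stand-in judge for demonstration (a real system would query an LLM worker).
toy_judge: Judge = lambda criterion, pair: int(len(pair[1]) > len(pair[0]))

pairs = [("short answer", "a noticeably longer, more detailed answer"), ("tiny", "big")]
print(select_criteria(["clarity", "completeness"], pairs, [1, 0], toy_judge, k=1))
```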
arXiv Detail & Related papers (2025-02-26T16:33:41Z) - Automated Knowledge Component Generation and Knowledge Tracing for Coding Problems [2.1464087136305774]
Knowledge components (KCs) mapped to problems help model student learning, tracking their mastery levels on fine-grained skills.
We present a fully automated, LLM-based pipeline for KC generation and tagging for open-ended programming problems.
We conduct a human evaluation showing that our pipeline's KC tagging is reasonably accurate compared to tagging by human domain experts.
arXiv Detail & Related papers (2025-02-25T20:40:51Z) - Sparse Binary Representation Learning for Knowledge Tracing [0.0]
Knowledge tracing (KT) models aim to predict students' future performance based on their historical interactions.
Most existing KT models rely exclusively on human-defined knowledge concepts associated with exercises.
We propose a KT model, Sparse Binary Representation KT (SBRKT), that generates new KC labels, referred to as auxiliary KCs.
arXiv Detail & Related papers (2025-01-17T00:45:10Z) - AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage.
CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers.
Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
arXiv Detail & Related papers (2024-10-25T17:06:27Z) - Automated Knowledge Concept Annotation and Question Representation Learning for Knowledge Tracing [59.480951050911436]
We present KCQRL, a framework for automated knowledge concept annotation and question representation learning.
We demonstrate the effectiveness of KCQRL across 15 KT algorithms on two large real-world Math learning datasets.
arXiv Detail & Related papers (2024-10-02T16:37:19Z) - AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses [26.850344968677582]
We propose a method that leverages large language models to evaluate answers to open-ended questions.
We conducted experiments on four datasets using both ChatGPT-3.5-turbo and GPT-4.
Our results indicate that our approach more closely aligns with human judgment compared to the four baselines.
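The AHP ingredient can be sketched as follows; the comparison matrix, criterion names, and scores are invented for illustration and are not taken from the paper.

```python
# Invented illustration of the AHP step: derive criterion weights from a
# reciprocal pairwise-comparison matrix (principal eigenvector) and combine
# per-criterion scores into one grade. Matrix, criteria, and scores are
# made up; they are not the paper's data.
import numpy as np

# Pairwise comparisons over three hypothetical criteria
# (e.g. correctness vs. completeness vs. clarity).
A = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 2.0],
    [1/5, 1/2, 1.0],
])

# Principal eigenvector of A gives the criterion weights.
eigvals, eigvecs = np.linalg.eig(A)
principal = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
weights = principal / principal.sum()

# Per-criterion scores for one open-ended answer (e.g. produced by an LLM judge).
scores = np.array([0.8, 0.6, 0.9])
print("criterion weights:", np.round(weights, 3))
print("overall score:", round(float(weights @ scores), 3))
```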
arXiv Detail & Related papers (2024-10-02T05:22:07Z) - ABCDE: Application-Based Cluster Diff Evals [49.1574468325115]
ABCDE aims to be practical: it allows items to have application-specific importance values, it is frugal in its use of human judgements when determining which clustering is better, and it can report metrics for arbitrary slices of items.
The approach to measuring the delta in clustering quality is novel: instead of constructing an expensive ground truth up front and evaluating each clustering against it, ABCDE samples questions for judgement on the basis of the actual diffs between the clusterings.
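A minimal sketch of the diff-based sampling idea (our illustration, not the ABCDE algorithm itself): enumerate item pairs whose co-membership differs between the two clusterings and draw an importance-weighted sample for human judgment.

```python
# Sketch of diff-based sampling between two clusterings of the same items;
# an illustration of the idea above, not the ABCDE algorithm itself.
import itertools
import random

def diff_pairs(clustering_a, clustering_b):
    """Yield item pairs whose co-membership differs between the two clusterings."""
    for i, j in itertools.combinations(clustering_a, 2):
        together_a = clustering_a[i] == clustering_a[j]
        together_b = clustering_b[i] == clustering_b[j]
        if together_a != together_b:
            yield (i, j)

def sample_for_judgment(clustering_a, clustering_b, importance, k=5, seed=0):
    """Importance-weighted sample of differing pairs to send to human raters."""
    pairs = list(diff_pairs(clustering_a, clustering_b))
    weights = [importance[i] + importance[j] for i, j in pairs]
    random.seed(seed)
    return random.choices(pairs, weights=weights, k=min(k, len(pairs)))

# Toy usage: q2 moved clusters between the two clusterings.
a = {"q1": 0, "q2": 0, "q3": 1}
b = {"q1": 0, "q2": 1, "q3": 1}
print(sample_for_judgment(a, b, importance={"q1": 1.0, "q2": 2.0, "q3": 1.0}, k=2))
```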
arXiv Detail & Related papers (2024-07-31T08:29:35Z) - TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot [2.186726107112913]
We propose a model-based evaluation method: TALEC.
It allows users to flexibly set their own evaluation criteria and uses in-context learning (ICL) to teach the judge model these in-house criteria.
TALEC demonstrates a strong capability to accurately reflect human preferences and achieves a correlation of over 80% with human judgments.
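A hypothetical sketch of what such an ICL judge prompt could look like (the criteria and examples below are invented; the resulting prompt would be sent to whatever judge model is available):

```python
# Hypothetical sketch (not TALEC itself): build an in-context-learning prompt
# that states in-house criteria plus a few scored examples, to be sent to a
# judge model. Criteria and examples are invented for illustration.
CRITERIA = [
    "Factual accuracy with respect to the in-house knowledge base",
    "Compliance with the domain's terminology conventions",
]

FEW_SHOT = [  # (answer, score) pairs from in-house annotators (made up)
    ("The warranty covers parts for 24 months.", 5),
    ("The warranty probably covers something.", 2),
]

def build_judge_prompt(answer: str) -> str:
    lines = ["You grade answers on a 1-5 scale using these criteria:"]
    lines += [f"- {c}" for c in CRITERIA]
    lines.append("Examples:")
    lines += [f"Answer: {a}\nScore: {s}" for a, s in FEW_SHOT]
    lines.append(f"Answer: {answer}\nScore:")
    return "\n".join(lines)

# The assembled prompt would be passed to the judge LLM of choice.
print(build_judge_prompt("Coverage lasts two years for parts only."))
```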
arXiv Detail & Related papers (2024-06-25T10:02:42Z) - Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data [89.2410799619405]
We introduce the Quantitative Reasoning with Data benchmark to evaluate Large Language Models' capability in statistical and causal reasoning with real-world data.
The benchmark comprises a dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers.
To compare models' quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText.
arXiv Detail & Related papers (2024-02-27T16:15:03Z) - KoLA: Carefully Benchmarking World Knowledge of Large Language Models [87.96683299084788]
We construct a Knowledge-oriented LLM Assessment benchmark (KoLA)
We mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks.
We use both Wikipedia, a corpus on which LLMs are commonly pre-trained, and continuously collected emerging corpora to evaluate the capacity to handle unseen data and evolving knowledge.
arXiv Detail & Related papers (2023-06-15T17:20:46Z) - Evaluating the Knowledge Dependency of Questions [12.25396414711877]
We propose a novel automatic evaluation metric, coined Knowledge Dependent Answerability (KDA)
We first show how to measure KDA based on student responses from a human survey.
Then, we propose two automatic evaluation metrics, KDA_disc and KDA_cont, that approximate KDA by leveraging pre-trained language models to imitate students' problem-solving behavior.
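As a loose illustration only: the paper defines KDA precisely from survey responses, but a crude stand-in is the gap in correctness between respondents who do and do not hold the target knowledge.

```python
# Loose stand-in for a knowledge-dependency measure, NOT the paper's KDA
# formula: the gap in correctness between survey respondents who do and
# do not hold the target knowledge.
import numpy as np

def knowledge_dependency_gap(correct, knows):
    """correct, knows: boolean arrays over survey respondents."""
    correct, knows = np.asarray(correct, bool), np.asarray(knows, bool)
    p_with = correct[knows].mean() if knows.any() else 0.0
    p_without = correct[~knows].mean() if (~knows).any() else 0.0
    return p_with - p_without

# Toy survey: the question is mostly answerable only by those who know the fact.
print(knowledge_dependency_gap([1, 1, 1, 0, 0, 1], [1, 1, 1, 0, 0, 0]))
```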
arXiv Detail & Related papers (2022-11-21T23:08:30Z) - Using Representation Expressiveness and Learnability to Evaluate Self-Supervised Learning Methods [61.49061000562676]
We introduce Cluster Learnability (CL) to assess learnability.
CL is measured in terms of the performance of a KNN trained to predict labels obtained by clustering the representations with K-means.
We find that CL better correlates with in-distribution model performance than other competing recent evaluation schemes.
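A minimal sketch of the CL measurement as described above: cluster the representations with K-means, then score how well a KNN recovers those cluster labels under cross-validation (hyperparameters here are arbitrary, not the paper's settings).

```python
# Minimal sketch of the CL measurement described above: K-means labels the
# representations, and a KNN's cross-validated accuracy at predicting those
# labels serves as the learnability score. Hyperparameters are arbitrary.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold, cross_val_score

def cluster_learnability(representations, n_clusters=10, n_neighbors=5):
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(representations)
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(knn, representations, labels, cv=cv).mean()

# Toy usage with random "representations" (illustration only).
print(cluster_learnability(np.random.RandomState(0).randn(200, 32)))
```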
arXiv Detail & Related papers (2022-06-02T19:05:13Z) - Classifying Math KCs via Task-Adaptive Pre-Trained BERT [14.53486865876146]
This work significantly improves prior research by expanding the input types to include KC descriptions, instructional video titles, and problem descriptions.
We also propose a simple evaluation measure by which we can recover 56-73% of mispredicted KC labels.
arXiv Detail & Related papers (2021-05-24T15:27:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.