Evaluating the Knowledge Dependency of Questions
- URL: http://arxiv.org/abs/2211.11902v1
- Date: Mon, 21 Nov 2022 23:08:30 GMT
- Title: Evaluating the Knowledge Dependency of Questions
- Authors: Hyeongdon Moon, Yoonseok Yang, Jamin Shin, Hangyeol Yu, Seunghyun Lee,
Myeongho Jeong, Juneyoung Park, Minsam Kim, Seungtaek Choi
- Abstract summary: We propose a novel automatic evaluation metric, coined Knowledge Dependent Answerability (KDA).
We first show how to measure KDA based on student responses from a human survey.
Then, we propose two automatic evaluation metrics, KDA_disc and KDA_cont, that approximate KDA by leveraging pre-trained language models to imitate students' problem-solving behavior.
- Score: 12.25396414711877
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The automatic generation of Multiple Choice Questions (MCQ) has the potential
to reduce the time educators spend on student assessment significantly.
However, existing evaluation metrics for MCQ generation, such as BLEU, ROUGE,
and METEOR, focus on the n-gram based similarity of the generated MCQ to the
gold sample in the dataset and disregard their educational value. They fail to
evaluate the MCQ's ability to assess the student's knowledge of the
corresponding target fact. To tackle this issue, we propose a novel automatic
evaluation metric, coined Knowledge Dependent Answerability (KDA), which
measures the MCQ's answerability given knowledge of the target fact.
Specifically, we first show how to measure KDA based on student responses from
a human survey. Then, we propose two automatic evaluation metrics, KDA_disc and
KDA_cont, that approximate KDA by leveraging pre-trained language models to
imitate students' problem-solving behavior. Through our human studies, we show
that KDA_disc and KDA_cont have strong correlations with both (1) KDA and (2)
usability in an actual classroom setting, labeled by experts. Furthermore, when
combined with n-gram based similarity metrics, KDA_disc and KDA_cont are shown
to have a strong predictive power for various expert-labeled MCQ quality
measures.
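The abstract does not spell out how KDA_disc is computed, so the sketch below is only a rough proxy for the idea, not the authors' implementation: an LM "student" (gpt2 here, an arbitrary choice) picks the most likely option once with the target fact in the prompt and once without it, and the MCQ counts as knowledge-dependent only when the fact is what makes it answerable. The prompt format, model, and scoring rule are all assumptions.

```python
# Rough proxy for a KDA_disc-style check (not the authors' code): does an LM
# "student" answer correctly only when the target fact is shown?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def option_logprob(prompt: str, option: str) -> float:
    """Sum of token log-probabilities of `option` as a continuation of `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = lm(full_ids).logits.log_softmax(-1)
    return sum(logprobs[0, pos - 1, full_ids[0, pos]].item()
               for pos in range(prompt_len, full_ids.shape[1]))

def kda_disc_proxy(fact: str, question: str, options: list[str], answer_idx: int) -> int:
    """1 if the 'student' picks the correct option with the fact but not without it."""
    def pick(prompt: str) -> int:
        return max(range(len(options)), key=lambda i: option_logprob(prompt, options[i]))
    with_fact = pick(f"Fact: {fact}\nQuestion: {question}\nAnswer:")
    without_fact = pick(f"Question: {question}\nAnswer:")
    return int(with_fact == answer_idx and without_fact != answer_idx)
```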
Related papers
- A Step Towards Mixture of Grader: Statistical Analysis of Existing Automatic Evaluation Metrics [6.571049277167304]
We study the statistics of the existing evaluation metrics for a better understanding of their limitations.
As a potential solution, we discuss how a Mixture of Grader could improve the quality of automatic QA evaluation.
arXiv Detail & Related papers (2024-10-13T22:10:42Z)
- An Automatic Question Usability Evaluation Toolkit [1.2499537119440245]
Evaluating multiple-choice questions (MCQs) involves either labor-intensive human assessments or automated methods that prioritize readability.
We introduce SAQUET, an open-source tool that leverages the Item-Writing Flaws (IWF) rubric for a comprehensive and automated quality evaluation of MCQs.
With an accuracy rate of over 94%, our findings highlight the limitations of existing evaluation methods and demonstrate the tool's potential for improving the quality of educational assessments.
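As a loose illustration only, a rubric-style evaluator can be expressed as a set of per-item checks; the flaws below are a few classic item-writing flaws in simplified form and are not SAQUET's actual rules, which are richer and partly model-based.

```python
# Simplified, hypothetical item-writing-flaw checks in the spirit of an IWF
# rubric; SAQUET's real evaluation is far more comprehensive.
def check_mcq(stem: str, options: list[str], answer_idx: int) -> list[str]:
    flaws = []
    lowered = [o.lower().strip() for o in options]
    if any(o in {"all of the above", "none of the above"} for o in lowered):
        flaws.append("uses an 'all/none of the above' option")
    distractor_max = max(len(o) for i, o in enumerate(options) if i != answer_idx)
    if len(options[answer_idx]) > 1.5 * distractor_max:
        flaws.append("correct option is conspicuously longer than the distractors")
    if len(set(lowered)) < len(options):
        flaws.append("contains duplicate options")
    if " not " in stem.lower() and " NOT " not in stem:
        flaws.append("negative wording in the stem is not emphasized")
    return flaws

print(check_mcq(
    "Which gas do plants absorb during photosynthesis?",
    ["Carbon dioxide", "Oxygen", "Nitrogen", "All of the above"],
    answer_idx=0,
))
```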
arXiv Detail & Related papers (2024-05-30T23:04:53Z)
- AQuA -- Combining Experts' and Non-Experts' Views To Assess Deliberation Quality in Online Discussions Using LLMs [0.9737366359397255]
AQuA is an additive score that calculates a unified deliberative quality score from multiple indices for each discussion post.
We develop adapter models for 20 deliberative indices and calculate correlation coefficients between experts' annotations and non-experts' perceived deliberativeness to weight the individual indices within the single deliberative score.
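As a schematic of such an additive score (the index names and weights below are placeholders, not the published AQuA coefficients):

```python
# Schematic additive quality score: per-index predictions weighted by how well
# each index tracked human judgements. Indices and weights are placeholders.
def additive_score(index_scores: dict[str, float], weights: dict[str, float]) -> float:
    return sum(w * index_scores.get(name, 0.0) for name, w in weights.items())

weights = {"justification": 0.46, "relevance": 0.29, "civility": 0.25}  # hypothetical weights
post = {"justification": 0.8, "relevance": 0.6, "civility": 0.9}        # adapter outputs for one post
print(round(additive_score(post, weights), 3))  # 0.767
```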
arXiv Detail & Related papers (2024-04-03T14:07:02Z)
- K-QA: A Real-World Medical Q&A Benchmark [12.636564634626422]
We construct K-QA, a dataset containing 1,212 patient questions originating from real-world conversations held on K Health.
We employ a panel of in-house physicians to answer and manually decompose a subset of K-QA into self-contained statements.
We evaluate several state-of-the-art models, as well as the effect of in-context learning and medically-oriented augmented retrieval schemes.
arXiv Detail & Related papers (2024-01-25T20:11:04Z)
- Faithful Knowledge Distillation [75.59907631395849]
We focus on two crucial questions with regard to a teacher-student pair: (i) do the teacher and student disagree at points close to correctly classified dataset examples, and (ii) is the distilled student as confident as the teacher around dataset examples?
These are critical questions when considering the deployment of a smaller student network trained from a robust teacher within a safety-critical setting.
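A minimal sketch of these two checks, under assumed interfaces: sample small perturbations around a dataset example and measure (i) how often teacher and student disagree and (ii) the average gap in their confidence. The Gaussian perturbations, radius, and toy models below are illustrative, not the paper's protocol.

```python
# Sketch of the two diagnostics near a dataset example: local teacher/student
# disagreement and the confidence gap. Perturbation scheme and toy models are
# illustrative assumptions.
import numpy as np

def local_faithfulness(teacher, student, x, n_samples=200, radius=0.05, seed=0):
    rng = np.random.default_rng(seed)
    xs = x + radius * rng.standard_normal((n_samples, *x.shape))
    t_probs, s_probs = teacher(xs), student(xs)          # (n, num_classes) softmax outputs
    disagreement = float((t_probs.argmax(1) != s_probs.argmax(1)).mean())   # question (i)
    confidence_gap = float((t_probs.max(1) - s_probs.max(1)).mean())        # question (ii)
    return disagreement, confidence_gap

def softmax(z):
    return np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

# Toy linear "networks" standing in for a robust teacher and a distilled student.
teacher = lambda xs: softmax(xs @ np.array([[2.0, -2.0], [1.0, -1.0]]))
student = lambda xs: softmax(xs @ np.array([[1.5, -1.5], [0.5, -0.5]]))
print(local_faithfulness(teacher, student, np.array([0.3, -0.2])))
```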
arXiv Detail & Related papers (2023-06-07T13:41:55Z)
- QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation [6.697751970080859]
Question Generation (QG) aims to automate the task of composing questions for a passage with a set of chosen answers.
We propose a new reference-free evaluation metric that has the potential to provide a better mechanism for evaluating QG systems, called QAScore.
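As a loose illustration of a reference-free, masked-LM-based QG score (this mirrors the general flavour only and is not necessarily QAScore's exact formulation; the model and prompt layout are assumptions): mask each answer token in turn and accumulate its log-probability given the passage and question.

```python
# Illustrative reference-free QG score: mean masked-LM log-probability of the
# answer tokens given passage + question. Not necessarily QAScore's exact recipe.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")
mlm.eval()

def qg_score(passage: str, question: str, answer: str) -> float:
    ids = tok(f"{passage} Question: {question} Answer: {answer}",
              return_tensors="pt").input_ids
    answer_ids = tok(" " + answer, add_special_tokens=False).input_ids
    start = ids.shape[1] - 1 - len(answer_ids)            # answer sits just before </s>
    total = 0.0
    for i, true_id in enumerate(answer_ids):
        masked = ids.clone()
        masked[0, start + i] = tok.mask_token_id
        with torch.no_grad():
            logprobs = mlm(masked).logits.log_softmax(-1)
        total += logprobs[0, start + i, true_id].item()
    return total / len(answer_ids)

print(qg_score("Paris is the capital of France.",
               "What is the capital of France?", "Paris"))
```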
arXiv Detail & Related papers (2022-10-09T19:00:39Z)
- Cognitive Diagnosis with Explicit Student Vector Estimation and Unsupervised Question Matrix Learning [53.79108239032941]
We propose an explicit student vector estimation (ESVE) method to estimate the student vectors of DINA.
We also propose an unsupervised method, the heuristic bidirectional calibration algorithm (HBCA), to label the Q-matrix automatically.
The experimental results on two real-world datasets show that ESVE-DINA outperforms the DINA model on accuracy and that the Q-matrix labeled automatically by HBCA can achieve performance comparable to that obtained with the manually labeled Q-matrix.
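For context, the DINA response model that ESVE builds on can be stated compactly: a student answers item j correctly with probability 1 - slip_j if they master every skill the Q-matrix requires for j, and with probability guess_j otherwise. The toy Q-matrix and slip/guess values below are illustrative; the ESVE and HBCA procedures themselves are not reproduced here.

```python
# DINA response model with a toy Q-matrix; slip/guess values are illustrative.
import numpy as np

def dina_correct_prob(student_skills, q_matrix, slip, guess):
    """student_skills: (K,) 0/1 mastery vector; q_matrix: (J, K) 0/1 skill requirements."""
    eta = np.all(q_matrix <= student_skills, axis=1)      # does the student master item j?
    return np.where(eta, 1.0 - slip, guess)

q = np.array([[1, 0, 0],        # item 1 needs skill 1
              [1, 1, 0],        # item 2 needs skills 1 and 2
              [0, 0, 1]])       # item 3 needs skill 3
skills = np.array([1, 0, 1])    # the student masters skills 1 and 3
print(dina_correct_prob(skills, q,
                        slip=np.array([0.1, 0.1, 0.2]),
                        guess=np.array([0.2, 0.25, 0.2])))   # [0.9, 0.25, 0.8]
```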
arXiv Detail & Related papers (2022-03-01T03:53:19Z)
- Towards Quantifiable Dialogue Coherence Evaluation [126.55560816209756]
Quantifiable Dialogue Coherence Evaluation (QuantiDCE) is a novel framework aiming to train a quantifiable dialogue coherence metric.
QuantiDCE includes two training stages, Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning.
Experimental results show that the model trained by QuantiDCE presents stronger correlations with human judgements than the other state-of-the-art metrics.
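As a rough sketch of the ranking idea behind MLR pre-training (the margin, the toy scores, and the pairwise formulation below are assumptions, not QuantiDCE's published objective): responses grouped into increasing coherence levels should receive scores that respect that ordering.

```python
# Rough multi-level ranking sketch: scores for a higher-coherence group should
# exceed scores for every lower group by at least a margin. Margin and toy
# scores are assumptions, not QuantiDCE's exact loss.
import torch

def multi_level_ranking_loss(scores_by_level, margin=0.1):
    """scores_by_level: list of 1-D tensors ordered from least to most coherent."""
    loss = torch.tensor(0.0)
    for low in range(len(scores_by_level)):
        for high in range(low + 1, len(scores_by_level)):
            diff = scores_by_level[high].unsqueeze(1) - scores_by_level[low].unsqueeze(0)
            loss = loss + torch.clamp(margin - diff, min=0.0).mean()
    return loss

levels = [torch.tensor([0.10, 0.20]), torch.tensor([0.40, 0.55]), torch.tensor([0.90])]
print(multi_level_ranking_loss(levels))   # 0.0 -- the toy scores are already well ordered
```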
arXiv Detail & Related papers (2021-06-01T14:11:17Z)
- OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics [53.779709191191685]
We propose OpenMEVA, a benchmark for evaluating open-ended story generation metrics.
OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics.
We observe that existing metrics have poor correlation with human judgments, fail to recognize discourse-level incoherence, and lack inferential knowledge.
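The central measurement behind a benchmark of this kind is how well metric scores track human ratings of the same stories; a minimal version of that check (with made-up scores) looks like:

```python
# Correlation between a metric's scores and human ratings; all numbers made up.
from scipy.stats import pearsonr, spearmanr

human_ratings = [4.0, 2.5, 3.0, 1.0, 4.5]
metric_scores = [0.71, 0.40, 0.62, 0.35, 0.80]
print("Pearson: ", round(pearsonr(human_ratings, metric_scores)[0], 3))
print("Spearman:", round(spearmanr(human_ratings, metric_scores)[0], 3))
```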
arXiv Detail & Related papers (2021-05-19T04:45:07Z)
- QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering [122.84513233992422]
We propose a new model, QA-GNN, which addresses the problem of answering questions using knowledge from pre-trained language models (LMs) and knowledge graphs (KGs).
We show its improvement over existing LM and LM+KG models, as well as its capability to perform interpretable and structured reasoning.
arXiv Detail & Related papers (2021-04-13T17:32:51Z)
- KPQA: A Metric for Generative Question Answering Using Keyphrase Weights [64.54593491919248]
KPQA is a new metric for evaluating the correctness of generative question answering systems.
Our new metric assigns different weights to each token via keyphrase prediction.
We show that our proposed metric has a significantly higher correlation with human judgments than existing metrics.
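As a rough illustration of keyphrase weighting (the token weights below are hand-set for the example; KPQA learns them with a keyphrase-prediction model), a weighted token-overlap F1 looks like:

```python
# Keyphrase-weighted token overlap: tokens flagged as keyphrases contribute more
# to precision and recall. Weights are hand-set here; KPQA predicts them.
def weighted_f1(prediction: str, reference: str, weight: dict) -> float:
    pred, ref = prediction.lower().split(), reference.lower().split()
    pw = {t: weight.get(t, 0.1) for t in pred}
    rw = {t: weight.get(t, 0.1) for t in ref}
    overlap = sum(min(pw[t], rw[t]) for t in set(pred) & set(ref))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(pw.values()), overlap / sum(rw.values())
    return 2 * precision * recall / (precision + recall)

weights = {"1945": 1.0, "ended": 0.6}   # hand-set "keyphrase" weights
print(round(weighted_f1("the war ended in 1945", "world war ii ended in 1945", weights), 3))
```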
arXiv Detail & Related papers (2020-05-01T03:24:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.