Introducing a framework to assess newly created questions with Natural Language Processing
- URL: http://arxiv.org/abs/2004.13530v2
- Date: Wed, 6 May 2020 09:06:57 GMT
- Title: Introducing a framework to assess newly created questions with Natural Language Processing
- Authors: Luca Benedetto, Andrea Cappelli, Roberto Turrin, Paolo Cremonesi
- Abstract summary: We propose a framework to train and evaluate models for estimating the difficulty and discrimination of newly created Multiple Choice Questions.
We implement one model using this framework and test it on a real-world dataset provided by CloudAcademy.
- Score: 3.364554138758565
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Statistical models such as those derived from Item Response Theory (IRT)
enable the assessment of students on a specific subject, which can be useful
for several purposes (e.g., learning path customization, drop-out prediction).
However, the questions have to be assessed as well and, although it is possible
to estimate with IRT the characteristics of questions that have already been
answered by several students, this technique cannot be used on newly generated
questions. In this paper, we propose a framework to train and evaluate models
for estimating the difficulty and discrimination of newly created Multiple
Choice Questions by extracting meaningful features from the text of the
question and of the possible choices. We implement one model using this
framework and test it on a real-world dataset provided by CloudAcademy, showing
that it outperforms previously proposed models, reducing by 6.7% the RMSE for
difficulty estimation and by 10.8% the RMSE for discrimination estimation. We
also present the results of an ablation study performed to support our choice of
features and to show the effects of different characteristics of the questions'
text on difficulty and discrimination.
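To make the described pipeline concrete, here is a minimal sketch, not the authors' implementation: text features are extracted from the question stem and its answer choices, one regressor per IRT parameter is fitted on questions already calibrated with IRT, and RMSE against the calibrated values is used for evaluation, as in the abstract. The TF-IDF features, the random-forest regressor, and all function names are illustrative placeholders.

```python
# Minimal sketch (not the authors' code): estimate IRT difficulty and
# discrimination of new multiple-choice questions from text features.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def build_inputs(questions, choices):
    """Concatenate each question stem with its possible choices into one text."""
    return [q + " " + " ".join(c) for q, c in zip(questions, choices)]

def train_and_evaluate(train_texts, train_b, train_a, test_texts, test_b, test_a):
    """Fit one regressor per IRT parameter (b = difficulty, a = discrimination)
    on TF-IDF features of the item text and report RMSE on a held-out split."""
    vec = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)

    results = {}
    for name, y_train, y_test in [("difficulty", train_b, test_b),
                                  ("discrimination", train_a, test_a)]:
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = float(np.sqrt(mean_squared_error(y_test, pred)))
    return results  # {"difficulty": rmse_b, "discrimination": rmse_a}

# Example (hypothetical data):
# texts = build_inputs(questions, choices)
# metrics = train_and_evaluate(texts[:8000], b[:8000], a[:8000],
#                              texts[8000:], b[8000:], a[8000:])
```

The same skeleton would also support the ablation study mentioned above: feature groups could be dropped from the input construction and the change in RMSE recorded.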
Related papers
- Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy [0.0]
This paper introduces AMMORE, a new dataset of 53,000 math open-response question-answer pairs from Rori.
We conduct two experiments to evaluate the use of large language models (LLM) for grading challenging student answers.
arXiv Detail & Related papers (2024-09-26T14:51:40Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- A Model-free Closeness-of-influence Test for Features in Supervised Learning [23.345517302581044]
We study the question of assessing the difference of influence that the two given features have on the response value.
We first propose a notion of closeness for the influence of features, and show that our definition recovers the familiar notion of the magnitude of coefficients in the model.
We then propose a novel method to test for the closeness of influence in general model-free supervised learning problems.
arXiv Detail & Related papers (2023-06-20T19:20:18Z)
- Improving Selective Visual Question Answering by Learning from Your Peers [74.20167944693424]
Visual Question Answering (VQA) models can have difficulties abstaining from answering when they are wrong.
We propose the Learning from Your Peers (LYP) approach for training multimodal selection functions that make abstention decisions.
Our approach uses predictions from models trained on distinct subsets of the training data as targets for optimizing a Selective VQA model.
arXiv Detail & Related papers (2023-06-14T21:22:01Z)
- Revealing Model Biases: Assessing Deep Neural Networks via Recovered Sample Analysis [9.05607520128194]
This paper proposes a straightforward and cost-effective approach to assess whether a deep neural network (DNN) relies on the primary concepts of training samples.
The proposed method does not require any test or generalization samples, only the parameters of the trained model and the training data that lie on the margin.
arXiv Detail & Related papers (2023-06-10T11:20:04Z)
- In Search of Insights, Not Magic Bullets: Towards Demystification of the Model Selection Dilemma in Heterogeneous Treatment Effect Estimation [92.51773744318119]
This paper empirically investigates the strengths and weaknesses of different model selection criteria.
We highlight that there is a complex interplay between selection strategies, candidate estimators and the data used for comparing them.
arXiv Detail & Related papers (2023-02-06T16:55:37Z)
- A New Score for Adaptive Tests in Bayesian and Credal Networks [64.80185026979883]
A test is adaptive when its sequence and number of questions are dynamically tuned on the basis of the estimated skills of the test taker.
We present an alternative family of scores, based on the mode of the posterior probabilities, and hence easier to explain.
arXiv Detail & Related papers (2021-05-25T20:35:42Z)
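As a toy illustration of the adaptive-test entry above (a generic sketch, not the paper's score on Bayesian or credal networks): the skill is modeled as a discrete variable, the posterior is updated after each answer, and the mode of the posterior gives an easily explained point estimate of the test taker's level.

```python
# Generic adaptive-testing sketch: discrete skill posterior and its mode.
import numpy as np

def posterior_update(prior, answer_correct, p_correct_given_skill):
    """Bayes update of a discrete skill posterior after one observed answer.
    p_correct_given_skill[k] = P(correct | skill level k) for the asked question."""
    likelihood = p_correct_given_skill if answer_correct else 1.0 - p_correct_given_skill
    posterior = prior * likelihood
    return posterior / posterior.sum()

def mode_skill(posterior, skill_levels):
    """Mode of the posterior: the single most probable skill level,
    which is straightforward to report and explain to the test taker."""
    return skill_levels[int(np.argmax(posterior))]
```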
- Learning with Instance Bundles for Reading Comprehension [61.823444215188296]
We introduce new supervision techniques that compare question-answer scores across multiple related instances.
Specifically, we normalize these scores across various neighborhoods of closely contrasting questions and/or answers.
We empirically demonstrate the effectiveness of training with instance bundles on two datasets.
arXiv Detail & Related papers (2021-04-18T06:17:54Z)
- A Revised Generative Evaluation of Visual Dialogue [80.17353102854405]
We propose a revised evaluation scheme for the VisDial dataset.
We measure consensus between answers generated by the model and a set of relevant answers.
We release these sets and code for the revised evaluation scheme as DenseVisDial.
arXiv Detail & Related papers (2020-04-20T13:26:45Z)
- Educational Question Mining At Scale: Prediction, Analysis and Personalization [35.42197158180065]
We propose a framework for mining insights from educational questions at scale.
We utilize a state-of-the-art Bayesian deep learning method, in particular the partial variational auto-encoder (p-VAE).
We apply our proposed framework to a real-world dataset with tens of thousands of questions and tens of millions of answers from an online education platform.
arXiv Detail & Related papers (2020-03-12T19:07:49Z)
- R2DE: a NLP approach to estimating IRT parameters of newly generated questions [3.364554138758565]
R2DE is a model capable of assessing newly generated multiple-choice questions by looking at the text of the question.
In particular, it can estimate the difficulty and the discrimination of each question.
arXiv Detail & Related papers (2020-01-21T14:31:01Z)
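For background on what "difficulty" and "discrimination" mean in both R2DE and the paper above, the sketch below implements the standard two-parameter logistic (2PL) IRT model; it is general IRT background, not code from either paper.

```python
# 2PL Item Response Theory model: the two parameters estimated from text.
import math

def p_correct_2pl(theta: float, difficulty: float, discrimination: float) -> float:
    """Probability that a student with latent skill `theta` answers an item
    correctly under the 2PL model:
    P(correct | theta) = 1 / (1 + exp(-discrimination * (theta - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
```

Difficulty shifts the response curve along the skill axis, while discrimination controls its steepness, i.e., how sharply the item separates students just below and just above the difficulty level.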
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.