A quantitative study of NLP approaches to question difficulty estimation
- URL: http://arxiv.org/abs/2305.10236v1
- Date: Wed, 17 May 2023 14:26:00 GMT
- Title: A quantitative study of NLP approaches to question difficulty estimation
- Authors: Luca Benedetto
- Abstract summary: This work quantitatively analyzes several approaches proposed in previous research and compares their performance on datasets from different educational domains.
We find that Transformer-based models are the best performing across different educational domains, with DistilBERT performing almost as well as BERT.
As for the other models, the hybrid ones often outperform those based on a single type of feature; models based on linguistic features perform well on reading comprehension questions, while frequency-based features (TF-IDF) and word embeddings (word2vec) perform better in domain knowledge assessment.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have witnessed an increase in the amount of research on the task of
Question Difficulty Estimation from Text (QDET) with Natural Language Processing
(NLP) techniques, with the goal of targeting the limitations of traditional
approaches to question calibration. However, almost the entirety of previous
research focused on single silos, without performing quantitative comparisons
between different models or across datasets from different educational domains.
In this work, we aim to fill this gap by quantitatively analyzing several
approaches proposed in previous research and comparing their performance on
three publicly available real world datasets containing questions of different
types from different educational domains. Specifically, we consider reading
comprehension Multiple Choice Questions (MCQs), science MCQs, and math
questions. We find that Transformer-based models are the best performing
across different educational domains, with DistilBERT performing almost as
well as BERT, and that they outperform other approaches even on smaller
datasets. As for the other models, the hybrid ones often outperform those
based on a single type of feature; models based on linguistic features perform
well on reading comprehension questions, while frequency-based features
(TF-IDF) and word embeddings (word2vec) perform better in domain knowledge
assessment.
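The feature-based baselines the abstract compares are straightforward to prototype. Below is a minimal sketch of a frequency-based (TF-IDF) difficulty regressor; the toy questions, labels, regressor choice, and hyperparameters are illustrative assumptions, not the paper's exact configuration. A word2vec variant would replace the vectorizer with averaged pre-trained word embeddings.

```python
# Minimal sketch: TF-IDF features + a regressor for question difficulty
# estimation (QDET). Hypothetical data; the paper's datasets, preprocessing,
# and model hyperparameters are not reproduced here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Toy examples standing in for (question text, calibrated difficulty).
questions = [
    "What is the capital of France?",
    "Explain how photosynthesis converts light energy into chemical energy.",
    "Solve for x: 2x + 3 = 11.",
    "Derive the time complexity of mergesort from its recurrence relation.",
]
difficulty = [0.2, 0.7, 0.4, 0.9]  # e.g., IRT difficulty rescaled to [0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    questions, difficulty, test_size=0.5, random_state=0
)

# TF-IDF turns each question into a sparse term-weight vector;
# the regressor then maps that vector to a difficulty score.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    RandomForestRegressor(n_estimators=100, random_state=0),
)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, preds))
```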
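For the Transformer-based approach the abstract finds strongest, a common formulation casts QDET as single-output regression on top of a pre-trained encoder. The following is a hedged sketch using DistilBERT; the Hugging Face checkpoint name, toy labels, and training loop are assumptions for illustration, not the paper's training setup.

```python
# Minimal sketch: fine-tuning DistilBERT as a regressor for QDET.
# Assumption: difficulty is a single scalar per question; epochs, batch
# construction, and learning rate are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# num_labels=1 with problem_type="regression" makes the model use MSE loss.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1, problem_type="regression"
)

questions = ["What is the capital of France?",
             "Derive the time complexity of mergesort."]
difficulty = torch.tensor([[0.2], [0.9]])  # toy difficulty labels

batch = tokenizer(questions, padding=True, truncation=True,
                  return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few illustrative gradient steps
    out = model(**batch, labels=difficulty)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    preds = model(**batch).logits.squeeze(-1)
print(preds)  # predicted difficulty scores
```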
Related papers
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - Learning with Instance Bundles for Reading Comprehension [61.823444215188296]
We introduce new supervision techniques that compare question-answer scores across multiple related instances.
Specifically, we normalize these scores across various neighborhoods of closely contrasting questions and/or answers.
We empirically demonstrate the effectiveness of training with instance bundles on two datasets.
arXiv Detail & Related papers (2021-04-18T06:17:54Z) - What's the best place for an AI conference, Vancouver or ______: Why
completing comparative questions is difficult [22.04829832439774]
We study the ability of neural LMs to ask (not answer) reasonable questions.
We show that accuracy in this fill-in-the-blank task is well-correlated with human judgements of whether a question is reasonable.
arXiv Detail & Related papers (2021-04-05T14:56:09Z) - Towards Few-Shot Fact-Checking via Perplexity [40.11397284006867]
We propose a new way of utilizing the powerful transfer learning ability of a language model via a perplexity score.
Our methodology can already outperform the Major Class baseline by more than 10% absolute on the F1-Macro metric.
We construct and publicly release two new fact-checking datasets related to COVID-19.
arXiv Detail & Related papers (2021-03-17T09:43:19Z) - Invariant Feature Learning for Sensor-based Human Activity Recognition [11.334750079923428]
We present an invariant feature learning framework (IFLF) that extracts common information shared across subjects and devices.
Experiments demonstrated that IFLF is effective in handling both subject and device diversity across popular open datasets and an in-house dataset.
arXiv Detail & Related papers (2020-12-14T21:56:17Z) - Improving QA Generalization by Concurrent Modeling of Multiple Biases [61.597362592536896]
Existing NLP datasets contain various biases that models can easily exploit to achieve high performance on the corresponding evaluation sets.
We propose a general framework for improving the performance on both in-domain and out-of-domain datasets by concurrent modeling of multiple biases in the training data.
We extensively evaluate our framework on extractive question answering with training data from various domains with multiple biases of different strengths.
arXiv Detail & Related papers (2020-10-07T11:18:49Z) - Sentiment Analysis Based on Deep Learning: A Comparative Study [69.09570726777817]
The study of public opinion can provide us with valuable information.
The efficiency and accuracy of sentiment analysis are being hindered by the challenges encountered in natural language processing.
This paper reviews the latest studies that have employed deep learning to solve sentiment analysis problems.
arXiv Detail & Related papers (2020-06-05T16:28:10Z) - Logic-Guided Data Augmentation and Regularization for Consistent
Question Answering [55.05667583529711]
This paper addresses the problem of improving the accuracy and consistency of responses to comparison questions.
Our method leverages logical and linguistic knowledge to augment labeled training data and then uses a consistency-based regularizer to train the model.
arXiv Detail & Related papers (2020-04-21T17:03:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.