Distractor generation for multiple-choice questions with predictive
prompting and large language models
- URL: http://arxiv.org/abs/2307.16338v1
- Date: Sun, 30 Jul 2023 23:15:28 GMT
- Title: Distractor generation for multiple-choice questions with predictive
prompting and large language models
- Authors: Semere Kiros Bitew, Johannes Deleu, Chris Develder and Thomas
Demeester
- Abstract summary: Large Language Models (LLMs) such as ChatGPT have demonstrated remarkable performance across various tasks.
We propose a strategy for guiding LLMs in generating relevant distractors by prompting them with question items automatically retrieved from a question bank.
We found that on average 53% of the generated distractors presented to the teachers were rated as high-quality, i.e., suitable for immediate use as is.
- Score: 21.233186754403093
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) such as ChatGPT have demonstrated remarkable
performance across various tasks and have garnered significant attention from
both researchers and practitioners. However, in an educational context, we
still observe a performance gap in generating distractors -- i.e., plausible
yet incorrect answers -- with LLMs for multiple-choice questions (MCQs). In
this study, we propose a strategy for guiding LLMs such as ChatGPT in
generating relevant distractors by prompting them with question items
automatically retrieved from a question bank as well-chosen in-context
examples. We evaluate our LLM-based solutions using a quantitative assessment
on an existing test set, as well as through quality annotations by human
experts, i.e., teachers. We found that on average 53% of the generated
distractors presented to the teachers were rated as high-quality, i.e.,
suitable for immediate use as is, outperforming the state-of-the-art model. We
also show the gains of our approach in generating high-quality distractors by
comparing it with a zero-shot ChatGPT and a few-shot ChatGPT prompted with
static examples.
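The retrieval-based prompting strategy described in the abstract can be sketched in a few lines of Python. The sketch below is illustrative only, not the authors' implementation: it assumes a toy in-memory question bank, uses TF-IDF cosine similarity in place of whatever retrieval model the paper actually employs, and the helper names (retrieve_examples, build_prompt) are hypothetical. The assembled few-shot prompt would then be sent to an LLM such as ChatGPT.

```python
# Minimal sketch: retrieve similar question items from a question bank and use
# them as in-context examples when prompting an LLM for distractors.
# Names and the toy question bank are illustrative, not from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy question bank: each item has a stem, the correct answer, and distractors.
question_bank = [
    {
        "question": "Which organ pumps blood through the body?",
        "answer": "the heart",
        "distractors": ["the lungs", "the liver", "the kidneys"],
    },
    {
        "question": "Which gas do plants absorb during photosynthesis?",
        "answer": "carbon dioxide",
        "distractors": ["oxygen", "nitrogen", "hydrogen"],
    },
]

def retrieve_examples(target_question: str, bank: list, k: int = 2) -> list:
    """Return the k bank items most similar to the target (TF-IDF cosine)."""
    corpus = [item["question"] for item in bank] + [target_question]
    tfidf = TfidfVectorizer().fit_transform(corpus)
    sims = cosine_similarity(tfidf[-1], tfidf[:-1]).ravel()
    top = sims.argsort()[::-1][:k]
    return [bank[i] for i in top]

def build_prompt(target_question: str, target_answer: str, examples: list) -> str:
    """Assemble a few-shot prompt with the retrieved items as in-context examples."""
    parts = ["Generate three plausible but incorrect answers (distractors).\n"]
    for ex in examples:
        parts.append(
            f"Question: {ex['question']}\nAnswer: {ex['answer']}\n"
            f"Distractors: {', '.join(ex['distractors'])}\n"
        )
    parts.append(f"Question: {target_question}\nAnswer: {target_answer}\nDistractors:")
    return "\n".join(parts)

examples = retrieve_examples("Which organ filters waste from the blood?", question_bank)
prompt = build_prompt("Which organ filters waste from the blood?", "the kidneys", examples)
print(prompt)  # This prompt would then be sent to an LLM such as ChatGPT.
```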
Related papers
- AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage.
CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers.
Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
arXiv Detail & Related papers (2024-10-25T17:06:27Z)
- Large Language Models Are Self-Taught Reasoners: Enhancing LLM Applications via Tailored Problem-Solving Demonstrations [4.207253227315905]
We present SELF-TAUGHT, a problem-solving framework, which facilitates customized demonstrations.
In 15 tasks of multiple-choice questions, SELF-TAUGHT achieves superior performance to strong baselines.
We conduct comprehensive analyses on SELF-TAUGHT, including its generalizability to existing prompting methods.
arXiv Detail & Related papers (2024-08-22T11:41:35Z)
- Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text [12.879551933541345]
Large Language Models (LLMs) are capable of generating human-like conversations.
Conventional metrics like BLEU and ROUGE are inadequate for capturing the subtle semantics and contextual richness of such generative outputs.
We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs-as-judges.
arXiv Detail & Related papers (2024-08-17T16:01:45Z)
- Leveraging LLMs for Dialogue Quality Measurement [27.046917937460798]
Large language models (LLMs) show robust zero-shot and few-shot capabilities across NLP tasks.
Manipulating factors such as model size, in-context examples, and selection techniques, we examine "chain-of-thought" (CoT) reasoning and label extraction procedures.
Our results indicate that LLMs that are suitably fine-tuned and have sufficient reasoning capabilities can be leveraged for automated dialogue evaluation.
arXiv Detail & Related papers (2024-06-25T06:19:47Z)
- MACAROON: Training Vision-Language Models To Be Your Engaged Partners [95.32771929749514]
Large vision-language models (LVLMs) generate detailed responses even when questions are ambiguous or unlabeled.
In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners.
We introduce MACAROON, self-iMaginAtion for ContrAstive pReference OptimizatiON, which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions.
arXiv Detail & Related papers (2024-06-20T09:27:33Z)
- ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models [25.74741863885925]
We propose a new benchmark for long-context models based on a practical meeting assistant scenario.
Our benchmark, named ELITR-Bench, augments the existing ELITR corpus' transcripts with 271 manually crafted questions and their ground-truth answers.
Our findings suggest that while GPT-4's evaluation scores are correlated with human judges', its ability to differentiate among more than three score levels may be limited.
arXiv Detail & Related papers (2024-03-29T16:13:31Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses drawn from a wide range of real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations better reveal how comprehensively language models grasp language, particularly their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- Active Learning Principles for In-Context Learning with Large Language Models [65.09970281795769]
This paper investigates how Active Learning algorithms can serve as effective demonstration selection methods for in-context learning.
We show that in-context example selection through AL prioritizes high-quality examples that exhibit low uncertainty and bear similarity to the test examples.
arXiv Detail & Related papers (2023-05-23T17:16:04Z)
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
- Learning to Reuse Distractors to support Multiple Choice Question Generation in Education [19.408786425460498]
This paper studies how a large existing set of manually created answers and distractors can be leveraged to help teachers in creating new multiple choice questions (MCQs).
We built several data-driven models based on context-aware question and distractor representations, and compared them with static feature-based models.
Both automatic and human evaluations indicate that context-aware models consistently outperform a static feature-based approach.
arXiv Detail & Related papers (2022-10-25T12:48:56Z)