Comparison of Large Language Models for Generating Contextually Relevant Questions
- URL: http://arxiv.org/abs/2407.20578v2
- Date: Sun, 15 Sep 2024 07:23:10 GMT
- Title: Comparison of Large Language Models for Generating Contextually Relevant Questions
- Authors: Ivo Lodovico Molina, Valdemar Švábenský, Tsubasa Minematsu, Li Chen, Fumiya Okubo, Atsushi Shimada
- Abstract summary: GPT-3.5, Llama 2-Chat 13B, and Flan T5 XXL are compared in their ability to create questions from university slide text without fine-tuning.
Results indicate that GPT-3.5 and Llama 2-Chat 13B outperform Flan T5 XXL by a small margin, particularly in terms of clarity and question-answer alignment.
- Score: 6.080820450677854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study explores the effectiveness of Large Language Models (LLMs) for Automatic Question Generation in educational settings. Three LLMs are compared in their ability to create questions from university slide text without fine-tuning. Questions were obtained in a two-step pipeline: first, answer phrases were extracted from slides using Llama 2-Chat 13B; then, the three models generated questions for each answer. To analyze whether the questions would be suitable in educational applications for students, a survey was conducted with 46 students who evaluated a total of 246 questions across five metrics: clarity, relevance, difficulty, slide relation, and question-answer alignment. Results indicate that GPT-3.5 and Llama 2-Chat 13B outperform Flan T5 XXL by a small margin, particularly in terms of clarity and question-answer alignment. GPT-3.5 especially excels at tailoring questions to match the input answers. The contribution of this research is the analysis of the capacity of LLMs for Automatic Question Generation in education.
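The abstract describes a two-step generation pipeline: answer-phrase extraction with Llama 2-Chat 13B, then question generation by each of the three models. The following is a minimal sketch of that pipeline under stated assumptions: `query_llm` is a hypothetical stand-in for whatever inference API serves each model, and the prompt wording is illustrative rather than the authors' own.

```python
# Hedged sketch of the two-step Automatic Question Generation pipeline
# described in the abstract. `query_llm` is a hypothetical placeholder,
# not the authors' implementation; prompts are illustrative only.

from typing import Dict, List

MODELS = ["GPT-3.5", "Llama 2-Chat 13B", "Flan T5 XXL"]


def query_llm(model: str, prompt: str) -> str:
    """Placeholder for a call to the chosen model's inference endpoint."""
    raise NotImplementedError("Wire this to your LLM provider of choice.")


def extract_answer_phrases(slide_text: str) -> List[str]:
    """Step 1: use Llama 2-Chat 13B to pull candidate answer phrases from a slide."""
    prompt = (
        "Extract short answer phrases suitable for quiz questions "
        f"from the following slide text:\n{slide_text}"
    )
    response = query_llm("Llama 2-Chat 13B", prompt)
    return [line.strip() for line in response.splitlines() if line.strip()]


def generate_questions(slide_text: str) -> Dict[str, List[str]]:
    """Step 2: have each model generate one question per extracted answer."""
    questions: Dict[str, List[str]] = {model: [] for model in MODELS}
    for answer in extract_answer_phrases(slide_text):
        for model in MODELS:
            prompt = (
                f"Slide text:\n{slide_text}\n\n"
                f"Write a question whose answer is: {answer}"
            )
            questions[model].append(query_llm(model, prompt))
    return questions
```

The generated questions would then be rated by students on the five survey metrics (clarity, relevance, difficulty, slide relation, and question-answer alignment); that evaluation step is not part of the sketch.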
Related papers
- "Did my figure do justice to the answer?" : Towards Multimodal Short Answer Grading with Feedback (MMSAF) [36.74896284581596]
We propose the Multimodal Short Answer Grading with Feedback problem along with a dataset of 2197 data points.
Our evaluations on existing Large Language Models (LLMs) over this dataset achieved an overall accuracy of 55% on the Level of Correctness labels.
As per human experts, Pixtral was more closely aligned with human judgement and values for biology, and ChatGPT for physics and chemistry.
arXiv Detail & Related papers (2024-12-27T17:33:39Z) - Crafting Interpretable Embeddings by Asking LLMs Questions [89.49960984640363]
Large language models (LLMs) have rapidly improved text embeddings for a growing array of natural-language processing tasks.
We introduce question-answering embeddings (QA-Emb), embeddings where each feature represents an answer to a yes/no question asked to an LLM.
We use QA-Emb to flexibly generate interpretable models for predicting fMRI voxel responses to language stimuli.
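As a rough illustration of the QA-Emb idea summarized above, each embedding dimension is an LLM's yes/no answer to a handcrafted question about the text. The helper `ask_yes_no` and the example questions below are assumptions for the sketch, not the paper's implementation.

```python
# Minimal sketch of a question-answering embedding: one interpretable
# feature per yes/no question posed to an LLM. `ask_yes_no` is a
# hypothetical helper supplied by the caller.

from typing import Callable, List


def qa_embedding(text: str, questions: List[str],
                 ask_yes_no: Callable[[str, str], bool]) -> List[float]:
    """Return 1.0 for each question the LLM answers 'yes', else 0.0."""
    return [1.0 if ask_yes_no(text, q) else 0.0 for q in questions]


# Illustrative feature set only; the paper uses many such questions.
questions = [
    "Does the text mention a person?",
    "Is the text about a place?",
    "Does the text describe an action?",
]
```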
arXiv Detail & Related papers (2024-05-26T22:30:29Z) - Which questions should I answer? Salience Prediction of Inquisitive Questions [118.097974193544]
We show that highly salient questions are empirically more likely to be answered in the same article.
We further validate our findings by showing that answering salient questions is an indicator of summarization quality in news.
arXiv Detail & Related papers (2024-04-16T21:33:05Z) - Explainable Multi-hop Question Generation: An End-to-End Approach without Intermediate Question Labeling [6.635572580071933]
Multi-hop question generation aims to generate complex questions that require multi-step reasoning over several documents.
Previous studies have predominantly utilized end-to-end models, wherein questions are decoded based on the representation of context documents.
This paper introduces an end-to-end question rewriting model that increases question complexity through sequential rewriting.
arXiv Detail & Related papers (2024-03-31T06:03:54Z) - Don't Just Say "I don't know"! Self-aligning Large Language Models for Responding to Unknown Questions with Explanations [70.6395572287422]
The self-alignment method is capable not only of refusing to answer but also of explaining why unknown questions are unanswerable.
We conduct disparity-driven self-curation to select qualified data for fine-tuning the LLM itself, aligning its responses to unknown questions as desired.
arXiv Detail & Related papers (2024-02-23T02:24:36Z) - Qsnail: A Questionnaire Dataset for Sequential Question Generation [76.616068047362]
We present the first dataset specifically constructed for the questionnaire generation task, which comprises 13,168 human-written questionnaires.
We conduct experiments on Qsnail, and the results reveal that retrieval models and traditional generative models do not fully align with the given research topic and intents.
Despite enhancements through chain-of-thought prompting and fine-tuning, questionnaires generated by language models still fall short of human-written ones.
arXiv Detail & Related papers (2024-02-22T04:14:10Z) - Prompt-Engineering and Transformer-based Question Generation and Evaluation [0.0]
This paper aims to find the best method to generate questions from textual data through a transformer model and prompt engineering.
The generated questions were compared against the baseline questions in the SQuAD dataset to evaluate the effectiveness of four different prompts.
arXiv Detail & Related papers (2023-10-29T01:45:30Z) - Are Large Language Models Fit For Guided Reading? [6.85316573653194]
This paper looks at the ability of large language models to participate in educational guided reading.
We evaluate their ability to generate meaningful questions from the input text, generate diverse questions and recommend part of the text that a student should re-read.
arXiv Detail & Related papers (2023-05-18T02:03:55Z) - Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (ScienceQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that chain-of-thought explanations improve the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z) - "What makes a question inquisitive?" A Study on Type-Controlled Inquisitive Question Generation [35.87102025753666]
We propose a type-controlled framework for inquisitive question generation.
We generate a variety of questions that adhere to specific types while drawing from the source texts.
We also investigate strategies for selecting a single question from a generated set.
arXiv Detail & Related papers (2022-05-17T02:05:50Z) - Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document.
We show that readers engage in a series of pragmatic strategies to seek information.
We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.