Synthetic Context Generation for Question Generation
- URL: http://arxiv.org/abs/2406.13188v1
- Date: Wed, 19 Jun 2024 03:37:52 GMT
- Title: Synthetic Context Generation for Question Generation
- Authors: Naiming Liu, Zichao Wang, Richard Baraniuk
- Abstract summary: This paper investigates training QG models using synthetic contexts generated by large language models.
We find that contexts are essential for QG tasks, even if they are synthetic.
- Score: 6.226609932118123
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite rapid advancements in large language models (LLMs), question generation (QG) remains a challenging problem due to its complicated process, open-ended nature, and the diverse settings in which question generation occurs. A common approach to address these challenges involves fine-tuning smaller, custom models using datasets containing background context, question, and answer. However, obtaining suitable domain-specific datasets with appropriate context is often more difficult than acquiring question-answer pairs. In this paper, we investigate training QG models using synthetic contexts generated by LLMs from readily available question-answer pairs. We conduct a comprehensive study to answer critical research questions related to the performance of models trained on synthetic contexts and their potential impact on QG research and applications. Our empirical results reveal that: 1) contexts are essential for QG tasks, even if they are synthetic; 2) fine-tuning smaller language models can achieve better performance than prompting larger language models; and 3) synthetic contexts and real contexts achieve comparable performance. These findings highlight the effectiveness of synthetic contexts in QG and pave the way for future advancements in the field.
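As an illustration of the pipeline the abstract describes, the sketch below generates a synthetic context for each available question-answer pair with an LLM and turns the result into (context, answer) → question training examples for a smaller QG model. The prompt wording, the `call_llm` helper, and the example formatting are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of synthetic-context generation for QG training data.
# `call_llm` is a hypothetical helper that sends a prompt to any LLM API
# and returns the generated text; swap in your provider's client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

CONTEXT_PROMPT = (
    "Write a short background passage from which the question below could be\n"
    "asked and in which the answer can be found.\n"
    "Question: {question}\nAnswer: {answer}\nPassage:"
)

def build_qg_examples(qa_pairs):
    """Turn (question, answer) pairs into (input, target) examples where the
    input is a synthetic context plus the answer and the target is the question."""
    examples = []
    for question, answer in qa_pairs:
        context = call_llm(CONTEXT_PROMPT.format(question=question, answer=answer))
        examples.append({
            "input": f"generate question: context: {context} answer: {answer}",
            "target": question,
        })
    return examples

# The resulting examples could then be used to fine-tune a smaller seq2seq
# model (e.g., a T5-style model) on the question-generation objective.
```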
Related papers
- Understanding Synthetic Context Extension via Retrieval Heads [51.8869530817334]
We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning.
We find that models trained on synthetic data fall short of those trained on real data, but, surprisingly, the mismatch can be interpreted.
Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
arXiv Detail & Related papers (2024-10-29T17:55:00Z) - Context Matters: An Empirical Study of the Impact of Contextual Information in Temporal Question Answering Systems [7.393290178125003]
This paper empirically examines the robustness of temporal question-answering systems trained on various context types.
We show that training with a mix of these contexts enhances model robustness and accuracy.
We introduce two new context-rich TQA datasets, ContextAQA and ContextTQE, and provide comprehensive evaluations and guidelines for training robust TQA models.
arXiv Detail & Related papers (2024-06-27T21:31:30Z) - Syn-QA2: Evaluating False Assumptions in Long-tail Questions with Synthetic QA Datasets [7.52684798377727]
We introduce Syn-(QA)$^2$, a set of two synthetically generated question-answering (QA) datasets.
We find that false assumptions in QA are challenging, echoing the findings of prior work.
The detection task is more challenging with long-tail questions compared to naturally occurring questions.
arXiv Detail & Related papers (2024-03-18T18:01:26Z) - Qsnail: A Questionnaire Dataset for Sequential Question Generation [76.616068047362]
We present the first dataset specifically constructed for the questionnaire generation task, which comprises 13,168 human-written questionnaires.
We conduct experiments on Qsnail, and the results reveal that retrieval models and traditional generative models do not fully align with the given research topic and intents.
Despite enhancements through the chain-of-thought prompt and finetuning, questionnaires generated by language models still fall short of human-written questionnaires.
arXiv Detail & Related papers (2024-02-22T04:14:10Z) - Enhancing Textbook Question Answering Task with Large Language Models
and Retrieval Augmented Generation [3.948068081583197]
This paper proposes a methodology that handles the out-of-domain scenario in textbook question answering (TQA).
Through supervised fine-tuning of the LLM Llama-2 and the incorporation of RAG, our architecture outperforms the baseline, achieving a 4.12% accuracy improvement on the validation set and 9.84% on the test set for non-diagram multiple-choice questions.
arXiv Detail & Related papers (2024-02-05T11:58:56Z) - Context Matters: Pushing the Boundaries of Open-Ended Answer Generation with Graph-Structured Knowledge Context [4.1229332722825]
This paper introduces a novel framework that combines graph-driven context retrieval with knowledge-graph-based enhancement.
We conduct experiments on various Large Language Models (LLMs) with different parameter sizes to evaluate their ability to ground knowledge and determine factual accuracy in answers to open-ended questions.
Our methodology GraphContextGen consistently outperforms dominant text-based retrieval systems, demonstrating its robustness and adaptability to a larger number of use cases.
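As a rough illustration of what graph-driven context retrieval can look like in practice, the sketch below selects knowledge-graph triples whose entities overlap with the question and serializes them as context for the LLM prompt. The triple format and overlap scoring are placeholder assumptions, not the GraphContextGen method itself.

```python
# Illustrative graph-driven context retrieval: pick KG triples that mention
# entities appearing in the question and linearize them into a prompt context.

def retrieve_triples(question: str, triples, top_k: int = 5):
    """Score (head, relation, tail) triples by simple token overlap with the question."""
    q_tokens = set(question.lower().split())
    scored = []
    for head, relation, tail in triples:
        overlap = len(q_tokens & set(f"{head} {tail}".lower().split()))
        if overlap:
            scored.append((overlap, (head, relation, tail)))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [triple for _, triple in scored[:top_k]]

def build_prompt(question: str, triples) -> str:
    """Serialize the retrieved triples as a fact list preceding the question."""
    facts = "\n".join(f"- {h} {r} {t}" for h, r, t in retrieve_triples(question, triples))
    return f"Known facts:\n{facts}\n\nAnswer the question: {question}"
```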
arXiv Detail & Related papers (2024-01-23T11:25:34Z) - Learning to Filter Context for Retrieval-Augmented Generation [75.18946584853316]
Generation models are required to generate outputs given partially or entirely irrelevant passages.
FILCO identifies useful context based on lexical and information-theoretic approaches.
It trains context filtering models that can filter retrieved contexts at test time.
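As a loose illustration of lexical context filtering (one of the signals this line of work builds on), the snippet below keeps only retrieved sentences with sufficient word overlap with the query before they are passed to the generator; the threshold and tokenization are placeholder choices, not the paper's trained filtering model.

```python
# Lexical context filtering sketch: keep retrieved sentences whose unigram
# overlap with the query exceeds a threshold.

def filter_context(query: str, sentences, min_overlap: float = 0.2):
    q_tokens = set(query.lower().split())
    kept = []
    for sentence in sentences:
        s_tokens = set(sentence.lower().split())
        if not s_tokens:
            continue
        overlap = len(q_tokens & s_tokens) / len(s_tokens)
        if overlap >= min_overlap:
            kept.append(sentence)
    return kept
```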
arXiv Detail & Related papers (2023-11-14T18:41:54Z) - Evaluating the Capabilities of Multi-modal Reasoning Models with
Synthetic Task Data [0.0]
We leverage advances in high-resolution text-to-image generation to develop a framework for generating evaluation data for multi-modal reasoning tasks.
We apply this framework to generate context-dependent anomaly data, creating a synthetic dataset on a challenging task.
We demonstrate that while the task is tractable, the model performs significantly worse on the context-dependent anomaly detection task than on standard VQA tasks.
arXiv Detail & Related papers (2023-06-01T20:56:34Z) - QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question
Answering [122.84513233992422]
We propose a new model, QA-GNN, which addresses the problem of answering questions using knowledge from pre-trained language models (LMs) and knowledge graphs (KGs).
We show its improvement over existing LM and LM+KG models, as well as its capability to perform interpretable and structured reasoning.
arXiv Detail & Related papers (2021-04-13T17:32:51Z) - COSMO: Conditional SEQ2SEQ-based Mixture Model for Zero-Shot Commonsense
Question Answering [50.65816570279115]
Identifying the implicit causes and effects of a social context is the key capability that enables machines to perform commonsense reasoning.
Current approaches in this realm lack the ability to perform commonsense reasoning upon facing an unseen situation.
We present the Conditional SEQ2SEQ-based Mixture model (COSMO), which enables dynamic and diverse content generation.
arXiv Detail & Related papers (2020-11-02T07:08:19Z) - Understanding Unnatural Questions Improves Reasoning over Text [54.235828149899625]
Complex question answering (CQA) over raw text is a challenging task.
Learning an effective CQA model requires large amounts of human-annotated data.
We address the challenge of learning a high-quality programmer (parser) by projecting natural human-generated questions into unnatural machine-generated questions.
arXiv Detail & Related papers (2020-10-19T10:22:16Z)