Related papers: Intermediate Training on Question Answering Datasets Improves Generative Data Augmentation

URL: http://arxiv.org/abs/2205.12604v1
Date: Wed, 25 May 2022 09:28:21 GMT
Title: Intermediate Training on Question Answering Datasets Improves Generative Data Augmentation
Authors: Dheeraj Mekala, Tu Vu, Jingbo Shang
Abstract summary: We improve generative data augmentation by formulating data generation as a context generation task. We cast downstream tasks into question answering format and adapt the fine-tuned context generators to the target task domain. We demonstrate substantial improvements in performance in few-shot and zero-shot settings.
Score: 32.83012699501051
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Manually annotating datasets requires domain experts to read through many documents and carefully label them, which is often expensive. Recently, pre-trained generative language models (GLMs) have demonstrated exceptional abilities in generating text, which motivates leveraging them for generative data augmentation. We improve generative data augmentation by formulating data generation as a context generation task and using question answering (QA) datasets for intermediate training. Specifically, we view QA more as a format than as a task and train GLMs as context generators for a given question and its respective answer. Then, we cast downstream tasks into question answering format and adapt the fine-tuned context generators to the target task domain. Finally, we use the fine-tuned GLM to generate relevant contexts, which are further used as synthetic training data for their corresponding tasks. We perform extensive experiments, case studies, and ablation studies on multiple sentiment and topic classification datasets and demonstrate substantial improvements in performance in few-shot and zero-shot settings. Remarkably, on the SST-2 dataset, intermediate training on the SocialIQA dataset achieves an improvement of 40% on Macro-F1 score. Through thorough analyses, we observe that QA datasets that require high-level reasoning abilities (e.g., abstractive and common-sense QA datasets) tend to give the best boost in performance in both few-shot and zero-shot settings.
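
The step of casting a classification task into question answering format can be illustrated with a minimal Python sketch. It is not the authors' implementation: the question wording, the label-to-answer mapping, and the `question: ... answer: ... context:` prompt layout are all hypothetical choices made here for illustration. The idea shown is only that each labeled text becomes a (question, answer) input whose target is the text itself, so a generative model can be trained to produce contexts.

```python
from dataclasses import dataclass

# Hypothetical template and label names (assumptions, not taken from the paper).
QUESTION = "Is the sentiment of this review positive or negative?"
LABEL_TO_ANSWER = {0: "negative", 1: "positive"}


@dataclass
class QAExample:
    question: str
    answer: str
    context: str  # the text the context generator should learn to produce


def cast_to_qa(text: str, label: int) -> QAExample:
    """Recast one labeled classification example as a QA triple."""
    return QAExample(QUESTION, LABEL_TO_ANSWER[label], text)


def generator_prompt(question: str, answer: str) -> str:
    """Build the generator input; this layout is an assumed format."""
    return f"question: {question} answer: {answer} context:"


def training_pair(example: QAExample) -> tuple[str, str]:
    """Return (source, target) strings for fine-tuning a context generator."""
    return generator_prompt(example.question, example.answer), example.context


def label_prompts() -> dict[int, str]:
    """One prompt per label, usable to sample synthetic texts for that label."""
    return {
        label: generator_prompt(QUESTION, answer)
        for label, answer in LABEL_TO_ANSWER.items()
    }


if __name__ == "__main__":
    source, target = training_pair(cast_to_qa("A warm, funny film.", 1))
    print(source)
    print(target)
```

Under these assumptions, texts sampled from a fine-tuned generator for each prompt in `label_prompts()` would inherit the label of the prompt they were generated from, which is how they could serve as synthetic training data.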

Related papers

TARGET: Benchmarking Table Retrieval for Generative Tasks [7.379012456053551]
TARGET is a benchmark for evaluating TAble Retrieval for GEnerative Tasks. We analyze the retrieval performance of different retrievers in isolation, as well as their impact on downstream tasks. We find that dense embedding-based retrievers far outperform a BM25 baseline, which is less effective than it is for retrieval over unstructured text.
arXiv Detail & Related papers (2025-05-14T19:39:46Z)
Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable. We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. Existing methods typically utilize the Self-Instruct framework to generate instruction-tuning data that improves long-context capability. We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human
arXiv Detail & Related papers (2024-09-03T13:30:00Z)
Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning [1.6570772838074355]
Multimodal large language models (MLLMs) exhibit great potential for chart question answering (CQA). Recent efforts primarily focus on scaling up training datasets through data collection and synthesis. We propose a visualization-referenced instruction tuning approach to guide the training dataset enhancement and model development.
arXiv Detail & Related papers (2024-07-29T17:04:34Z)
Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA [9.659820850719413]
We leverage Large Language Models (LLMs), which have been shown to have strong reasoning ability, as an automatic data annotator. The key innovation in our method lies in the Synthesize Step-by-Step strategy. We significantly enhance the chart VQA models, achieving state-of-the-art accuracy on the ChartQA and PlotQA datasets.
arXiv Detail & Related papers (2024-03-25T03:02:27Z)
Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the number of seed data samples available for data augmentation is very small. We propose a novel method that augments training data by incorporating a wealth of examples from other datasets. This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
QASnowball: An Iterative Bootstrapping Framework for High-Quality Question-Answering Data Generation [67.27999343730224]
We introduce an iterative bootstrapping framework for QA data augmentation (named QASnowball). QASnowball can iteratively generate large-scale high-quality QA data based on a seed set of supervised examples. We conduct experiments in the high-resource English scenario and the medium-resource Chinese scenario, and the experimental results show that the data generated by QASnowball can facilitate QA models.
arXiv Detail & Related papers (2023-09-19T05:20:36Z)
PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages. We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts. We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies [74.01792675564218]
We develop a data augmentation framework based on ensembling retriever models that captures relevant text segments from unlabeled policy documents. To improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise reduction filter models. Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10% F1) and achieve a new state-of-the-art F1 score of 50%.
arXiv Detail & Related papers (2022-04-19T15:45:23Z)
Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG). It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains. Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.