Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and
the Case of Information Extraction
- URL: http://arxiv.org/abs/2303.04132v2
- Date: Sun, 29 Oct 2023 14:24:46 GMT
- Authors: Martin Josifoski, Marija Sakota, Maxime Peyrard, Robert West
- Abstract summary: This work shows that useful data can be synthetically generated even for tasks that cannot be solved directly by large language models.
We synthetically generate a dataset of 1.8M data points and establish its superior quality compared to existing datasets in a human evaluation.
- Score: 28.51694365908817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have great potential for synthetic data
generation. This work shows that useful data can be synthetically generated
even for tasks that cannot be solved directly by LLMs: for problems with
structured outputs, it is possible to prompt an LLM to perform the task in the
reverse direction, by generating plausible input text for a target output
structure. Leveraging this asymmetry in task difficulty makes it possible to
produce large-scale, high-quality data for complex tasks. We demonstrate the
effectiveness of this approach on closed information extraction, where
collecting ground-truth data is challenging, and no satisfactory dataset exists
to date. We synthetically generate a dataset of 1.8M data points, establish its
superior quality compared to existing datasets in a human evaluation, and use
it to finetune small models (220M and 770M parameters), termed SynthIE, that
outperform the prior state of the art (with equal model size) by a substantial
margin of 57 absolute points in micro-F1 and 79 points in macro-F1. Code, data,
and models are available at https://github.com/epfl-dlab/SynthIE.
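The reverse-direction idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the prompt wording, the function names (`linearize`, `build_prompt`, `synthesize_example`), and the `fake_llm` stand-in are hypothetical, chosen only to show how a target set of triplets can be turned into an (input text, output structure) training pair. In practice `generate` would wrap a call to a real LLM.

```python
from typing import Callable, Dict, List, Tuple

# A fact is a (subject, relation, object) triplet; this is the structured
# output that a closed information extraction model must predict.
Triplet = Tuple[str, str, str]


def linearize(triplets: List[Triplet]) -> str:
    """Render the target triplets as a flat string for the prompt."""
    return "; ".join(f"({s}, {r}, {o})" for s, r, o in triplets)


def build_prompt(triplets: List[Triplet]) -> str:
    """Ask for text that expresses the given facts (the reverse task).

    The wording is an assumption made for illustration only.
    """
    return (
        "Write a short, fluent passage that expresses exactly the "
        "following facts and no others.\n"
        f"Facts: {linearize(triplets)}\n"
        "Text:"
    )


def synthesize_example(
    triplets: List[Triplet], generate: Callable[[str], str]
) -> Dict[str, object]:
    """Create one synthetic training pair from a target output structure."""
    text = generate(build_prompt(triplets)).strip()
    # The generated text becomes the model input; the triplets we started
    # from become the label, so no manual annotation is involved.
    return {"text": text, "triplets": list(triplets)}


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns a fixed completion."""
    return " Marie Curie was born in Warsaw. "


target = [("Marie Curie", "place of birth", "Warsaw")]
example = synthesize_example(target, fake_llm)
print(example)
```

The design point is that the label is fixed before any text exists: the model only has to write text for known facts, which is the easier direction, and the resulting pair can then be used to train an extractor in the harder direction.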
Related papers
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
(2024-06-18T08:38:59Z)
- SUMIE: A Synthetic Benchmark for Incremental Entity Summarization [6.149024468471498]
No existing dataset adequately tests how well language models can incrementally update entity summaries.
We introduce SUMIE, a fully synthetic dataset designed to expose real-world IES challenges.
This dataset effectively highlights problems like incorrect entity association and incomplete information presentation.
(2024-06-07T16:49:21Z)
- TarGEN: Targeted Data Generation with Large Language Models [54.1093098278564]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a method known as self-correction, which enables LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
(2023-10-27T03:32:17Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks the distribution gap between synthetic and real data.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
(2023-10-20T17:14:25Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
(2023-08-25T01:41:04Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
(2023-05-16T07:30:29Z)
- PromDA: Prompt-based Data Augmentation for Low-Resource NLU Tasks [61.51515750218049]
This paper focuses on data augmentation for low-resource Natural Language Understanding (NLU) tasks.
We propose the Prompt-based Data Augmentation model (PromDA), which trains only a small-scale soft prompt.
PromDA generates synthetic data via two different views and filters out the low-quality data using NLU models.
(2022-02-25T05:09:27Z)