STAR: Boosting Low-Resource Information Extraction by Structure-to-Text
Data Generation with Large Language Models
- URL: http://arxiv.org/abs/2305.15090v3
- Date: Tue, 20 Feb 2024 20:00:21 GMT
- Title: STAR: Boosting Low-Resource Information Extraction by Structure-to-Text
Data Generation with Large Language Models
- Authors: Mingyu Derek Ma, Xiaoxuan Wang, Po-Nien Kung, P. Jeffrey Brantingham,
Nanyun Peng, Wei Wang
- Abstract summary: STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
- Score: 56.27786433792638
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Information extraction tasks such as event extraction require an in-depth
understanding of the output structure and sub-task dependencies. They heavily
rely on task-specific training data in the form of (passage, target structure)
pairs to obtain reasonable performance. However, obtaining such data through
human annotation is costly, leading to a pressing need for low-resource
information extraction approaches that require minimal human labeling for
real-world applications. Fine-tuning supervised models with synthesized
training data would be a generalizable method, but the existing data generation
methods either still rely on large-scale ground-truth data or cannot be applied
to complicated IE tasks due to their poor performance. To address these
challenges, we propose STAR, a data generation method that leverages Large
Language Models (LLMs) to synthesize data instances given limited seed
demonstrations, thereby boosting low-resource information extraction
performance. Our approach involves generating target structures (Y) followed by
generating passages (X), all accomplished with the aid of LLMs. We design
fine-grained step-by-step instructions to obtain the initial data instances. We
further reduce errors and improve data quality through self-reflection error
identification and self-refinement with iterative revision. Our experiments
show that the data generated by STAR significantly improve the performance of
low-resource event extraction and relation extraction tasks, even surpassing
the effectiveness of human-curated data. Human assessment of the data quality
shows that STAR-generated data exhibits higher passage quality and aligns
better with the task definitions than human-curated data.
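The pipeline described in the abstract can be sketched as code: generate the target structure Y first, then a passage X that realizes it, then apply self-reflection error identification with iterative revision. This is a minimal illustration only; the prompts and the `call_llm` stub are hypothetical stand-ins, not the paper's actual prompts or models.

```python
# Sketch of a STAR-style structure-to-text generation loop (assumed prompts).

def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; canned replies keep the sketch runnable.
    if prompt.startswith("STRUCTURE:"):
        return ('{"event_type": "Attack", "trigger": "raided", '
                '"arguments": {"Agent": "police", "Target": "warehouse"}}')
    if prompt.startswith("PASSAGE:"):
        return "Police raided the warehouse at dawn."
    if prompt.startswith("REFLECT:"):
        return "NO_ERRORS"  # a real model would list mismatches here
    return ""

def generate_instance(task_definition: str, max_revisions: int = 2):
    """Return one synthetic (passage X, target structure Y) training pair."""
    # Step 1: synthesize the target structure Y from the task definition.
    structure = call_llm(f"STRUCTURE: generate a target structure for {task_definition}")
    # Step 2: generate a passage X that expresses the structure.
    passage = call_llm(f"PASSAGE: write a passage expressing {structure}")
    # Step 3: self-reflection error identification plus iterative revision.
    for _ in range(max_revisions):
        critique = call_llm(f"REFLECT: list errors in '{passage}' w.r.t. {structure}")
        if critique == "NO_ERRORS":
            break
        passage = call_llm(f"PASSAGE: revise '{passage}' to fix {critique}")
    return passage, structure
```

Generating Y before X lets the structure constrain the passage, so every synthesized pair is label-consistent by construction; the reflection loop then catches passages that drift from the structure.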
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond [38.89457061559469]
We propose an innovative methodology that automates dataset creation with negligible cost and high efficiency.
We provide open-source software that incorporates existing methods for label error detection, robust learning under noisy and biased data.
We design three benchmark datasets focused on label noise detection, label noise learning, and class-imbalanced learning.
arXiv Detail & Related papers (2024-08-21T04:45:12Z)
- Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
- Combining Public Human Activity Recognition Datasets to Mitigate Labeled Data Scarcity [1.274578243851308]
We propose a novel strategy to combine publicly available datasets with the goal of learning a generalized HAR model.
Our experimental evaluation, which includes experimenting with different state-of-the-art neural network architectures, shows that combining public datasets can significantly reduce the number of labeled samples.
arXiv Detail & Related papers (2023-06-23T18:51:22Z)
- Semi-supervised Relation Extraction via Data Augmentation and Consistency-training [2.2209333405427585]
Semi-supervised learning methods aim to leverage unlabelled data in addition to learning from limited labelled data points.
Recently, strong data augmentation combined with consistency-based semi-supervised learning methods have advanced the state of the art in several SSL tasks.
In this work, we leverage the recent advances in controlled text generation to perform high quality data augmentation for the Relation extraction task.
arXiv Detail & Related papers (2023-06-16T19:45:42Z)
- Gradient Imitation Reinforcement Learning for General Low-Resource Information Extraction [80.64518530825801]
We develop a Gradient Imitation Reinforcement Learning (GIRL) method to encourage pseudo-labeled data to imitate the gradient descent direction on labeled data.
We also leverage GIRL to solve all IE sub-tasks (named entity recognition, relation extraction, and event extraction) in low-resource settings.
arXiv Detail & Related papers (2022-11-11T05:37:19Z)
- Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
- DQI: Measuring Data Quality in NLP [22.54066527822898]
We introduce a generic formula for Data Quality Index (DQI) to help dataset creators create datasets free of unwanted biases.
We show that models trained on the renovated SNLI dataset generalize better to out of distribution tasks.
arXiv Detail & Related papers (2020-05-02T12:34:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.