UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models
- URL: http://arxiv.org/abs/2406.18966v3
- Date: Fri, 23 Aug 2024 00:14:51 GMT
- Title: UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models
- Authors: Siyuan Wu, Yue Huang, Chujie Gao, Dongping Chen, Qihui Zhang, Yao Wan, Tianyi Zhou, Xiangliang Zhang, Jianfeng Gao, Chaowei Xiao, Lichao Sun
- Abstract summary: UniGen is a comprehensive framework designed to produce diverse, accurate, and highly controllable datasets.
To augment data diversity, UniGen incorporates an attribute-guided generation module and a group checking feature.
Extensive experiments demonstrate the superior quality of data generated by UniGen.
- Score: 88.16197692794707
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Despite this, challenges remain in the areas of generalization, controllability, diversity, and truthfulness within the existing generative frameworks. To address these challenges, this paper presents UniGen, a comprehensive LLM-powered framework designed to produce diverse, accurate, and highly controllable datasets. UniGen is adaptable, supporting all types of text datasets and enhancing the generative process through innovative mechanisms. To augment data diversity, UniGen incorporates an attribute-guided generation module and a group checking feature. For accuracy, it employs a code-based mathematical assessment for label verification alongside a retrieval-augmented generation technique for factual validation. The framework also allows for user-specified constraints, enabling customization of the data generation process to suit particular requirements. Extensive experiments demonstrate the superior quality of data generated by UniGen, and each module within UniGen plays a critical role in this enhancement. Additionally, UniGen is applied in two practical scenarios: benchmarking LLMs and data augmentation. The results indicate that UniGen effectively supports dynamic and evolving benchmarking, and that data augmentation improves LLM capabilities in various domains, including agent-oriented abilities and reasoning skills.
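For intuition, here is a minimal Python sketch of the pipeline stages named in the abstract (attribute-guided generation with user constraints, group checking, and code-based label verification). The prompts, attribute pool, and function names are illustrative assumptions, not UniGen's actual API.

```python
# Minimal sketch of the stages described in the UniGen abstract.
# All prompts and names below are illustrative assumptions.
import random

ATTRIBUTES = {"topic": ["finance", "biology"], "difficulty": ["easy", "hard"]}

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in any LLM client here")

def attribute_guided_generate(n: int, constraints: str) -> list[str]:
    # Sample attribute combinations to steer generation toward diversity,
    # threading user-specified constraints into every prompt.
    samples = []
    for _ in range(n):
        attrs = {k: random.choice(v) for k, v in ATTRIBUTES.items()}
        samples.append(llm(f"Generate one dataset example with attributes "
                           f"{attrs}, satisfying: {constraints}"))
    return samples

def group_check(samples: list[str]) -> list[str]:
    # Show a group of samples together and keep only a mutually diverse set.
    listing = "\n".join(f"{i}: {s}" for i, s in enumerate(samples))
    reply = llm("Return the comma-separated indices of mutually diverse "
                f"examples:\n{listing}")
    return [samples[int(i)] for i in reply.split(",")]

def verify_math_label(question: str, label: str) -> bool:
    # Code-based assessment: have the LLM write a solver, execute it,
    # and compare its result against the proposed label.
    code = llm("Write a Python function solve() that returns the numeric "
               f"answer to: {question}")
    scope: dict = {}
    try:
        exec(code, scope)  # in practice, run inside a sandbox
        return abs(float(scope["solve"]()) - float(label)) < 1e-6
    except Exception:
        return False
```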
Related papers
- Generative Fuzzy System for Sequence Generation [16.20988290308979]
We apply the fuzzy system, a classical modeling method that combines data-driven and knowledge-driven mechanisms, to generative tasks.
We propose an end-to-end GenFS-based model for sequence generation, called FuzzyS2S.
A series of experiments was conducted on 12 datasets covering three distinct categories of generative tasks.
arXiv Detail & Related papers (2024-11-21T06:03:25Z)
- G-NeuroDAVIS: A Neural Network model for generalized embedding, data visualization and sample generation [0.0]
A novel generative model, called G-NeuroDAVIS, is capable of visualizing high-dimensional data through a generalized embedding.
G-NeuroDAVIS can be trained in both supervised and unsupervised settings.
arXiv Detail & Related papers (2024-10-18T07:14:08Z)
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods such as supervised fine-tuning (SFT) and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
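A minimal sketch of the two steps above, under assumed record schemas; the field names, scoring scheme, and deduplication heuristic are illustrative, not the paper's.

```python
# Sketch: (1) unify heterogeneous feedback into one supervision format,
# (2) keep a high-quality, diverse subset. Schemas are assumptions.
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    response: str
    score: float  # unified quality signal in [0, 1]

def unify(ratings: list[dict], comparisons: list[dict]) -> list[Example]:
    unified = [Example(r["prompt"], r["response"], r["stars"] / 5.0)
               for r in ratings]
    for c in comparisons:  # pairwise preference -> winner gets a high score
        unified.append(Example(c["prompt"], c["chosen"], 0.9))
        unified.append(Example(c["prompt"], c["rejected"], 0.1))
    return unified

def select_subset(data: list[Example], k: int) -> list[Example]:
    # Greedy: rank by quality, then drop near-duplicate prompts for diversity.
    seen: set[str] = set()
    picked: list[Example] = []
    for ex in sorted(data, key=lambda e: e.score, reverse=True):
        key = ex.prompt.lower()[:40]  # crude prompt-prefix dedup
        if key not in seen:
            seen.add(key)
            picked.append(ex)
        if len(picked) == k:
            break
    return picked
```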
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
- Iterative Data Generation with Large Language Models for Aspect-based Sentiment Analysis [39.57537769578304]
We propose a systematic Iterative Data Generation framework, namely IDG, to boost the performance of ABSA.
The core of IDG is to exploit the instruction-following, in-context learning, and self-reflection abilities of LLMs to iteratively generate more fluent and diverse pseudo-labeled data.
IDG brings consistent and significant performance gains across five baseline ABSA models.
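A minimal sketch of such a generate-and-reflect loop; the prompts, pool size, and acceptance rule are assumptions, not IDG's actual procedure.

```python
# Sketch of an iterative generate-and-reflect loop for pseudo-labeled data.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in any LLM client")

def generate_absa_data(seed_examples: list[str], rounds: int = 3) -> list[str]:
    pool = list(seed_examples)
    for _ in range(rounds):
        # In-context generation: show recent pool items, ask for a new sample.
        demo = "\n".join(pool[-5:])
        candidate = llm("Given these aspect-sentiment examples:\n"
                        f"{demo}\nWrite one new, fluent, diverse example.")
        # Self-reflection: the model accepts or rejects its own output.
        verdict = llm("Is this example well-formed and correctly labeled? "
                      f"Answer yes/no.\n{candidate}")
        if verdict.strip().lower().startswith("yes"):
            pool.append(candidate)
    return pool
```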
arXiv Detail & Related papers (2024-06-29T07:00:37Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
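One common way to realize distribution-level precision and recall is embedding-space k-NN support estimation; the sketch below follows that recipe and may differ from the paper's exact estimator.

```python
# Sketch: precision/recall between reference and generated text embeddings,
# via k-NN support balls. The embeddings here are random stand-ins.
import numpy as np

def knn_radius(points: np.ndarray, k: int = 3) -> np.ndarray:
    # Distance from each point to its k-th nearest neighbour (index 0 = self).
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    return np.sort(d, axis=1)[:, k]

def coverage(queries: np.ndarray, support: np.ndarray,
             radii: np.ndarray) -> float:
    # Fraction of queries falling inside at least one support ball.
    d = np.linalg.norm(queries[:, None] - support[None, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

real = np.random.randn(200, 8)  # stand-in for reference-text embeddings
fake = np.random.randn(200, 8)  # stand-in for model-sample embeddings
precision = coverage(fake, real, knn_radius(real))  # quality of samples
recall = coverage(real, fake, knn_radius(fake))     # diversity of samples
```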
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
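An illustrative sketch (not CLLM's actual pipeline) of LLM-based tabular augmentation followed by a simple curation filter; all names, prompts, and thresholds are assumptions.

```python
# Sketch: few-shot prompt an LLM for new tabular rows, then curate by a
# confidence score from a model trained on the small real set.
import csv
import io
from typing import Callable

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in any LLM client")

def synthesize_rows(real_rows: list[dict], n: int) -> list[dict]:
    header = list(real_rows[0].keys())
    demo = "\n".join(",".join(str(r[c]) for c in header)
                     for r in real_rows[:5])
    reply = llm(f"Columns: {header}. Example rows:\n{demo}\n"
                f"Generate {n} plausible new CSV rows.")
    return list(csv.DictReader(io.StringIO(reply), fieldnames=header))

def curate(rows: list[dict], confidence: Callable[[dict], float]) -> list[dict]:
    # Keep rows a downstream classifier is confident about (quality proxy).
    return [r for r in rows if confidence(r) > 0.8]
```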
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
- UniGen: A Unified Generative Framework for Retrieval and Question Answering with Large Language Models [22.457013726785295]
We present UniGen, a Unified Generative framework for retrieval and question answering.
UniGen integrates both tasks into a single generative model leveraging the capabilities of large language models.
arXiv Detail & Related papers (2023-12-18T09:13:41Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making code more structured and readable improves downstream code generation performance.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B on the cleaned data improves performance by up to 30% over fine-tuning on the original dataset.
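A sketch of an LLM-assisted cleaning pass in this spirit: rewrite a program for readability, then keep the rewrite only if it still passes the original test cases. The prompt and test harness are assumptions, not the paper's pipeline.

```python
# Sketch: behavior-checked LLM rewriting of training programs.
import subprocess
import sys

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in any LLM client")

def passes(src: str, tests: list[tuple[str, str]]) -> bool:
    # Run the program on each (stdin, expected stdout) pair.
    for stdin, expected in tests:
        result = subprocess.run([sys.executable, "-c", src], input=stdin,
                                capture_output=True, text=True, timeout=10)
        if result.stdout.strip() != expected.strip():
            return False
    return True

def clean_program(src: str, tests: list[tuple[str, str]]) -> str:
    # Ask for a behavior-preserving rewrite; keep it only if tests still pass.
    cleaned = llm("Rewrite this program with descriptive names, helper "
                  "functions, and comments, preserving behavior:\n" + src)
    return cleaned if passes(cleaned, tests) else src
```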
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- DIG-MILP: a Deep Instance Generator for Mixed-Integer Linear Programming with Feasibility Guarantee [47.11455377400096]
Mixed-integer linear programming (MILP) is an NP-hard problem central to many important industrial applications.
We present DIG-MILP, a deep generative framework based on variational auto-encoder (VAE), adept at extracting deep-level structural features from highly limited MILP data.
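For reference, a minimal VAE over fixed-length instance feature vectors (not the authors' DIG-MILP architecture; the feature dimension, layer sizes, and loss weighting are illustrative assumptions).

```python
# Sketch: a toy VAE over feature vectors extracted from MILP instances.
import torch
import torch.nn as nn

class InstanceVAE(nn.Module):
    def __init__(self, feat_dim: int = 64, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU())
        self.to_mu = nn.Linear(32, latent_dim)
        self.to_logvar = nn.Linear(32, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, feat_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z = mu + sigma * eps.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # Reconstruction error plus KL divergence to the standard normal prior.
    recon_loss = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl
```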
arXiv Detail & Related papers (2023-10-20T03:45:29Z)