Generating Data for Symbolic Language with Large Language Models
- URL: http://arxiv.org/abs/2305.13917v1
- Date: Tue, 23 May 2023 10:44:00 GMT
- Title: Generating Data for Symbolic Language with Large Language Models
- Authors: Jiacheng Ye, Chengzu Li, Lingpeng Kong, Tao Yu
- Abstract summary: Large language models (LLMs) have been developed to generate data for natural language tasks.
In this paper, we propose SymGen which utilizes LLMs for generating various annotation-expensive symbolic language data.
We show that generated data with only a few human demonstrations can be as effective as over 10 times the amount of human-annotated data when training the task model.
- Score: 16.529863710055004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large language models (LLMs) bring not only performance but also
complexity, recent work has started to turn LLMs into data generators rather
than task inferencers, where another affordable task model is trained for
efficient deployment and inference. However, such an approach has primarily
been applied to natural language tasks and has not yet been explored for
symbolic language tasks with complex structured outputs (e.g., semantic parsing
and code generation). In this paper, we propose SymGen which utilizes LLMs for
generating various annotation-expensive symbolic language data. SymGen consists
of an informative prompt to steer generation and an agreement-based verifier to
improve data correctness. We conduct extensive experiments on six symbolic
language tasks across various settings. Compared with the LLMs, we demonstrate
that a 1%-sized task model can achieve comparable or better performance, largely
cutting inference and deployment costs. We also show that generated data with
only a few human demonstrations can be as effective as over 10 times the amount
of human-annotated data when training the task model, saving a considerable
amount of annotation effort. SymGen sheds new light on data generation for
complex tasks, and we release the code at
https://github.com/HKUNLP/SymGen.
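The abstract names an agreement-based verifier but does not spell out how it works. The sketch below is one plausible reading, not the authors' implementation: several candidate programs generated for the same input are executed, and a candidate is kept only if its execution result matches the most common result. The names `select_by_agreement`, `toy_execute`, and `min_votes` are hypothetical and introduced here for illustration only.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def select_by_agreement(
    candidates: Sequence[str],
    execute: Callable[[str], Hashable],
    min_votes: int = 2,
) -> Optional[str]:
    """Return the first candidate whose execution result is the most common one.

    Assumption: agreement is measured on execution results, not on program text.
    Returns None when no result is shared by at least `min_votes` candidates.
    """
    results = []
    for cand in candidates:
        try:
            results.append((cand, execute(cand)))
        except Exception:
            # Candidates that fail to execute cast no vote.
            continue
    if not results:
        return None
    votes = Counter(result for _, result in results)
    best_result, count = votes.most_common(1)[0]
    if count < min_votes:
        return None
    for cand, result in results:
        if result == best_result:
            return cand
    return None


def toy_execute(expr: str) -> int:
    # Toy executor (assumption): evaluates an arithmetic expression with builtins disabled.
    return eval(expr, {"__builtins__": {}}, {})


if __name__ == "__main__":
    generated = ["2 + 3", "5", "2 * 3", "1 +"]
    print(select_by_agreement(generated, toy_execute))
```

In a real pipeline `execute` would run a SQL query, a logical form, or a code snippet against its environment; the toy arithmetic executor only stands in for that step.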
Related papers
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- Boosting Zero-Shot Crosslingual Performance using LLM-Based Augmentations with Effective Data Selection [23.575482348558904]
Large language models (LLMs) are very proficient text generators.
We leverage this capability to generate task-specific data via zero-shot prompting.
We observe significant performance gains across sentiment analysis and natural language inference tasks.
arXiv Detail & Related papers (2024-07-15T10:00:22Z)
- MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic [6.46176287368784]
We propose Model Exclusive Task Arithmetic for merging GPT-scale models.
Our proposed MetaGPT is data-agnostic and bypasses the heavy search process, making it cost-effective and easy to implement for LLMs.
arXiv Detail & Related papers (2024-06-17T10:12:45Z)
- CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
- Exploring Large Language Models for Code Explanation [3.2570216147409514]
Large Language Models (LLMs) have made remarkable strides in Natural Language Processing.
This study specifically delves into the task of generating natural-language summaries for code snippets, using various LLMs.
arXiv Detail & Related papers (2023-10-25T14:38:40Z)
- CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study, with results on different datasets, shows CoAnnotating to be an effective means of allocating work, with up to 21% performance improvement over a random baseline.
arXiv Detail & Related papers (2023-10-24T08:56:49Z)
- GenSim: Generating Robotic Simulation Tasks via Large Language Models [34.79613485106202]
GenSim aims to automatically generate rich simulation environments and expert demonstrations.
We use GPT4 to expand the existing benchmark by ten times to over 100 tasks.
With minimal sim-to-real adaptation, multitask policies pretrained on GPT4-generated simulation tasks exhibit stronger transfer to unseen long-horizon tasks in the real world.
arXiv Detail & Related papers (2023-10-02T17:23:48Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
- Multitask Prompted Training Enables Zero-Shot Task Generalization [70.12770442071657]
We develop a system for mapping general natural language tasks into a human-readable prompted form.
We fine-tune a pretrained encoder-decoder model on this multitask mixture covering a wide variety of tasks.
The model attains strong zero-shot performance on several standard datasets, often outperforming models 16x its size.
arXiv Detail & Related papers (2021-10-15T17:08:57Z)
- Exploring Versatile Generative Language Model Via Parameter-Efficient Transfer Learning [70.81910984985683]
We propose an effective way to fine-tune multiple down-stream generation tasks simultaneously using a single, large pre-trained model.
The experiments on five diverse language generation tasks show that by just using an additional 2-3% parameters for each task, our model can maintain or even improve the performance of fine-tuning the whole model.
arXiv Detail & Related papers (2020-04-08T06:18:44Z)